Date: 2022-02-02

A Node.js scraper for humans.

Want to save time, or not using Node.js at all? Try our hosted API.

☁️ Installation

    # Using npm
    npm install --save scrape-it

    # Using yarn
    yarn add scrape-it

💡 ProTip: You can install the CLI version of this module by running npm install --global scrape-it-cli (or yarn global add scrape-it-cli).

FAQ

Here are some frequent questions and their answers.

1. How to scrape ajax pages?

scrape-it has only a simple request module for making requests. That means you cannot directly parse ajax pages with it, but in general you will run into one of these scenarios:

- The ajax response is in JSON format. In this case, you can make the request directly, without needing a scraping library.
- The ajax response gives you HTML back. Instead of calling the main website (e.g. example.com), pass the ajax url (e.g. example.com/api/that-endpoint) to scrape-it and you will be able to parse the response.
- The ajax request is so complicated that you don't want to reverse-engineer it. In this case, use a headless browser (e.g. Google Chrome, Electron, PhantomJS) to load the content, and then use the .scrapeHTML method from scrape-it once you get the HTML loaded on the page.
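The first two scenarios could look roughly like the sketch below. It is not part of scrape-it: the helper names loadJsonEndpoint and scrapeAjaxFragment are hypothetical, the endpoint is the placeholder from the text, and the default fetch assumes Node.js 18 or later. The scrapeIt function is passed in as an argument so the helpers can also be tried with a stub.

```javascript
// Scenario 1: the ajax endpoint answers with JSON, so no scraping is needed.
// `fetchImpl` defaults to the global fetch (assumes Node.js 18+).
async function loadJsonEndpoint (url, fetchImpl = fetch) {
    const res = await fetchImpl(url)
    if (!res.ok) {
        throw new Error(`Request to ${url} failed with status ${res.status}`)
    }
    return res.json()
}

// Scenario 2: the ajax endpoint answers with HTML. Pass the ajax URL,
// not the main page URL, to scrape-it together with the usual schema.
// `scrapeIt` is expected to be the function exported by require("scrape-it").
async function scrapeAjaxFragment (scrapeIt, ajaxUrl, schema) {
    const { data } = await scrapeIt(ajaxUrl, schema)
    return data
}

// Example usage (hypothetical endpoint and selectors):
// const scrapeIt = require("scrape-it")
// scrapeAjaxFragment(scrapeIt, "https://example.com/api/that-endpoint", {
//     items: { listItem: ".item" }
// }).then(console.log)
```

The third scenario follows the same idea: once a headless browser has produced the rendered HTML, that string is what gets handed to .scrapeHTML.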

2. Crawling

There is no fancy way to crawl pages with scrape-it. For simple scenarios, you can parse the list of URLs from the initial page and then, using Promises, parse each page. Alternatively, you can use a different crawler to download the website and then use the .scrapeHTML method to scrape the local files.
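A minimal sketch of the simple scenario, assuming the initial page lists its links in li.page elements (the selectors are borrowed from the Example section and may not match your site). The crawl helper is hypothetical and takes scrapeIt as an argument, so it can be tried with a stub instead of real requests.

```javascript
// Schema for the initial page: collects the href of every page link.
// The selectors are assumptions about the target site's markup.
const linkSchema = {
    pages: {
        listItem: "li.page",
        data: {
            url: { selector: "a", attr: "href" }
        }
    }
}

// `scrapeIt` is expected to be the function exported by require("scrape-it").
async function crawl (scrapeIt, startUrl, pageSchema) {
    // 1. Parse the list of URLs from the initial page.
    const { data: index } = await scrapeIt(startUrl, linkSchema)

    // 2. Resolve relative links against the start URL and drop duplicates.
    const urls = [...new Set(
        index.pages
            .filter(page => page.url)
            .map(page => new URL(page.url, startUrl).href)
    )]

    // 3. Scrape all pages in parallel using Promises.
    const results = await Promise.all(
        urls.map(url => scrapeIt(url, pageSchema))
    )

    return results.map(({ data }, i) => ({ url: urls[i], data }))
}

// Example usage:
// const scrapeIt = require("scrape-it")
// crawl(scrapeIt, "https://ionicabizau.net", { title: ".header h1" })
//     .then(console.log)
```

Promise.all starts every request at once, which is fine for a handful of pages; for larger sites you would probably want to process the URLs in small batches.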

3. Local files

Use the .scrapeHTML method to parse HTML read from local files with fs.readFile.
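As a rough sketch, reading a file and scraping it might look like this. The scrapeLocalFile helper is hypothetical, and it assumes the HTML is loaded with cheerio.load before being passed to .scrapeHTML, since the documentation describes the first argument as a Cheerio object. Both modules are passed in as arguments so the helper can be tried with stubs.

```javascript
const fs = require("fs/promises")

// `scrapeIt` and `cheerio` are expected to be the modules returned by
// require("scrape-it") and require("cheerio").
async function scrapeLocalFile ({ scrapeIt, cheerio }, filePath, schema) {
    // Read the saved page from disk.
    const html = await fs.readFile(filePath, "utf8")

    // Load the markup so that .scrapeHTML receives a Cheerio instance.
    const $ = cheerio.load(html)

    return scrapeIt.scrapeHTML($, schema)
}

// Example usage (hypothetical file name):
// const scrapeIt = require("scrape-it")
// const cheerio = require("cheerio")
// scrapeLocalFile({ scrapeIt, cheerio }, "./page.html", { title: "h1" })
//     .then(console.log)
```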

📋 Example

    const scrapeIt = require("scrape-it")

    // Promise interface
    scrapeIt("https://ionicabizau.net", {
        title: ".header h1",
        desc: ".header h2",
        avatar: {
            selector: ".header img",
            attr: "src"
        }
    }).then(({ data, response }) => {
        console.log(`Status Code: ${response.statusCode}`)
        console.log(data)
    })

    // Callback interface
    scrapeIt("https://ionicabizau.net", {
        // Fetch the articles
        articles: {
            listItem: ".article",
            data: {
                // Convert the article date into a Date object
                createdAt: {
                    selector: ".date",
                    convert: x => new Date(x)
                },
                title: "a.article-title",
                // Nested list
                tags: {
                    listItem: ".tags > span"
                },
                // Get the content as HTML
                content: {
                    selector: ".article-content",
                    how: "html"
                },
                // Read the class attribute of the list item itself
                classes: {
                    attr: "class"
                }
            }
        },
        // Fetch the blog pages
        pages: {
            listItem: "li.page",
            name: "pages",
            data: {
                title: "a",
                url: {
                    selector: "a",
                    attr: "href"
                }
            }
        },
        // Fetch some other data from the page
        title: ".header h1",
        desc: ".header h2",
        avatar: {
            selector: ".header img",
            attr: "src"
        }
    }, (err, result) => {
        // Only read the data when there is no error
        console.log(err || result.data)
    })

❓ Get Help

There are a few ways to get help:

- Please post questions on Stack Overflow. You can open issues with questions, as long as you add a link to your Stack Overflow question.
- For bug reports and feature requests, open issues. 🐛
- For direct and quick help, you can use Codementor. 🚀

📝 Documentation

scrapeIt(url, opts, cb)

A scraping module for humans.

Params

- String|Object url: The page url or request options.
- Object opts: The options passed to the scrapeHTML method.
- Function cb: The callback function.

Return

Promise A promise object resolving with:

- data (Object): The scraped data.
- $ (Function): The Cheerio function. This may be handy to do some other manipulation on the DOM, if needed.
- response (Object): The response object.
- body (String): The raw body as a string.
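The sketch below shows one way the four resolved fields might be used together. The summarizePage helper and the returned field names are hypothetical; scrapeIt is passed in as an argument so the helper can be exercised with a stub.

```javascript
// `scrapeIt` is expected to be the function exported by require("scrape-it").
async function summarizePage (scrapeIt, url) {
    const { data, $, response, body } = await scrapeIt(url, {
        title: ".header h1"
    })

    return {
        // The scraped data, shaped like the schema passed above.
        title: data.title,
        // `$` is the Cheerio function, so regular Cheerio queries work.
        linkCount: $("a").length,
        // The response object exposes the HTTP status code.
        statusCode: response.statusCode,
        // The raw body is a string; measure its size in bytes.
        bodyBytes: Buffer.byteLength(body, "utf8")
    }
}

// Example usage:
// const scrapeIt = require("scrape-it")
// summarizePage(scrapeIt, "https://ionicabizau.net").then(console.log)
```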


scrapeIt.scrapeHTML($, opts)

Scrapes the data in the provided element.

For the format of the selector, please refer to the Selectors section of the Cheerio library.

Params

Cheerio $ : The input element.

Object opts : An object containing the scraping information. If you want to scrape a list, you have to use the listItem selector:

- listItem (String): The list item selector.
- data (Object): The fields to include in the list objects:
  - <fieldName> (Object|String): The selector or an object containing:
    - selector (String): The selector.
    - convert (Function): An optional function to change the value.
    - how (Function|String): A function or function name to access the value.
    - attr (String): If provided, the value will be taken based on the attribute name.
    - trim (Boolean): If false, the value will not be trimmed (default: true).
    - closest (String): If provided, returns the first ancestor of the given element.
    - eq (Number): If provided, it will select the nth element.
    - texteq (Number): If provided, it will select the nth direct text child. Deep text child selection is not possible yet. Overwrites the how key.
    - listItem (Object): An object, keeping the recursive schema of the listItem object. This can be used to create nested lists.

Example:

    {
        articles: {
            listItem: ".article",
            data: {
                createdAt: {
                    selector: ".date",
                    convert: x => new Date(x)
                },
                title: "a.article-title",
                tags: {
                    listItem: ".tags > span"
                },
                content: {
                    selector: ".article-content",
                    how: "html"
                },
                traverseOtherNode: {
                    selector: ".upperNode",
                    closest: "div",
                    convert: x => x.length
                }
            }
        }
    }

If you want to collect specific data from the page, just use the same schema used for the data field.

Example:

    {
        title: ".header h1",
        desc: ".header h2",
        avatar: {
            selector: ".header img",
            attr: "src"
        }
    }
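To illustrate how the options combine, here is a sketch of a schema with a nested list that also uses convert, attr and trim. Every selector and field name in it is hypothetical, and the nesting assumes that a field inside data may itself carry a listItem key, as the tags field in the example above does.

```javascript
// Hypothetical selectors; adjust them to the markup being scraped.
const schema = {
    categories: {
        listItem: ".category",
        data: {
            name: "h2",
            // Nested list: each category holds its own list of products.
            products: {
                listItem: ".product",
                data: {
                    name: ".name",
                    // `convert` turns the extracted text into a number.
                    price: {
                        selector: ".price",
                        convert: x => Number.parseFloat(x.replace(/[^0-9.]/g, ""))
                    },
                    // `attr` reads an attribute instead of the text.
                    link: { selector: "a", attr: "href" },
                    // `trim: false` keeps the value untrimmed.
                    rawNote: { selector: ".note", trim: false }
                }
            }
        }
    }
}

module.exports = schema
```

Keeping the schema in its own module makes it easy to reuse the same object with both scrapeIt and scrapeIt.scrapeHTML.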



Return

Object The scraped data.

😋 How to contribute

Have an idea? Found a bug? See how to contribute.

💖 Support my projects

I open-source almost everything I can, and I try to reply to everyone needing help using these projects. Obviously, this takes time. You can integrate and use these projects in your applications for free! You can even change the source code and redistribute it (or even resell it).

However, if you get some profit from this or just want to encourage me to continue creating stuff, there are a few ways you can do it:

- Starring and sharing the projects you like 🚀
- I love books! I will remember you after years if you buy me one. 😁 📖
- You can make one-time donations via PayPal. I'll probably buy a tea. 🍵
- Set up a recurring monthly donation and you will get interesting news about what I'm doing (things that I don't share with everyone).
- Bitcoin: You can send me bitcoins at this address: 1P9BRsmazNQcuyTxEqveUsnf5CERdq35V6

Thanks! ❤️

💫 Where is this library used?

If you are using this library in one of your projects, add it to this list. ✨

@web-master/node-web-scraper

proxylist

mit-ocw-scraper

beervana-scraper

cnn-market

bandcamp-scraper

@tryghost/mg-webscraper

blockchain-notifier

dncli

degusta-scrapper

trump-cabinet-picks

cevo-lookup

camaleon

scrape-vinmonopolet

do-fn

university-news-notifier

selfrefactor

parn

picarto-lib

mix-dl

jishon

sahibinden

sahibindenServer

sgdq-collector

ubersetzung

ui-studentsearch

paklek-cli

egg-crawler

@thetrg/gibson

jobs-fetcher

fmgo-marketdata

rayko-tools

leximaven

codinglove-scraper

vandalen.rhyme.js

uniwue-lernplaetze-scraper

spon-market

macoolka-net-scrape

gatsby-source-bandcamp

salesforcerelease-parser

yu-ncov-scrape-dxy

rs-api

startpage-quick-search

helyesiras

covidau

3abn

scrape-it-cli

codementor

u-pull-it-ne-parts-finder

blankningsregistret

scrapos-worker

@ben-wormald/bandcamp-scraper

bible-scraper

flamescraper

fa.js

growapi

node-red-contrib-scrape-it

carirs

steam-workshop-scraper

macoolka-network

apixpress

📜 License

MIT © Ionică Bizău