Install

npm install huntsman --save

Example Script

    var huntsman = require( 'huntsman' );
    var spider = huntsman.spider();

    spider.extensions = [
      huntsman.extension( 'recurse' ), // extract links and add them to the queue
      huntsman.extension( 'cheerio' )  // parse html, provide jquery-style selectors
    ];

    // handle wikipedia pages whose uri ends in a 'Namespace:Title' style segment
    spider.on( /http:\/\/en\.wikipedia\.org\/wiki\/\w+:\w+$/, function ( err, res ) {

      // cheerio is only available when the response body is html
      var $ = res.extension.cheerio;
      if ( !$ ) return;

      // extract information from the page body
      var wikipedia = {
        uri: res.uri,
        heading: $( 'h1.firstHeading' ).text().trim(),
        body: $( 'div#mw-content-text p' ).text().trim()
      };

      console.log( wikipedia );
    });

    spider.queue.add( 'http://en.wikipedia.org/wiki/Huntsman_spider' );
    spider.start();

Example Output

    peter@edgy:/tmp$ node examples/html.js
    {
      "uri": "http://en.wikipedia.org/wiki/Wikipedia:Recent_additions",
      "heading": "Wikipedia:Recent additions",
      "body": "This is a selection of recently created new articles and greatly expanded former stub articles on Wikipedia that were featured on the Main Page as part of Did you know? You can submit new pages for consideration. (Archives are grouped by month of Main page appearance.)Tip: To find which archive contains the fact that appeared on Did You Know?, return to the article and click \"What links here\" to the left of the article. Then, in the dropdown menu provided for namespace, choose Wikipedia and click \"Go\". When you find \"Wikipedia:Recent additions\" and a number, click it and search for the article name.\n\n\n\nCurrent archive"
    }
    ... etc

More examples are available in the /examples directory.

How it works

Huntsman takes one or more 'seed' urls with the spider.queue.add() method.

Once the process is kicked off with spider.start(), it will take care of extracting links from the page and following only the pages we want.

To define which pages are crawled use the spider.on() function with a string or regular expression.

Each page will only be crawled once. If multiple regular expressions match the uri, they will all be called.

Page URLs which do not match an on condition will never be crawled.
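The matching rules above can be sketched in plain JavaScript without loading the library. This is an illustration only, not huntsman's implementation: the matches, on and dispatch helpers and the visited map are hypothetical names, and treating a string pattern as a substring test is an assumption.

```javascript
// Hypothetical helpers for illustration; none of this is huntsman's API.
var handlers = []; // registered { pattern, callback } pairs
var visited = {};  // uris that have already been handled

// Assumption: a string pattern is a substring test, a RegExp uses test().
function matches( pattern, uri ) {
  return pattern instanceof RegExp ? pattern.test( uri ) : uri.indexOf( pattern ) !== -1;
}

function on( pattern, callback ) {
  handlers.push({ pattern: pattern, callback: callback });
}

// Returns the number of callbacks invoked for this uri.
function dispatch( uri ) {
  if ( visited[ uri ] ) return 0; // each page is only handled once

  var matched = handlers.filter( function ( handler ) {
    return matches( handler.pattern, uri );
  });
  if ( !matched.length ) return 0; // no 'on' condition matched, so skip it

  visited[ uri ] = true;
  matched.forEach( function ( handler ) { handler.callback( uri ); } );
  return matched.length;
}

var calls = [];
on( '/wiki/', function ( uri ) { calls.push( 'string:' + uri ); } );
on( /Huntsman_spider$/, function ( uri ) { calls.push( 'regex:' + uri ); } );

dispatch( 'http://en.wikipedia.org/wiki/Huntsman_spider' ); // both callbacks run
dispatch( 'http://en.wikipedia.org/wiki/Huntsman_spider' ); // skipped, already visited
dispatch( 'http://example.com/other' );                      // skipped, no match

console.log( calls );
```

Both callbacks fire for the first uri because both patterns match it; the repeat visit and the unmatched uri trigger nothing.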

Configuration

The spider has default settings; you can override them by passing a settings object when you create a spider.

    // use the default settings
    var huntsman = require( 'huntsman' );
    var spider = huntsman.spider();

    // override the default throttle and timeout settings
    var huntsman = require( 'huntsman' );
    var spider = huntsman.spider({
      throttle: 10,
      timeout: 5000
    });

Crawling a site

How you configure your spider will vary from site to site. Generally you will only be looking for pages with a specific url format.

Scrape product information from amazon

In this example we can see that amazon product uris all seem to share the format '/gp/product/'.

After queueing the seed uri http://www.amazon.co.uk/, huntsman will follow all the product pages it finds recursively.

    var huntsman = require( 'huntsman' );
    var spider = huntsman.spider();

    spider.extensions = [
      huntsman.extension( 'recurse' ), // extract links and add them to the queue
      huntsman.extension( 'cheerio' )  // parse html, provide jquery-style selectors
    ];

    // target only product uris
    spider.on( '/gp/product/', function ( err, res ) {

      // cheerio is only available when the response body is html
      if ( !res.extension.cheerio ) return;
      var $ = res.extension.cheerio;

      // extract product information
      var product = {
        uri: res.uri,
        heading: $( 'h1.parseasinTitle' ).text().trim(),
        image: $( 'img#main-image' ).attr( 'src' ),
        description: $( '#productDescription' ).text().trim().substr( 0, 50 )
      };

      console.log( product );
    });

    spider.queue.add( 'http://www.amazon.co.uk/' );
    spider.start();

Find pets for sale on craigslist in london

More complex crawls may require you to specify hub pages to follow before you can get to the content you really want. You can add an on event without a callback & huntsman will still follow and extract links from it.

    var huntsman = require( 'huntsman' );
    var spider = huntsman.spider({
      throttle: 2
    });

    spider.extensions = [
      huntsman.extension( 'recurse' ), // extract links and add them to the queue
      huntsman.extension( 'cheerio' ), // parse html, provide jquery-style selectors
      huntsman.extension( 'stats' )    // display crawl statistics
    ];

    // target only pet listing uris
    spider.on( /\/pet\/(\w+)\.html$/, function ( err, res ) {

      // cheerio is only available when the response body is html
      if ( !res.extension.cheerio ) return;
      var $ = res.extension.cheerio;

      // extract listing information
      var listing = {
        heading: $( 'h2.postingtitle' ).text().trim(),
        uri: res.uri,
        image: $( 'img#iwi' ).attr( 'src' ),
        description: $( '#postingbody' ).text().trim().substr( 0, 50 )
      };

      console.log( listing );
    });

    // hub pages: no callback, but their links are still extracted and followed
    spider.on( /http:\/\/london\.craigslist\.co\.uk$/ );
    spider.on( /\/pet$/ );

    spider.queue.add( 'http://www.craigslist.org/about/sites' );
    spider.start();
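To see how the three patterns in this example divide the work, the sketch below runs them against a handful of uris. The sample uris and the classify helper are made up for illustration and are not part of huntsman; only the regular expressions come from the example.

```javascript
// The three patterns used in the craigslist example.
var listing = /\/pet\/(\w+)\.html$/;                  // content pages, handled by a callback
var cityHub = /http:\/\/london\.craigslist\.co\.uk$/; // hub page, followed without a callback
var petHub = /\/pet$/;                                // hub page, followed without a callback

// Hypothetical sample uris, invented for this illustration.
var samples = [
  'http://london.craigslist.co.uk',
  'http://london.craigslist.co.uk/pet',
  'http://london.craigslist.co.uk/pet/1234567890.html',
  'http://paris.craigslist.org'
];

// Hypothetical helper: label a uri by the kind of pattern it matches.
function classify( uri ) {
  if ( listing.test( uri ) ) return 'listing';
  if ( cityHub.test( uri ) || petHub.test( uri ) ) return 'hub';
  return 'ignored';
}

var labels = samples.map( classify );
samples.forEach( function ( uri, i ) {
  console.log( labels[ i ], uri );
});
```

Only the london site root and its /pet index act as hubs, so the crawl reaches the listings without wandering into other cities.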

Extensions

Extensions have default settings; you can override them by passing an optional second argument when the extension is loaded.

spider.extensions = [ huntsman.extension( 'extension_name' , options ) ];

recurse

This extension extracts links from html pages and then adds them to the queue.

The default patterns only target anchor tags which use the http protocol. You can change any of the default patterns by declaring them when the extension is loaded.

    huntsman.extension( 'recurse', {
      pattern: {
        search: /a([^>]+)href\s?=\s?['"]([^"'#]+)/gi,
        refine: /['"]([^"'#]+)$/,
        filter: /^https?:\/\//
      }
    })

search must be a global regexp and is used to target the links we want to extract.

refine is a regexp used to extract the bits we want from the search regex matches.

filter is a regexp that must match or links are discarded.
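The roles of the three patterns can be sketched as a small pipeline in plain JavaScript. The regular expressions are the defaults shown above; the extractLinks function and the sample html are hypothetical, and huntsman's own extraction code may differ in detail.

```javascript
// The default patterns shown above.
var pattern = {
  search: /a([^>]+)href\s?=\s?['"]([^"'#]+)/gi,
  refine: /['"]([^"'#]+)$/,
  filter: /^https?:\/\//
};

// Hypothetical helper: search finds candidates, refine trims each one
// down to the uri, filter discards anything that does not match.
function extractLinks( html, pattern ) {
  var found = html.match( pattern.search ) || [];
  return found
    .map( function ( chunk ) {
      var refined = chunk.match( pattern.refine );
      return refined ? refined[ 1 ] : null;
    })
    .filter( function ( uri ) {
      return uri !== null && pattern.filter.test( uri );
    });
}

// Made-up html for illustration.
var html =
  '<a href="http://example.com/one">One</a>' +
  '<a class="x" href=\'https://example.com/two#frag\'>Two</a>' +
  '<a href="/relative">Rel</a>';

var links = extractLinks( html, pattern );
console.log( links );
```

In this sketch the fragment on the second link is dropped because the search pattern stops at '#', and the relative link is discarded by the filter because the sketch does no uri resolution.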

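    // extract links from both anchor tags and script tags
    huntsman.extension( 'recurse', {
      pattern: {
        search: /(a([^>]+)href|script([^>]+)src)\s?=\s?['"]([^"'#]+)/gi
      }
    })

    // stop at '?' so query strings are left out of the extracted uris
    huntsman.extension( 'recurse', {
      pattern: {
        search: /a([^>]+)href\s?=\s?['"]([^"'#\?]+)/gi
      }
    })

    // discard uris ending in .pdf, .png, .jpg, .gif or .zip
    huntsman.extension( 'recurse', {
      pattern: {
        filter: /^https?:\/\/.*(?!\.(pdf|png|jpg|gif|zip))....$/i
      }
    })

    // discard uris ending in a dot followed by three word characters
    huntsman.extension( 'recurse', {
      pattern: {
        filter: /^https?:\/\/.*(?!\.\w{3})....$/
      }
    })

    // only keep uris that start with http(s)://www.example.com
    huntsman.extension( 'recurse', {
      pattern: {
        filter: /^https?:\/\/www\.example\.com/i
      }
    })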
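The filter patterns above are easier to trust once they have been run against a few uris. The sketch below does that in plain JavaScript; the variable names, the keep helper and the sample uris are made up for illustration.

```javascript
// The three filter patterns from the examples above.
var skipSomeExtensions = /^https?:\/\/.*(?!\.(pdf|png|jpg|gif|zip))....$/i;
var skipThreeLetterExtensions = /^https?:\/\/.*(?!\.\w{3})....$/;
var stayOnDomain = /^https?:\/\/www\.example\.com/i;

// Hypothetical sample uris.
var uris = [
  'http://www.example.com/about',
  'http://www.example.com/report.pdf',
  'http://www.example.com/page.htm',
  'http://other.example.org/about'
];

// Hypothetical helper: return the uris a filter would let through.
function keep( filter ) {
  return uris.filter( function ( uri ) { return filter.test( uri ); } );
}

console.log( keep( skipSomeExtensions ) );        // drops report.pdf only
console.log( keep( skipThreeLetterExtensions ) ); // drops report.pdf and page.htm
console.log( keep( stayOnDomain ) );              // drops other.example.org
```

One thing worth checking before using the three-letter filter: it also rejects a bare uri such as http://www.example.com, because the trailing '.com' looks like a three-letter extension to the pattern.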

By default recurse converts relative urls to absolute urls and strips fragment identifiers and trailing slashes.

If you need even more control you can override the resolver & normaliser functions to modify these behaviours.
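As a rough idea of what such functions do, here is a sketch built on Node's built-in URL class. The resolve and normalise functions below are hypothetical stand-ins written for this illustration; how huntsman expects replacements to be supplied is not covered in this document, so that part is left out.

```javascript
// Hypothetical stand-ins; huntsman's own resolver and normaliser may differ.

// Convert a relative uri to an absolute one, using the page it was found on.
function resolve( uri, baseUri ) {
  return new URL( uri, baseUri ).href;
}

// Strip the fragment identifier and any trailing slashes.
function normalise( uri ) {
  var url = new URL( uri );
  url.hash = '';
  return url.href.replace( /\/+$/, '' );
}

var page = 'http://example.com/docs/index.html';
var absolute = resolve( '../about/#team', page );
var clean = normalise( absolute );

console.log( absolute );
console.log( clean );
```

A custom normaliser might, for example, also lowercase the path or drop tracking parameters so that equivalent pages are not crawled twice.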

cheerio

This extension parses html and provides jquery-style selectors & functions.

huntsman.extension( 'cheerio' , { lowerCaseTags : true } )

The res.extension.cheerio function is available in your on callbacks when the response body is HTML.

    spider.on( 'example.com', function ( err, res ) {

      // use jquery-style selectors & functions
      var $ = res.extension.cheerio;
      if ( !$ ) return;

      console.log( res.uri, $( 'h1' ).text().trim() );
    });

cheerio reference: https://github.com/MatthewMueller/cheerio

json

This extension parses the response body with JSON.parse().

huntsman.extension( 'json' )

The res.extension.json function is available in your on callbacks when the response body is json.

    spider.on( 'example.com', function ( err, res ) {

      var json = res.extension.json;
      if ( !json ) return;

      console.log( res.uri, json );
    });

links

This extension extracts links from html pages and returns the result.

It exposes the same functionality that the recurse extension uses to extract links.

huntsman.extension( 'links' )

The res.extension.links function is available in your on callbacks when the response body is a string.

    spider.on( 'example.com', function ( err, res ) {

      if ( !res.extension.links ) return;

      // extract all image src uris from the page
      var images = res.extension.links({
        pattern: {
          search: /(img([^>]+)src)\s?=\s?['"]([^"'#]+)/gi,
          filter: /\.jpg|\.gif|\.png/i
        }
      });

      console.log( images );
    });

stats

This extension displays statistics about pages crawled, error counts etc.

huntsman.extension( 'stats' , { tail : false } )

Custom queues and response storage adapters

I'm currently working on being able to persist the job queue via something like redis and potentially caching http responses in mongo with a TTL.

If you live life on the wild side, these adapters can be configured when you create a spider.

Pull requests welcome.

License

(The MIT License)

Copyright (c) 2013 Peter Johnson <@insertcoffee>

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the 'Software'), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.