Installation

    $ npm install article-parser
    $ pnpm install article-parser
    $ yarn add article-parser

Usage

    // CommonJS
    const { extract } = require('article-parser')

    // or, as an ES module (use one form, not both):
    // import { extract } from 'article-parser'

    const url = 'https://dev.to/ndaidong/how-to-make-your-mongodb-container-more-secure-1646'

    extract(url).then((article) => {
      console.log(article)
    }).catch((err) => {
      console.trace(err)
    })

Result:

    {
      url: 'https://dev.to/ndaidong/how-to-make-your-mongodb-container-more-secure-1646',
      title: 'How to make your MongoDB container more secure?',
      description: 'Start it with docker The most simple way to get MongoDB instance in your machine is using...',
      links: [
        'https://dev.to/ndaidong/how-to-make-your-mongodb-container-more-secure-1646'
      ],
      image: 'https://res.cloudinary.com/practicaldev/image/fetch/s--qByI1v3K--/c_imagga_scale,f_auto,fl_progressive,h_500,q_auto,w_1000/https://dev-to-uploads.s3.amazonaws.com/i/p4sfysev3s1jhw2ar2bi.png',
      content: '...',
      author: '@ndaidong',
      source: 'dev.to',
      published: '',
      ttr: 162
    }

APIs

extract(String url | String html)

Loads and extracts article data from a URL or an HTML string. Returns a Promise.

Example:

    const { extract } = require('article-parser')

    const getArticle = async (url) => {
      try {
        const article = await extract(url)
        return article
      } catch (err) {
        console.trace(err)
        return null
      }
    }

    getArticle('https://domain.com/path/to/article')

If the extraction succeeds, you should get an article object with the following structure:

    {
      "url": URI String,
      "title": String,
      "description": String,
      "image": URI String,
      "author": String,
      "content": HTML String,
      "published": Date String,
      "source": String,
      "links": Array,
      "ttr": Number
    }
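As a quick sanity check against the structure above, the sketch below lists which of the documented fields are absent from a result. The helper name `missingFields` is hypothetical and not part of article-parser; only the field names are taken from the structure above.

```javascript
// Field names copied from the documented article structure.
const ARTICLE_FIELDS = [
  'url', 'title', 'description', 'image', 'author',
  'content', 'published', 'source', 'links', 'ttr'
]

// Hypothetical helper (not part of article-parser): returns the documented
// field names that are not present on the given object.
const missingFields = (article) => {
  if (article === null || typeof article !== 'object') {
    return [...ARTICLE_FIELDS]
  }
  return ARTICLE_FIELDS.filter((key) => !(key in article))
}
```

Note that a field holding an empty value, such as `published: ''` in the sample result, still counts as present here.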

Use addQueryRules() to add custom rules for locating the main article on specific domains.

This can be useful when the default extraction algorithm fails, or when you want to remove some parts of the main article content.

Example:

    const { addQueryRules, extract } = require('article-parser')

    // first attempt, using the default extraction only
    extract('https://bad-website.domain/page/article')

    // add rules for bad-website.domain
    addQueryRules([
      {
        patterns: [
          /http(s?):\/\/bad-website.domain\/*/
        ],
        selector: '#noop_article_locates_here',
        unwanted: [
          '.advertise-area',
          '.stupid-banner'
        ]
      }
    ])

    // second attempt, now with the custom rules in place
    extract('https://bad-website.domain/page/article')
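Before registering a rule, it can help to confirm that its patterns actually match the URLs you care about. The sketch below does that with plain regular expression tests. `findRule` is a hypothetical helper written for this illustration and does not claim to mirror how article-parser selects rules internally.

```javascript
// Hypothetical helper: returns the first rule with a pattern matching the URL,
// or null when no rule applies. Patterns are assumed to have no `g` flag,
// so RegExp.prototype.test() keeps no state between calls.
const findRule = (rules, url) => {
  return rules.find((rule) => rule.patterns.some((re) => re.test(url))) || null
}

// The same rule as in the example above.
const rules = [
  {
    patterns: [/http(s?):\/\/bad-website.domain\/*/],
    selector: '#noop_article_locates_here',
    unwanted: ['.advertise-area', '.stupid-banner']
  }
]

const matched = findRule(rules, 'https://bad-website.domain/page/article')
console.log(matched ? matched.selector : 'no custom rule')
```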

While adding rules, you can specify a transform() function to fine-tune article content more thoroughly.

Example rule with transformation:

    const { addQueryRules } = require('article-parser')

    addQueryRules([
      {
        patterns: [
          /http(s?):\/\/bad-website.domain\/*/
        ],
        selector: '#article_id_here',
        transform: ($) => {
          // with $ is cheerio's DOM instance which contains article content
          $('h1').replaceWith(function () {
            const h1Html = $(this).html()
            return `<b>${h1Html}</b>`
          })
          return $
        }
      }
    ])

Please refer to cheerio's docs for more info.

Configuration methods

In addition, this library provides methods to read and customize its default settings. Leave them alone unless you have a specific reason to change them.

getParserOptions()

setParserOptions(Object parserOptions)

getRequestOptions()

setRequestOptions(Object requestOptions)

getSanitizeHtmlOptions()

setSanitizeHtmlOptions(Object sanitizeHtmlOptions)

Here are default properties/values:

Object parserOptions:

    {
      wordsPerMinute: 300,
      urlsCompareAlgorithm: 'levenshtein',
      descriptionLengthThreshold: 40,
      descriptionTruncateLen: 156,
      contentLengthThreshold: 200
    }

Read string-comparison docs for more info about urlsCompareAlgorithm.
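To make wordsPerMinute concrete, here is a sketch of how a time-to-read figure could be derived from a word count. It assumes that ttr is expressed in seconds and is simply the word count divided by the reading speed; neither assumption is confirmed here, and `estimateTtr` is a hypothetical function, not the library's implementation.

```javascript
// Hypothetical estimate (assumption: ttr is in seconds).
// Counts whitespace-separated words and converts them to reading time.
const estimateTtr = (text, wordsPerMinute = 300) => {
  const words = text.trim().split(/\s+/).filter(Boolean).length
  // Multiply before dividing to keep the arithmetic on whole numbers.
  return Math.ceil((words * 60) / wordsPerMinute)
}
```

Under these assumptions, lowering wordsPerMinute increases the estimate for the same text.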

Object requestOptions:

    {
      headers: {
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0',
        accept: 'text/html; charset=utf-8'
      },
      responseType: 'text',
      responseEncoding: 'utf8',
      timeout: 6e4,
      maxRedirects: 3
    }

Read axios' request config for more info.
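When only one or two request settings need to change, a cautious approach is to build a complete options object from the defaults and pass that to setRequestOptions(). Whether setRequestOptions() merges partial objects is not stated here, so this sketch merges by hand. `withOverrides` is a hypothetical helper, and the `accept-language` header is only an illustrative override.

```javascript
// Defaults copied from the requestOptions listing.
const defaultRequestOptions = {
  headers: {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0',
    accept: 'text/html; charset=utf-8'
  },
  responseType: 'text',
  responseEncoding: 'utf8',
  timeout: 6e4,
  maxRedirects: 3
}

// Hypothetical helper: shallow-merges top-level keys and merges headers
// separately so default headers are kept. The defaults are not mutated.
const withOverrides = (defaults, overrides = {}) => ({
  ...defaults,
  ...overrides,
  headers: { ...defaults.headers, ...(overrides.headers || {}) }
})

const requestOptions = withOverrides(defaultRequestOptions, {
  timeout: 1e4, // 10 seconds instead of the default 60 (axios timeouts are in milliseconds)
  headers: { 'accept-language': 'en-US' }
})
```

The resulting object could then be passed as `setRequestOptions(requestOptions)`.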

Object sanitizeHtmlOptions:

    {
      allowedTags: [
        'h1', 'h2', 'h3', 'h4', 'h5',
        'u', 'b', 'i', 'em', 'strong', 'small', 'sup', 'sub',
        'div', 'span', 'p', 'article', 'blockquote', 'section',
        'details', 'summary',
        'pre', 'code',
        'ul', 'ol', 'li', 'dd', 'dl',
        'table', 'th', 'tr', 'td', 'thead', 'tbody', 'tfood',
        'fieldset', 'legend',
        'figure', 'figcaption', 'img', 'picture',
        'video', 'audio', 'source',
        'iframe',
        'progress',
        'br', 'p', 'hr',
        'label',
        'abbr',
        'a',
        'svg'
      ],
      allowedAttributes: {
        a: ['href', 'target', 'title'],
        abbr: ['title'],
        progress: ['value', 'max'],
        img: ['src', 'srcset', 'alt', 'width', 'height', 'style', 'title'],
        picture: ['media', 'srcset'],
        video: ['controls', 'width', 'height', 'autoplay', 'muted'],
        audio: ['controls'],
        source: ['src', 'srcset', 'data-srcset', 'type', 'media', 'sizes'],
        iframe: ['src', 'frameborder', 'height', 'width', 'scrolling'],
        svg: ['width', 'height']
      },
      allowedIframeDomains: ['youtube.com', 'vimeo.com']
    }

Read sanitize-html docs for more info.
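The allowedIframeDomains entry limits which iframe sources are kept in the sanitized content. The sketch below illustrates the general idea of such a check, accepting a listed domain and its subdomains. It is a simplified illustration rather than sanitize-html's actual code, and `isAllowedIframe` is a hypothetical name.

```javascript
// Default value copied from sanitizeHtmlOptions.
const allowedIframeDomains = ['youtube.com', 'vimeo.com']

// Hypothetical check: true when the iframe src points at an allowed domain
// or one of its subdomains. Relative or malformed URLs are rejected.
const isAllowedIframe = (src, domains = allowedIframeDomains) => {
  let hostname
  try {
    hostname = new URL(src).hostname
  } catch (err) {
    return false // not an absolute URL
  }
  return domains.some((d) => hostname === d || hostname.endsWith(`.${d}`))
}
```

Matching on the parsed hostname, rather than searching the whole URL string, avoids accepting addresses that merely contain an allowed domain in their path.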

Test

    git clone https://github.com/ndaidong/article-parser.git
    cd article-parser
    npm install
    npm test

    npm run eval {URL_TO_PARSE_ARTICLE}

