License

MIT

Topick

One trick pony NLP library for extracting keywords from HTML documents. It uses htmlparser2 for HTML parsing, nlp_compromise for NLP and text-miner for text cleaning and removing stop words.

Topick has been used in a Telegram bot to autosuggest tags for links mentioned in a conversation.

Installation

npm install topick

Topick is intended primarily for server-side use because of cross-domain issues, although I'm working on making the codebase isomorphic so that browser use is possible as well (with an appropriate module loader such as webpack).

Usage

The simplest way to use Topick:

import Topick from 'topick'

Topick.getKeywords('http://example.com/').then((keywords) => {
  console.log(keywords); // ['most relevant keyword', 'very relevant keyword', 'somewhat relevant keyword']
  // do something with your keywords
})

The keywords are arranged in order of decreasing relevance.

getKeywords(uri[, opts, cb])

Options

getKeywords takes either a valid HTTP URI or an HTML string and returns a promise that resolves with an array of keywords. It also accepts an optional options object:

Topick.getKeywords('http://example.com/', {
  htmlTags: ['p'],
  ngram: {
    min_count: 4,
    max_size: 2
  }
}).then((keywords) => {
  console.log(keywords);
})
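The example above overrides only two options and leaves the rest at their defaults. As a rough sketch of how a partial options object could be merged with the defaults documented in this README: DEFAULT_OPTIONS and resolveOptions are hypothetical names, not part of Topick's API, and merging the nested ngram object key by key is an assumption.

```javascript
// Defaults as documented in this README. The constant name is hypothetical.
const DEFAULT_OPTIONS = {
  htmlTags: ['p', 'b', 'em', 'title'],
  method: 'combined',
  useDefaultStopWords: true,
  customStopWords: [],
  maxNumberOfKeywords: 10,
  minKeywordLength: 3,
  ngram: { min_count: 3, max_size: 1 },
  progressiveGeneration: true
};

// Hypothetical helper: user options win over defaults.
// Assumption: the nested ngram object is merged key by key.
function resolveOptions(opts = {}) {
  return {
    ...DEFAULT_OPTIONS,
    ...opts,
    ngram: { ...DEFAULT_OPTIONS.ngram, ...(opts.ngram || {}) }
  };
}

const resolved = resolveOptions({
  htmlTags: ['p'],
  ngram: { min_count: 4, max_size: 2 }
});
console.log(resolved.htmlTags);            // overridden by the caller
console.log(resolved.ngram);               // overridden by the caller
console.log(resolved.maxNumberOfKeywords); // falls back to the default
```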

Currently available options are:

htmlTags

Default: ['p', 'b', 'em', 'title']

An array of HTML tags that should be parsed.

method

Default: combined

Topick includes three methods for generating keywords.

ngram

Generates n-grams from the content string and ranks them in terms of frequency.

namedentities

Uses nlp_compromise's spot method to identify named entities before generating n-grams based on these named entities.

combined

Runs both ngram and namedentities methods, then combines their ranking.

useDefaultStopWords

Default: true

If true, uses Topick's internal stop words dictionary to remove stop words. If false, no stop word removal will be performed unless you supply your own stop word array (see customStopWords).

Topick's dictionary is the set union of six English stop word collections.

customStopWords

Default: []

An array of strings that should be used as stop words. This option is independent of useDefaultStopWords. If you set useDefaultStopWords to false, populate it with your own stop words; otherwise Topick will generate a lot of irrelevant keywords.
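To illustrate how the two stop word options interact, here is a small sketch. It does not use Topick's internals: DEFAULT_STOP_WORDS is a tiny stand-in for the real dictionary, buildStopWords and removeStopWords are hypothetical helpers, and combining both lists when both options are set is an assumption.

```javascript
// Tiny stand-in for Topick's internal dictionary (the real one is much larger).
const DEFAULT_STOP_WORDS = ['the', 'a', 'an', 'and', 'of', 'is', 'to', 'in'];

// Hypothetical helper: build the effective stop word set from the two options.
// Assumption: custom stop words are added on top of the defaults.
function buildStopWords({ useDefaultStopWords = true, customStopWords = [] } = {}) {
  const base = useDefaultStopWords ? DEFAULT_STOP_WORDS : [];
  return new Set([...base, ...customStopWords].map((word) => word.toLowerCase()));
}

// Hypothetical helper: drop every token found in the stop word set.
function removeStopWords(tokens, stopWords) {
  return tokens.filter((token) => !stopWords.has(token.toLowerCase()));
}

const tokens = 'the history of the telegram bot'.split(' ');

console.log(removeStopWords(tokens, buildStopWords()));
// → ['history', 'telegram', 'bot']

console.log(removeStopWords(tokens, buildStopWords({
  useDefaultStopWords: false,
  customStopWords: ['bot']
})));
// → ['the', 'history', 'of', 'the', 'telegram']
```

The second call shows why an empty or small custom list is risky when the defaults are turned off: common words such as "the" survive and can end up as keywords.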

maxNumberOfKeywords

Default: 10

Maximum number of keywords to generate.

minKeywordLength

Default: 3

Minimum length of generated keywords.

ngram

Default:

{ min_count: 3, max_size: 1 }

Defines options for n-gram generation.

min_count is the minimum number of times a particular n-gram should appear in the document before being considered. There should be no need to change this number.

max_size is the maximum size of n-grams that should be generated (defaults to generating unigrams).
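The effect of min_count and max_size can be sketched with a small frequency counter. This is an illustration only, not Topick's implementation: rankNgrams is a hypothetical helper, and the input is assumed to be already tokenized and cleaned.

```javascript
// Hypothetical helper: count all n-grams of size 1..max_size, keep those that
// appear at least min_count times, and return them most frequent first.
function rankNgrams(tokens, { min_count = 3, max_size = 1 } = {}) {
  const counts = new Map();
  for (let n = 1; n <= max_size; n++) {
    for (let i = 0; i + n <= tokens.length; i++) {
      const gram = tokens.slice(i, i + n).join(' ');
      counts.set(gram, (counts.get(gram) || 0) + 1);
    }
  }
  return [...counts.entries()]
    .filter(([, count]) => count >= min_count)
    .sort((a, b) => b[1] - a[1])
    .map(([gram]) => gram);
}

const words = 'keyword extraction keyword extraction keyword ranking'.split(' ');

// Default settings: unigrams only, at least 3 occurrences.
console.log(rankNgrams(words));
// → ['keyword']

// Weaker settings also admit bigrams and less frequent terms.
console.log(rankNgrams(words, { min_count: 2, max_size: 2 }));
```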

progressiveGeneration

Default: true

If set to true, Topick regenerates n-grams with progressively weaker settings until it reaches the number of keywords set in maxNumberOfKeywords.

For example, if Topick generates only 5 keywords with a min_count of 3 and a maxNumberOfKeywords of 10, progressiveGeneration decreases min_count to 2, and then to 1, until 10 keywords can be generated.

progressiveGeneration does not guarantee that maxNumberOfKeywords keywords will be generated. If the target cannot be reached even at a min_count of 1, fewer keywords are returned.
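The relaxation loop described above might look roughly like the following. This is a sketch under stated assumptions, not Topick's code: rankWords and progressiveKeywords are hypothetical helpers, only unigrams are counted, and min_count is lowered one step at a time.

```javascript
// Hypothetical unigram ranker: words that appear at least minCount times,
// most frequent first.
function rankWords(tokens, minCount) {
  const counts = new Map();
  for (const token of tokens) {
    counts.set(token, (counts.get(token) || 0) + 1);
  }
  return [...counts.entries()]
    .filter(([, count]) => count >= minCount)
    .sort((a, b) => b[1] - a[1])
    .map(([word]) => word);
}

// Sketch of progressive generation: start at min_count and lower it by one
// until enough keywords are found or min_count reaches 1.
function progressiveKeywords(tokens, { min_count = 3, maxNumberOfKeywords = 10 } = {}) {
  let keywords = [];
  for (let count = min_count; count >= 1; count--) {
    keywords = rankWords(tokens, count);
    if (keywords.length >= maxNumberOfKeywords) break;
  }
  return keywords.slice(0, maxNumberOfKeywords);
}

const tokens = 'node node node parser parser html'.split(' ');

// Target reached after relaxing min_count from 3 to 2.
console.log(progressiveKeywords(tokens, { min_count: 3, maxNumberOfKeywords: 2 }));
// → ['node', 'parser']

// Target of 5 cannot be reached even at min_count 1: only 3 distinct words exist.
console.log(progressiveKeywords(tokens, { min_count: 3, maxNumberOfKeywords: 5 }));
// → ['node', 'parser', 'html']
```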

Callback

In case you're not familiar with promises or are unable to use them, getKeywords also accepts a callback function as its last argument:

Topick.getKeywords("http://example.com", {
  customStopWords: []
}, (keywords) => {
  // The callback runs first
  console.log("This is the callback function");
  console.log(keywords);
})
.then((keywords) => {
  // The returned promise still resolves with the same keywords
  console.log("This is the promise");
  console.log(keywords);
})

// Output:
// "This is the callback function"
// ["cool keyword", "another cool keyword"]
// "This is the promise"
// ["cool keyword", "another cool keyword"]

Notice that regardless of whether a callback function is specified, getKeywords continues to return a Promise.
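A function that supports both styles can be sketched as follows. This shows the general pattern only, not Topick's source: extractKeywords is a hypothetical stand-in that resolves with a fixed list, and getKeywordsLike is a hypothetical name.

```javascript
// Hypothetical stand-in for the real extraction step.
function extractKeywords(source, opts) {
  return Promise.resolve(['cool keyword', 'another cool keyword']);
}

// Sketch of the dual callback/promise pattern: the callback, if given, is
// invoked with the keywords, and the promise resolves with the same value.
function getKeywordsLike(uri, opts = {}, cb) {
  return extractKeywords(uri, opts).then((keywords) => {
    if (typeof cb === 'function') cb(keywords);
    return keywords;
  });
}

const order = [];
const done = getKeywordsLike('http://example.com', {}, (keywords) => {
  order.push('callback');
  console.log('callback:', keywords);
}).then((keywords) => {
  order.push('promise');
  console.log('promise:', keywords);
  return keywords;
});
```

Because the callback is invoked inside the promise chain, it runs before any handler attached with then, which matches the output order shown above.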

getKeywordsSync(uri[, opts])

There are no plans to support a synchronous version of getKeywords.

getDomain(uri)

Example:

Topick.getDomain('http://example.com')

Given http://example.com, returns example. Removes URI scheme, port number, and TLD.
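For illustration, similar behaviour can be approximated with the standard URL class. This is a hypothetical re-implementation, not Topick's code: getDomainLike is an invented name, it treats only the last label as the TLD (so multi-part suffixes such as co.uk are not handled), and how Topick treats subdomains is not described here.

```javascript
// Hypothetical approximation of getDomain using the WHATWG URL class.
function getDomainLike(uri) {
  // hostname excludes the scheme, port, path, and query string
  const { hostname } = new URL(uri);
  const labels = hostname.split('.');
  // Assumption: the TLD is the single last label
  return labels.length > 1 ? labels.slice(0, -1).join('.') : hostname;
}

console.log(getDomainLike('http://example.com'));            // → 'example'
console.log(getDomainLike('https://example.org:8080/path')); // → 'example'
```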

Contributing

Contributions are welcome!

Topick is written in ES6 wherever possible. The development workflow is centered primarily around webpack, so be sure to check out webpack.config.js.
