LDA

Date: 2017-09-23

Latent Dirichlet allocation (LDA) topic modeling in JavaScript for Node.js. LDA is a machine learning algorithm that extracts topics and their related keywords from a collection of documents.

In LDA, a document may contain several different topics, each with its own related terms. The algorithm uses a probabilistic model to detect the specified number of topics and extract their related keywords. For example, a document may contain topics that could be classified as beach-related and weather-related. The beach topic may contain related words such as sand, ocean, and water. Similarly, the weather topic may contain related words such as sun, temperature, and clouds.
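The idea that one document mixes several topics can be illustrated with a toy sketch. It is not LDA inference: the two topics and their word lists are written by hand here (an assumption for illustration), whereas LDA would infer them from the documents. The hypothetical helper `countTopicWords` only counts how many words of a text fall under each topic.

```javascript
// Hypothetical, hand-written topics; LDA would infer these from the corpus.
var topics = {
  beach: ['sand', 'ocean', 'water'],
  weather: ['sun', 'temperature', 'clouds']
};

// Count how many words of a text belong to each topic's word list.
function countTopicWords(text, topics) {
  var words = text.toLowerCase().match(/[a-z]+/g) || [];
  var counts = {};
  Object.keys(topics).forEach(function (name) {
    counts[name] = words.filter(function (word) {
      return topics[name].indexOf(word) !== -1;
    }).length;
  });
  return counts;
}

var counts = countTopicWords('The sun warmed the sand and the ocean.', topics);
console.log(counts); // { beach: 2, weather: 1 }
```

The sentence touches both topics, which is the situation LDA is designed to model.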

See http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

$ npm install lda

Usage

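    var lda = require('lda');

    // Example document.
    var text = 'Cats are small. Dogs are big. Cats like to chase mice. Dogs like to eat bones.';

    // Split the text into sentences; each sentence is treated as one document.
    var documents = text.match(/[^\.!\?]+[\.!\?]+/g);

    // Run LDA to get 2 topics with 5 terms each.
    var result = lda(documents, 2, 5);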

The above example produces the following result with two topics (topic 1 is "cat-related", topic 2 is "dog-related"):

    Topic 1
    cats (0.21%)
    dogs (0.19%)
    small (0.1%)
    mice (0.1%)
    chase (0.1%)

    Topic 2
    dogs (0.21%)
    cats (0.19%)
    big (0.11%)
    eat (0.1%)
    bones (0.1%)
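The regular expression in the usage example splits the text at `.`, `!`, and `?`. The sketch below wraps it in a hypothetical helper, `splitSentences`, to show what the resulting array looks like. The trimming step and the fallback to an empty array are additions for illustration, not something the library requires.

```javascript
// Split text into sentences with the regular expression from the usage example,
// then trim the whitespace that the match leaves in front of each sentence.
function splitSentences(text) {
  // match() returns null when nothing matches, so fall back to an empty array.
  var matches = text.match(/[^\.!\?]+[\.!\?]+/g) || [];
  return matches.map(function (sentence) {
    return sentence.trim();
  });
}

var documents = splitSentences('Cats are small. Dogs are big. Cats like to chase mice. Dogs like to eat bones.');
console.log(documents.length); // 4
console.log(documents[1]);     // Dogs are big.
```

Text without any terminating punctuation produces no match at all, so the fallback avoids handing `null` to later code.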

Output

LDA returns an array of topics, each containing an array of terms. The result has the following format:

    [
      [
        { term: 'dogs', probability: 0.2 },
        { term: 'cats', probability: 0.2 },
        { term: 'small', probability: 0.1 },
        { term: 'mice', probability: 0.1 },
        { term: 'chase', probability: 0.1 }
      ],
      [
        { term: 'dogs', probability: 0.2 },
        { term: 'cats', probability: 0.2 },
        { term: 'bones', probability: 0.11 },
        { term: 'eat', probability: 0.1 },
        { term: 'big', probability: 0.099 }
      ]
    ]

The result can be traversed as follows:

    var result = lda(documents, 2, 5);

    // For each topic.
    for (var i in result) {
      var row = result[i];
      // for...in yields string keys, so convert the index before adding 1.
      console.log('Topic ' + (parseInt(i, 10) + 1));

      // For each term in the topic.
      for (var j in row) {
        var term = row[j];
        console.log(term.term + ' (' + term.probability + '%)');
      }

      console.log('');
    }
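As an alternative to the loops, the same structure can be walked with array methods, which supply numeric indexes directly. The sketch below is one possible approach: `summarizeTopics` is a hypothetical helper, and `sample` is hand-written data in the documented shape so that the block runs without the package installed.

```javascript
// Sample data in the documented shape: an array of topics, each an array of
// { term, probability } objects. The values are illustrative only.
var sample = [
  [{ term: 'cats', probability: 0.21 }, { term: 'dogs', probability: 0.19 }],
  [{ term: 'dogs', probability: 0.21 }, { term: 'big', probability: 0.11 }]
];

// Return one summary line per topic.
function summarizeTopics(result) {
  return result.map(function (topic, index) {
    var terms = topic.map(function (entry) {
      return entry.term + ' (' + entry.probability + ')';
    });
    return 'Topic ' + (index + 1) + ': ' + terms.join(', ');
  });
}

summarizeTopics(sample).forEach(function (line) {
  console.log(line);
});
// Topic 1: cats (0.21), dogs (0.19)
// Topic 2: dogs (0.21), big (0.11)
```

Returning strings instead of logging inside the helper keeps it usable for output other than the console.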

Additional Languages

LDA uses stop-words to ignore common terms in the text (for example: this, that, it, we). By default, the stop-words list uses English. To use additional languages, you can specify an array of language ids, as follows:

    // Use English (the default).
    result = lda(documents, 2, 5, ['en']);

    // Use German.
    result = lda(documents, 2, 5, ['de']);

    // Use English and German.
    result = lda(documents, 2, 5, ['en', 'de']);

To add a new language-specific stop-words list, create a file /lda/lib/stopwords_XX.js, where XX is the id for the language. For example, a French stop-words list could be named "stopwords_fr.js". The contents of the file should follow the format of an existing stop-words list. The format is as follows:

    exports.stop_words = ['cette', 'que', 'une', 'il'];
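To illustrate how several stop-word lists might be combined and applied, here is a minimal sketch. It is not the library's internal code: `stopWordModules`, `buildStopWordLookup`, and `removeStopWords` are hypothetical names, and the word lists are abbreviated stand-ins that only mirror the `stop_words` format shown above.

```javascript
// Hypothetical in-memory stand-ins for stop-word files; each mirrors the
// `exports.stop_words = [...]` format. The lists are abbreviated examples.
var stopWordModules = {
  en: { stop_words: ['this', 'that', 'it', 'we'] },
  fr: { stop_words: ['cette', 'que', 'une', 'il'] }
};

// Merge the lists for the requested language ids into one lookup table.
function buildStopWordLookup(languages) {
  var lookup = Object.create(null);
  languages.forEach(function (id) {
    var entry = stopWordModules[id];
    if (!entry) {
      throw new Error('No stop-words list for language: ' + id);
    }
    entry.stop_words.forEach(function (word) {
      lookup[word] = true;
    });
  });
  return lookup;
}

// Drop every token that appears in the lookup table (case-insensitive).
function removeStopWords(tokens, lookup) {
  return tokens.filter(function (token) {
    return lookup[token.toLowerCase()] !== true;
  });
}

var lookup = buildStopWordLookup(['en', 'fr']);
console.log(removeStopWords(['We', 'like', 'cette', 'maison'], lookup));
// [ 'like', 'maison' ]
```

A lookup table makes each membership check a single property access, which matters once the lists grow to hundreds of words.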

Setting a Random Seed

A specific random seed can be used to compute the same terms and probabilities during subsequent runs. You can specify the random seed, as follows:

    result = lda(documents, 2, 5, null, null, null, 123);
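Because the seed is the seventh positional argument, the call needs placeholder `null` values before it. One way to make such calls easier to read is a thin wrapper that takes named options. The sketch below assumes only the argument order shown in the examples: `runLda` and its option names are hypothetical, and `recordArguments` is a stand-in for the real function so that the block runs without the package installed.

```javascript
// Call an LDA-style function with named options instead of positional nulls.
// `ldaFn` is expected to accept the seven positional arguments shown above.
function runLda(ldaFn, documents, options) {
  options = options || {};
  return ldaFn(
    documents,
    options.topics,
    options.terms,
    options.languages || null, // null, as in the seed example above
    null,                      // fifth argument left unset, as in the example
    null,                      // sixth argument left unset, as in the example
    options.seed
  );
}

// Stand-in for the real function: returns the arguments it received.
function recordArguments() {
  return Array.prototype.slice.call(arguments);
}

var received = runLda(recordArguments, ['Cats are small.'], { topics: 2, terms: 5, seed: 123 });
console.log(JSON.stringify(received));
// [["Cats are small."],2,5,null,null,null,123]
```

With the package installed, the call would presumably look like `runLda(require('lda'), documents, { topics: 2, terms: 5, seed: 123 })`.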

Author

Kory Becker http://www.primaryobjects.com

Based on the original JavaScript implementation: https://github.com/awaisathar/lda.js