Compendium

Date: 2018-06-04

English NLP for Node.js and the browser. Try it in your browser!

35k gzipped, Part-of-Speech tagging (92% on Penn treebank), entity recognition, sentiment analysis and more, MIT licensed.

Client-side install

Step 1: get the lib

Install it with bower:

bower install --save compendium

Or clone this repo and copy the dist/compendium.minimal.js file into your project.

Step 2: include the lib in your HTML page

<script type="text/javascript" src="path/to/compendium/dist/compendium.minimal.js"></script>

In order to ensure that Compendium will work as intended, you must specify the encoding of the HTML page as UTF-8.

Step 3: enjoy

Call the compendium.analyse function with a string as parameter, and get a complete analysis of the text.

console.log(compendium.analyse('Hello world :)'));

Node.js install

Step 1: get the lib

npm install --save compendium-js

Step 2: enjoy

var compendium = require('compendium-js');
console.log(compendium.analyse('Hello world :)'));

API

The main function to call is analyse.

It takes a string as its first argument and returns an array containing one analysis per sentence. For example, calling:

compendium.analyse('My name is Dr. Jekyll.');

will return an array like this one:

[ {
    time: 9,                          // Time of processing, in ms
    length: 6,                        // Count of tokens
    raw: 'My name is Dr. Jekyll .',   // Raw string
    stats: {
        confidence: 0.4583,           // PoS tagging confidence
        p_foreign: 0,                 // Percentage of foreign PoS tags, e.g. `FW`
        p_upper: 0,                   // Percentage of uppercased tokens, e.g. `HELLO`
        p_cap: 50,                    // Percentage of capitalized tokens, e.g. `Hello`
        avg_length: 3                 // Average token length
    },
    profile: {
        label: 'neutral',             // Sentiment: `negative`, `neutral`, `positive`, `mixed`
        sentiment: 0,                 // Sentiment score
        amplitude: 0,                 // Sentiment amplitude
        types: [],                    // Types ('tags') of sentence
        politeness: 0,                // Politeness score
        dirtiness: 0,                 // Dirtiness score
        negated: false                // Is sentence mainly negated
    },
    entities: [ {                     // List of entities
        raw: 'Dr. Jekyll',            // Raw reconstructed entity
        norm: 'doctor jekyll',        // Normalized entity
        fromIndex: 3,                 // Start token index
        toIndex: 4,                   // End token index
        type: null                    // Type of entity: null for unknown, `ip`, `email`...
    } ],
    tags:                             // Array of PoS tags
        [ 'PRP$', 'NN', 'VBZ', 'NNP', 'NNP', '.' ],
    tokens: [                         // Tokens details
        {
            raw: 'My',                // Raw token
            norm: 'my',               // Normalized
            pos: 'PRP$',              // PoS tag
            profile: {
                sentiment: 0,         // Sentiment score
                emphasis: 1,          // Emphasis multiplier
                negated: false,       // Is negated
                breakpoint: false     // Is breakpoint
            },
            attr: {
                acronym: false,       // Is acronym
                plural: false,        // Is plural
                abbr: false,          // Is an abbreviation
                verb: false,          // Is a verb
                entity: -1            // Entity index, `-1` if no entity
            }
        },
        //
        // ... Other tokens
        //
    ]
} ]
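To make the shape of that result more concrete, here is a minimal sketch that walks one sentence analysis, pairs each token with its PoS tag and resolves entity spans back to tokens. It runs on a hand-written object that mirrors the documented fields rather than on a live call. The sample tokens are abbreviated to `raw` and `pos`, with values inferred from the `raw` string and `tags` array of the example, and `summarise` is a hypothetical helper, not part of Compendium.

```javascript
// Hand-written sample mirroring the documented result shape (abbreviated).
const sampleAnalysis = [{
    raw: 'My name is Dr. Jekyll .',
    profile: { label: 'neutral', sentiment: 0 },
    entities: [
        { raw: 'Dr. Jekyll', norm: 'doctor jekyll', fromIndex: 3, toIndex: 4, type: null }
    ],
    tags: ['PRP$', 'NN', 'VBZ', 'NNP', 'NNP', '.'],
    // Only `raw` and `pos` are kept here; real tokens carry more fields.
    tokens: [
        { raw: 'My', pos: 'PRP$' }, { raw: 'name', pos: 'NN' },
        { raw: 'is', pos: 'VBZ' }, { raw: 'Dr.', pos: 'NNP' },
        { raw: 'Jekyll', pos: 'NNP' }, { raw: '.', pos: '.' }
    ]
}];

// Hypothetical helper: one compact summary per analysed sentence.
function summarise(analysis) {
    return analysis.map(function (sentence) {
        return {
            label: sentence.profile.label,
            tagged: sentence.tokens.map(function (t) {
                return t.raw + '/' + t.pos;
            }).join(' '),
            entities: sentence.entities.map(function (e) {
                // In this sample, fromIndex and toIndex are inclusive token indexes.
                const span = sentence.tokens.slice(e.fromIndex, e.toIndex + 1);
                return { norm: e.norm, tokens: span.map(function (t) { return t.raw; }) };
            })
        };
    });
}

console.log(JSON.stringify(summarise(sampleAnalysis), null, 2));
```

With the library loaded, the same helper could presumably be fed the return value of compendium.analyse directly, since it only reads documented fields.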

Skipping detectors

Since version 0.0.26, you can speed up the analysis by passing the skipDetectors argument to the analyse function, which skips the listed detectors.

Skippable detectors are the following:

sentiment: Sentiment analysis

entities: Entity extraction

negation: Negation detection

type: Type detection (declarative, interrogative...)

numeric: Numeric values extraction

For example, the following call to analyse won't run the entity extraction detector, meaning that Dr. Jekyll won't appear in the entities section of the analysis result:

compendium.analyse('My name is Dr. Jekyll.', null, ['entities']);
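Because a misspelled detector name is easy to write, a small guard can help. The sketch below checks a skip list against the five names documented above before handing it to an analyse-like function. `analyseSkipping` and `stubAnalyse` are hypothetical; only the call shape `analyse(text, null, skipDetectors)` comes from the example above, and how Compendium itself reacts to unknown names is not covered here.

```javascript
// The five detector names listed in this section.
const SKIPPABLE = ['sentiment', 'entities', 'negation', 'type', 'numeric'];

// Hypothetical wrapper: rejects unknown detector names, then delegates.
function analyseSkipping(analyse, text, skip) {
    const unknown = skip.filter(function (name) {
        return SKIPPABLE.indexOf(name) === -1;
    });
    if (unknown.length > 0) {
        throw new Error('Unknown detector(s): ' + unknown.join(', '));
    }
    return analyse(text, null, skip);
}

// Stub standing in for compendium.analyse so the sketch runs on its own.
function stubAnalyse(text, unused, skipDetectors) {
    return { text: text, skipped: skipDetectors };
}

console.log(analyseSkipping(stubAnalyse, 'My name is Dr. Jekyll.', ['entities']));
```

With the library loaded, the stub would be replaced by a function that forwards its arguments to compendium.analyse.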

Processing overview

Decoding

Handles decoding of HTML entities (e.g. &amp; to &), and normalization of some abbreviations that involve breakpoint characters (e.g. w/ to with).
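The two steps described above can be sketched as follows. This is an illustration under stated assumptions, not Compendium's decoder: the entity and abbreviation tables are deliberately tiny, and `decode` is a hypothetical name.

```javascript
// Illustrative tables only; Compendium's own lists are not reproduced here.
const HTML_ENTITIES = { '&amp;': '&', '&lt;': '<', '&gt;': '>', '&quot;': '"', '&#39;': "'" };
const BREAKPOINT_ABBREVIATIONS = { 'w/': 'with' };

function decode(text) {
    // Step 1: replace known HTML entities in a single pass, so that
    // '&amp;lt;' becomes '&lt;' and is not decoded a second time.
    const decoded = text.replace(/&(?:amp|lt|gt|quot|#39);/g, function (entity) {
        return HTML_ENTITIES[entity];
    });
    // Step 2: expand abbreviations that contain a breakpoint character,
    // but only when they stand alone between whitespace.
    return decoded.split(/(\s+)/).map(function (chunk) {
        const key = chunk.toLowerCase();
        return Object.prototype.hasOwnProperty.call(BREAKPOINT_ABBREVIATIONS, key)
            ? BREAKPOINT_ABBREVIATIONS[key]
            : chunk;
    }).join('');
}

console.log(decode('Fish &amp; chips w/ vinegar')); // prints "Fish & chips with vinegar"
```

Running this step before the lexer means the slash in w/ never reaches the tokenizer as a separate punctuation mark.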

Lexer

No good part-of-speech tagging is possible without a good lexer. A lot of effort has been put into Compendium's lexer so that it provides the right tokens to be processed. Currently the lexer is a combination of four passes:

A first pass splits the text into sentences

A second one applies some regular expressions to extract specific parts of the sentences (URLs, emails, emoticons...)

The third pass is a character-by-character parser that splits a sentence into tokens, relying on Punycode.js to properly handle emojis

The final pass consolidates tokens such as acronyms, abbreviations, contractions..., and handles a few exceptions
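The four passes can be illustrated with a much reduced sketch. Everything in it is an assumption made for illustration: the abbreviation list, the regular expressions and the function names are hypothetical, `Array.from` stands in for Punycode.js when iterating by code point, and the consolidation pass only handles abbreviations.

```javascript
const ABBREVIATIONS = ['dr.', 'mr.', 'mrs.']; // illustrative list
// URLs, emails and a few emoticons (illustrative patterns).
const SPECIAL = /https?:\/\/\S*[^\s.,!?]|[\w.+-]+@[\w-]+(?:\.[\w-]+)+|[:;]-?[()DP]/g;

// Pass 1: close a sentence after a word ending in ., ! or ?,
// unless that word is a known abbreviation.
function splitSentences(text) {
    const sentences = [];
    let words = [];
    text.split(/\s+/).filter(Boolean).forEach(function (word) {
        words.push(word);
        if (/[.!?]$/.test(word) && ABBREVIATIONS.indexOf(word.toLowerCase()) === -1) {
            sentences.push(words.join(' '));
            words = [];
        }
    });
    if (words.length > 0) sentences.push(words.join(' '));
    return sentences;
}

// Pass 2: cut out spans matched by SPECIAL so later passes leave them intact.
function extractSpecial(sentence) {
    const parts = [];
    let last = 0;
    sentence.replace(SPECIAL, function (match, offset) {
        if (offset > last) parts.push({ text: sentence.slice(last, offset), special: false });
        parts.push({ text: match, special: true });
        last = offset + match.length;
        return match;
    });
    if (last < sentence.length) parts.push({ text: sentence.slice(last), special: false });
    return parts;
}

// Pass 3: walk the text one code point at a time.
function tokenizeChars(text) {
    const tokens = [];
    let current = '';
    Array.from(text).forEach(function (ch) {
        if (/[\p{L}\p{N}']/u.test(ch)) {
            current += ch; // letters, digits and apostrophes extend the current token
            return;
        }
        if (current) { tokens.push(current); current = ''; }
        if (!/\s/.test(ch)) tokens.push(ch); // punctuation becomes its own token
    });
    if (current) tokens.push(current);
    return tokens;
}

// Pass 4: glue an abbreviation back to the period that follows it.
function consolidate(tokens) {
    const out = [];
    for (let i = 0; i < tokens.length; i += 1) {
        const merged = tokens[i] + '.';
        if (tokens[i + 1] === '.' && ABBREVIATIONS.indexOf(merged.toLowerCase()) !== -1) {
            out.push(merged);
            i += 1; // skip the period that was merged
        } else {
            out.push(tokens[i]);
        }
    }
    return out;
}

function lex(text) {
    return splitSentences(text).map(function (sentence) {
        const tokens = [];
        extractSpecial(sentence).forEach(function (part) {
            if (part.special) tokens.push(part.text);
            else tokenizeChars(part.text).forEach(function (t) { tokens.push(t); });
        });
        return consolidate(tokens);
    });
}

console.log(lex('My name is Dr. Jekyll. Mail me at j@hyde.org :)'));
```

Keeping the passes as separate functions mirrors the description above and makes each one testable on its own, at the cost of walking the text several times.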

Cleaner

This small component runs after the lexer and normalizes a few slang terms (e.g. gr8 to great).
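As a rough illustration of this step, the sketch below maps raw tokens to a normalized form using a lookup table. The table holds only the gr8 example given above, and both the `clean` function and the idea that the result is stored in a `norm` field next to `raw` are assumptions based on the documented token shape.

```javascript
// Illustrative slang table; only the `gr8` entry comes from the text above.
const SLANG = new Map([['gr8', 'great']]);

// Hypothetical cleaner: lowercases each token and replaces known slang.
function clean(tokens) {
    return tokens.map(function (raw) {
        const key = raw.toLowerCase();
        return { raw: raw, norm: SLANG.has(key) ? SLANG.get(key) : key };
    });
}

console.log(clean(['That', 'was', 'gr8']));
```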

Part-of-speech tagging

Tagging is performed using a Brill tagger (i.e. a base lexicon and a set of rules), with the addition of some inflection-based rules.
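A Brill-style tagger can be sketched in a few lines: look each word up in a lexicon, guess unknown words from their ending, then apply contextual transformation rules. The lexicon, the suffix rules and the single contextual rule below are toy data chosen for illustration, and `brillTag` is a hypothetical name; none of it is taken from Compendium's sources.

```javascript
// Toy lexicon: word -> most likely tag.
const TOY_LEXICON = new Map([
    ['the', 'DT'], ['dog', 'NN'], ['wants', 'VBZ'], ['to', 'TO'], ['run', 'NN']
]);

// Inflection-based guesses for words missing from the lexicon.
const SUFFIX_RULES = [
    { suffix: 'ly', tag: 'RB' },
    { suffix: 'ing', tag: 'VBG' },
    { suffix: 'ed', tag: 'VBD' }
];

// Contextual transformations: a noun right after "to" becomes a base-form verb.
const CONTEXT_RULES = [
    { from: 'NN', to: 'VB', prevTag: 'TO' }
];

function brillTag(tokens) {
    // Initial tagging: lexicon lookup, then suffix guess, then default to noun.
    const tags = tokens.map(function (token) {
        const word = token.toLowerCase();
        if (TOY_LEXICON.has(word)) return TOY_LEXICON.get(word);
        const rule = SUFFIX_RULES.find(function (r) { return word.endsWith(r.suffix); });
        return rule ? rule.tag : 'NN';
    });
    // Transformation step: apply each contextual rule from left to right.
    CONTEXT_RULES.forEach(function (rule) {
        for (let i = 1; i < tags.length; i += 1) {
            if (tags[i] === rule.from && tags[i - 1] === rule.prevTag) tags[i] = rule.to;
        }
    });
    return tags;
}

console.log(brillTag(['The', 'dog', 'wants', 'to', 'run', 'quickly']));
// prints [ 'DT', 'NN', 'VBZ', 'TO', 'VB', 'RB' ]
```

Here "run" starts as NN from the lexicon and is corrected to VB by the contextual rule, while "quickly" is absent from the lexicon and is tagged RB from its suffix.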

It was inspired by the following projects, which are worth checking out:

Eric Brill's tagger: the latest implementation, published under the MIT license, is available for download on the Plymouth University website at this link (direct download).

Mark Watson's FastTag Java library, a very simple implementation of Brill's tagger.

NLP Compromise, another great JS NLP toolkit, with an interesting inflection-based approach

PoS tagging is tested against a set of unit tests generated with the Stanford PoS tagger, double-checked with common sense and another machine-learning-oriented tagger, and then evaluated using the Penn Treebank dataset.

In September 2015, Compendium's PoS tagging score on the Penn Treebank was 92.76% of tags recognized for the browser version, and 94.31% for the Node.js version.
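A score of this kind is commonly computed as the share of tokens whose predicted tag matches the reference tag. The sketch below assumes that definition; the project's actual evaluation script is not shown here, and `tagAccuracy` is a hypothetical helper.

```javascript
// Percentage of positions where the predicted tag equals the reference tag.
function tagAccuracy(predicted, gold) {
    if (predicted.length !== gold.length) {
        throw new Error('Tag sequences must have the same length');
    }
    if (gold.length === 0) return 0;
    const correct = predicted.filter(function (tag, i) {
        return tag === gold[i];
    }).length;
    return (100 * correct) / gold.length;
}

// One wrong tag out of six.
const predictedTags = ['PRP$', 'NN', 'VBZ', 'NN', 'NNP', '.'];
const referenceTags = ['PRP$', 'NN', 'VBZ', 'NNP', 'NNP', '.'];
console.log(tagAccuracy(predictedTags, referenceTags).toFixed(2) + '%'); // prints "83.33%"
```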

Dependency parsing

Warning: the following process has proved hard to extend, and isn't powerful enough given the amount of code it already takes. It's being replaced in v1.0 by another one currently in development [September 5th, 2015].

Dependency parsing module. Still experimental, and requires a lot of additional rules, but promising.

Inspired to some extent by Syntex from Didier Bourigault ref. (fr).

Constraint based. Constraints are:

The governor is the head of the sentence (it doesn't have a master)

When possible, the governor is the first conjugated verb of the sentence

All other tokens must have a master

A token can have one and only one master

A master can have one or many dependencies

If no master is found for a token, then its master is the governor

Parsing is done through several passes:

First pass defines direct dependencies from left to right

Second pass defines direct dependencies from right to left

Third pass consolidates linked indirect dependencies using existing masters

Final pass consolidates unlinked indirect dependencies
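The constraints listed above (not the four passes themselves, whose rules are not detailed here) can be sketched as follows. The set of tags treated as conjugated verbs, the fallback to the first token when no verb is found, the input `masters` array and the function names are all assumptions made for the example.

```javascript
// Assumption: which PoS tags count as a conjugated verb.
const CONJUGATED = ['VBD', 'VBP', 'VBZ'];

// The governor is the first conjugated verb when there is one.
// Falling back to the first token otherwise is an assumption.
function findGovernor(tags) {
    const index = tags.findIndex(function (tag) {
        return CONJUGATED.indexOf(tag) !== -1;
    });
    return index === -1 ? 0 : index;
}

// `masters[i]` is the index of token i's master, or null when no pass found one.
// Storing a single index per token enforces "one and only one master",
// while several tokens may share the same master.
function attachOrphans(tags, masters) {
    const governor = findGovernor(tags);
    return masters.map(function (master, i) {
        if (i === governor) return null;            // the governor has no master
        return master === null ? governor : master; // orphans attach to the governor
    });
}

// 'My name is Dr. Jekyll .' with hypothetical links found by earlier passes:
// My -> name, name -> is, Dr. -> Jekyll.
const exampleTags = ['PRP$', 'NN', 'VBZ', 'NNP', 'NNP', '.'];
const partialMasters = [1, 2, null, 4, null, null];
console.log(attachOrphans(exampleTags, partialMasters));
// prints [ 1, 2, null, 4, 2, 2 ]
```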

Detectors

From this point on, detectors handle further analysis of the text. They are in charge of adding metadata to the analysis, such as the sentiment score and label.

These detectors can work at three different levels:

the token level

the sentence level

the text (global) level

Token level detectors add attributes to each token (sentiment and emphasis scores, normalized token...).

Sentence level detectors work across many tokens (negation detection, entity recognition, sentiment analysis...).

Global level detectors (there are none yet) are supposed to provide a global analysis of the whole text: topics, global sentiment labelling...
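The difference between the first two levels can be sketched with a toy sentiment pipeline: a token level detector annotates each token on its own, then a sentence level detector aggregates across tokens. The scores, the labelling rule (including when a sentence counts as mixed) and all function names are assumptions made for illustration, not Compendium's actual logic.

```javascript
// Illustrative sentiment scores.
const TOY_SCORES = new Map([['great', 2], ['awful', -2]]);

// Token level: looks at one token at a time.
function tokenSentimentDetector(token) {
    token.profile = { sentiment: TOY_SCORES.get(token.norm) || 0 };
}

// Sentence level: works across all tokens of the sentence.
function sentenceSentimentDetector(sentence) {
    let positive = 0;
    let negative = 0;
    sentence.tokens.forEach(function (token) {
        const score = token.profile.sentiment;
        if (score > 0) positive += score;
        if (score < 0) negative += score;
    });
    let label = 'neutral';
    if (positive > 0 && negative < 0) label = 'mixed';
    else if (positive > 0) label = 'positive';
    else if (negative < 0) label = 'negative';
    sentence.profile = { sentiment: positive + negative, label: label };
}

// Token level detectors run first so sentence level ones can rely on their output.
function runDetectors(sentence) {
    sentence.tokens.forEach(function (token) { tokenSentimentDetector(token); });
    sentenceSentimentDetector(sentence);
    return sentence;
}

const analysed = runDetectors({
    tokens: [{ norm: 'the' }, { norm: 'food' }, { norm: 'was' }, { norm: 'great' }]
});
console.log(analysed.profile); // prints { sentiment: 2, label: 'positive' }
```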

Lexicons

The full lexicon for Node.js is based on the lexicon from Mark Watson's FastTag (around 90 000 terms, itself imported from the Penn Treebank).

The minimal lexicon for the browser contains only a few thousand terms extracted from the full lexicon, and filtered using:

the list of the 10000 most common English words, an extract from Google's Trillion Word Corpus

the list of scored sentiment words

Compendium suffixes detector

Compendium embedded knowledge
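The first two filters in that list can be sketched as a simple set intersection: a term from the full lexicon is kept only if it appears among the common words or the sentiment words. How the suffix detector and the embedded knowledge take part in the filtering is not described here, so they are left out. The function name and the sample data are hypothetical.

```javascript
// Keeps only the entries of `fullLexicon` whose word is in one of the two lists.
function buildMinimalLexicon(fullLexicon, commonWords, sentimentWords) {
    const keep = new Set(commonWords.concat(sentimentWords));
    const minimal = {};
    Object.keys(fullLexicon).forEach(function (word) {
        if (keep.has(word)) minimal[word] = fullLexicon[word];
    });
    return minimal;
}

// Toy data standing in for the real lexicon and word lists.
const toyFullLexicon = { dog: 'NN', quickly: 'RB', great: 'JJ', zygote: 'NN' };
const toyMinimal = buildMinimalLexicon(toyFullLexicon, ['dog', 'quickly'], ['great']);
console.log(toyMinimal); // prints { dog: 'NN', quickly: 'RB', great: 'JJ' }
```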

Here is the list of Part-of-Speech tags used by Compendium. Newly introduced tags are listed at the bottom.

Tag    Description            Examples
,      Comma                  ,
:      Mid-sent punct.        : ;
.      Sent-final punct       . ! ?
"      quote                  "
(      Left paren             (
)      Right paren            )
#      Pound sign             #
CC     Coord. conjunction     and, but, or
CD     Cardinal number        one, two, 1, 2
DT     Determiner             the, some
EX     Existential there      there
FW     Foreign Word           mon dieu
IN     Preposition            of, in, by
JJ     Adjective              big
JJR    Adj., comparative      bigger
JJS    Adj., superlative      biggest
LS     List item marker       1, One
MD     Modal                  can, should
NN     Noun, sing. or mass    dog
NNP    Proper noun, sing.     Edinburgh
NNPS   Proper noun, plural    Smiths
NNS    Noun, plural           dogs
PDT    Predeterminer          all, both
POS    Possessive ending      's
PP     Personal pronoun       I, you, she
PRP$   Possessive pronoun     my, one's
RB     Adverb                 quickly, not
RBR    Adverb, comparative    faster
RBS    Adverb, superlative    fastest
RP     Particle               up, off
SYM    Symbol                 +, %, &
TO     'to'                   to
UH     Interjection           oh, oops
VB     verb, base form        eat
VBD    verb, past tense       ate
VBG    verb, gerund           eating
VBN    verb, past part        eaten
VBP    Verb, present          eat
VBZ    Verb, present          eats
WDT    Wh-determiner          which, that
WP     Wh pronoun             who, what
WP$    Possessive-Wh          whose
WRB    Wh-adverb              how, where

Compendium also includes the following new tag:

EM Emoticon :) :(

Development

Go to the wiki to get more details about the project.

License

The MIT License (MIT)

Copyright (c) 2015 Ulflander

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.