Date: 2015-10-15

JavaScript text tokenizer that is easy to use and compose.

Installation

$ npm install tokenize-text

Usage

var Tokenizer = require('tokenize-text');
var tokenize = new Tokenizer();

tokenize.split(fn) is the main method of this module; all other methods are built on it.

fn will be called with 4 arguments:

text : text value of the token ( text == currentToken.value )

currentToken : current token object

prevToken : previous token (or null)

nextToken : next token (or null)

fn should return a string, an array of strings, a token, or an array of tokens.

tokenize.split(fn) returns a tokenizer function that accepts either a list of tokens or a string (a string is converted to a single token).

The tokenizer function returns an array of tokens with the following properties:

value : text content of the token

index : absolute position in the original text

offset : length of the token (equivalent to value.length )

var splitIn2 = tokenize.split(function (text, currentToken, prevToken, nextToken) {
    return [
        text.slice(0, text.length / 2),
        text.slice(text.length / 2)
    ];
});

var tokens = splitIn2('hello');
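To make the token shape described above (value, index, offset) concrete, here is a standalone sketch that does not call tokenize-text. toTokens is a hypothetical helper, not part of the module, and it assumes the pieces returned by the callback are contiguous in the original text.

```javascript
// Standalone sketch: does NOT use tokenize-text.
// toTokens is a hypothetical helper that turns the strings returned by a
// split callback into token objects with the documented properties.
function toTokens(pieces, startIndex) {
    var index = startIndex;
    return pieces.map(function (piece) {
        var token = { value: piece, index: index, offset: piece.length };
        index += piece.length; // assumption: pieces are contiguous in the text
        return token;
    });
}

var sampleText = 'hello';
var half = Math.floor(sampleText.length / 2);
var halves = toTokens([sampleText.slice(0, half), sampleText.slice(half)], 0);

console.log(halves);
// [ { value: 'he', index: 0, offset: 2 },
//   { value: 'llo', index: 2, offset: 3 } ]
```

Under these assumptions, index is the position of each piece in the original string and offset is simply its length.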

Tokenize using a regular expression:

var extractUppercase = tokenize.re(/[A-Z]/);
var tokens = extractUppercase('aBcD');
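The idea of regular-expression tokenization can be sketched in plain JavaScript. This is an illustration only and not the module's implementation; matchTokens is a hypothetical name, and the token fields follow the value/index/offset shape described earlier.

```javascript
// Standalone sketch: does NOT use tokenize-text.
// matchTokens (hypothetical) collects every match of `re` in `text`
// as a token with value, index and offset.
function matchTokens(re, text) {
    // Copy the pattern with the global flag so exec() advances through the text.
    var flags = re.flags.indexOf('g') === -1 ? re.flags + 'g' : re.flags;
    var globalRe = new RegExp(re.source, flags);
    var found = [];
    var match;
    while ((match = globalRe.exec(text)) !== null) {
        if (match[0].length === 0) {
            globalRe.lastIndex++; // avoid looping forever on empty matches
            continue;
        }
        found.push({ value: match[0], index: match.index, offset: match[0].length });
    }
    return found;
}

var uppercase = matchTokens(/[A-Z]/, 'aBcD');

console.log(uppercase);
// [ { value: 'B', index: 1, offset: 1 },
//   { value: 'D', index: 3, offset: 1 } ]
```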

Tokenize and split into characters. tokenize.characters() is equivalent to tokenize.re(/[^\s]/).

var tokens = tokenize.characters()( 'abc' );

Split into sections. Sections are separated by any of these characters: . , ; ! ?

var tokens = tokenize.sections()( 'this is sentence 1. this is sentence 2' );
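As a rough illustration of splitting on those delimiters, the sketch below works on a plain string without the module. splitSections is a hypothetical function, and two behaviors in it are assumptions rather than documented facts: surrounding whitespace is trimmed from each section, and empty sections are dropped.

```javascript
// Standalone sketch: does NOT use tokenize-text.
// splitSections (hypothetical) splits on . , ; ! ? and reports each
// section with value, index and offset.
function splitSections(text) {
    var re = /[^.,;!?]+/g; // runs of characters that are not delimiters
    var sections = [];
    var match;
    while ((match = re.exec(text)) !== null) {
        var raw = match[0];
        var trimmed = raw.trim(); // assumption: whitespace is trimmed
        if (trimmed.length === 0) {
            continue; // assumption: empty sections are dropped
        }
        // Number of leading whitespace characters removed by trim().
        var lead = raw.length - raw.replace(/^\s+/, '').length;
        sections.push({
            value: trimmed,
            index: match.index + lead,
            offset: trimmed.length
        });
    }
    return sections;
}

var sentenceSections = splitSections('this is sentence 1. this is sentence 2');

console.log(sentenceSections);
// [ { value: 'this is sentence 1', index: 0, offset: 18 },
//   { value: 'this is sentence 2', index: 20, offset: 18 } ]
```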

Split into words:

var tokens = tokenize.words()( 'hello, how are you?' );

Filter the list of tokens by calling fn(token):

var extractNames = tokenize.filter(function (word, current, prev) {
    return (prev && /[A-Z]/.test(word[0]));
});

var words = tokenize.words()('My name is Samy.');
var tokens = extractNames(words);
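The filter callback above can be traced with a small standalone sketch that does not use the module. filterTokens is a hypothetical helper; it assumes the callback receives the token text, the current token, and its neighbors (or null), and that tokens are kept when the callback returns a truthy value. The word tokens are written out by hand for the sentence 'My name is Samy.'.

```javascript
// Standalone sketch: does NOT use tokenize-text.
// filterTokens (hypothetical) keeps the tokens for which fn is truthy.
function filterTokens(tokens, fn) {
    return tokens.filter(function (token, i) {
        var prev = i > 0 ? tokens[i - 1] : null;
        var next = i < tokens.length - 1 ? tokens[i + 1] : null;
        return Boolean(fn(token.value, token, prev, next));
    });
}

// Hand-written word tokens for 'My name is Samy.'
var wordTokens = [
    { value: 'My', index: 0, offset: 2 },
    { value: 'name', index: 3, offset: 4 },
    { value: 'is', index: 8, offset: 2 },
    { value: 'Samy', index: 11, offset: 4 }
];

var nameTokens = filterTokens(wordTokens, function (word, current, prev) {
    return prev && /[A-Z]/.test(word[0]);
});

console.log(nameTokens);
// [ { value: 'Samy', index: 11, offset: 4 } ]
```

The `prev &&` check is what excludes 'My': it is capitalized, but as the first token it has no previous token.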

Creates a tokenizer that returns the result of invoking the provided tokenizers for each input token.

var extractNames = tokenize.flow(
    tokenize.words(),
    tokenize.filter(function (word, current, prev) {
        return (prev && /[A-Z]/.test(word[0]));
    })
);

var tokens = extractNames('My name is Samy.');

To execute all tokenizers in series, you can use tokenize.serie(fn1, fn2, [...]) instead.
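One way to read "in series" is plain left-to-right composition, where the output of each function becomes the input of the next. The sketch below shows that pattern on ordinary string functions. inSeries, splitOnSpaces, and keepCapitalized are hypothetical names, and whether tokenize.serie behaves exactly like this is an assumption.

```javascript
// Standalone sketch: does NOT use tokenize-text.
// inSeries (hypothetical) chains functions left to right.
function inSeries() {
    var fns = Array.prototype.slice.call(arguments);
    return function (input) {
        return fns.reduce(function (acc, fn) {
            return fn(acc);
        }, input);
    };
}

// Two simple stand-ins for tokenizers, working on plain strings.
var splitOnSpaces = function (text) {
    return text.split(' ');
};
var keepCapitalized = function (words) {
    return words.filter(function (w) {
        return /^[A-Z]/.test(w);
    });
};

var pipeline = inSeries(splitOnSpaces, keepCapitalized);
var capitalized = pipeline('My name is Samy.');

console.log(capitalized);
// [ 'My', 'Samy.' ]
```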

Examples

Extract repeated words in sentences

Example to extract all repeated words in sentences: