Multilingual tokenizer that automatically tags each token with its type

Tokenize sentences in Latin and Devanagari scripts using wink-tokenizer. Some of its top features are outlined below:

Support for English, French, German, Hindi, Sanskrit, Marathi and many more.

Intelligent tokenization of sentences containing words in more than one language.

Automatic detection and tagging of different types of tokens based on their features, including word, punctuation, email, mention, hashtag, emoticon, and emoji.

User-definable token types.

High performance: tokenizes a typical English sentence at over 2.4 million tokens/second, and a complex tweet containing hashtags, emoticons, emojis, mentions and an e-mail address at over 1.5 million tokens/second (benchmarked on a 2.2 GHz Intel Core i7 machine with 16 GB RAM).
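To check throughput on your own hardware, a rough measurement along the following lines may help. This is only a sketch and not the benchmark behind the figures quoted above: `measureThroughput` and `whitespaceTokenize` are hypothetical names introduced here, and the whitespace splitter is a stand-in so that the snippet runs without any dependency.

```javascript
// Hypothetical helper (not part of wink-tokenizer): measures tokens/second
// for any function that takes a string and returns an array of tokens.
function measureThroughput(tokenizeFn, sentence, iterations) {
  var totalTokens = 0;
  var start = process.hrtime.bigint();
  for (var i = 0; i < iterations; i += 1) {
    totalTokens += tokenizeFn(sentence).length;
  }
  // Elapsed wall-clock time in nanoseconds, converted from BigInt.
  var elapsedNs = Number(process.hrtime.bigint() - start);
  return {
    totalTokens: totalTokens,
    tokensPerSecond: totalTokens / (elapsedNs / 1e9)
  };
}

// Stand-in tokenizer (an assumption for illustration only).
function whitespaceTokenize(text) {
  return text.split(/\s+/).filter(Boolean);
}

var throughput = measureThroughput(whitespaceTokenize, 'to be or not to be', 1000);
console.log(throughput.totalTokens + ' tokens processed');
console.log(Math.round(throughput.tokensPerSecond) + ' tokens/second');
```

To measure wink-tokenizer itself, pass a function that wraps the `tokenize` call of a tokenizer instance in place of `whitespaceTokenize`. Results vary with the machine, the Node.js version and the input text.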

Installation

Use npm to install:

npm install wink-tokenizer --save

Getting Started

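// Load the tokenizer.
var tokenizer = require( 'wink-tokenizer' );
// Create an instance of it.
var myTokenizer = tokenizer();

// Tokenize a tweet containing a mention, an email, an emoji, an emoticon and a hashtag.
var s = '@superman: hit me up on my email r2d2@gmail.com, 2 of us plan party🎉 tom at 3pm:) #fun';
myTokenizer.tokenize( s );

// Tokenize a French sentence.
s = 'Mieux vaut prévenir que guérir:-)';
myTokenizer.tokenize( s );

// Tokenize a Hindi sentence that also contains an English word.
s = 'द्रविड़ ने टेस्ट में ३६ शतक जमाए, उनमें 21 विदेशी playground पर हैं।';
myTokenizer.tokenize( s );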
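Since every token carries its type, a common next step is to group tokens by tag. The sketch below assumes each token is an object of the form `{ value, tag }`; that shape is not spelled out in this README, so confirm it against the API documentation. `groupByTag` is a hypothetical helper, and `sampleTokens` is a hand-written sample rather than captured tokenizer output.

```javascript
// Hypothetical helper (not part of wink-tokenizer): collects token values
// under their tag, preserving the original order within each tag.
function groupByTag(tokens) {
  var groups = {};
  tokens.forEach(function (token) {
    if (!Object.prototype.hasOwnProperty.call(groups, token.tag)) {
      groups[token.tag] = [];
    }
    groups[token.tag].push(token.value);
  });
  return groups;
}

// Hand-written sample in the assumed { value, tag } shape.
var sampleTokens = [
  { value: '@superman', tag: 'mention' },
  { value: 'hit', tag: 'word' },
  { value: 'me', tag: 'word' },
  { value: 'r2d2@gmail.com', tag: 'email' },
  { value: '#fun', tag: 'hashtag' }
];

var grouped = groupByTag(sampleTokens);
console.log(grouped.word);    // prints [ 'hit', 'me' ]
console.log(grouped.mention); // prints [ '@superman' ]
```

If the assumption about the token shape holds, the array returned by `myTokenizer.tokenize( s )` can be passed to `groupByTag` directly.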

Documentation

Check out the tokenizer API documentation to learn more.

Need Help?

If you spot a bug that has not yet been reported, raise a new issue, or consider fixing it and sending a pull request.

About wink

Wink is a family of open source packages for Statistical Analysis, Natural Language Processing and Machine Learning in Node.js. The code is thoroughly documented for easy human comprehension and has test coverage of ~100%, making it reliable enough for building production-grade solutions.

Copyright & License

wink-tokenizer is copyright 2017-21 GRAYPE Systems Private Limited.

It is licensed under the terms of the MIT License.