vtext

NLP in Rust with Python bindings

This package aims to provide a high-performance toolkit for ingesting textual data for machine learning applications.

Features

  • Tokenization: Regexp tokenizer, Unicode segmentation + language specific rules
  • Stemming: Snowball (in Python 15-20x faster than NLTK)
  • Token counting: converting token counts to sparse matrices for use in machine learning libraries. Similar to CountVectorizer and HashingVectorizer in scikit-learn, but with less broad functionality.
  • Levenshtein edit distance; Sørensen-Dice, Jaro, Jaro Winkler string similarities
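
To make the string metrics listed above concrete, here is a minimal pure-Python sketch of two of them. It is not vtext's implementation (which is written in Rust); the function names `levenshtein` and `dice_similarity` are hypothetical, and the Sørensen-Dice variant shown assumes sets of character bigrams.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def dice_similarity(a: str, b: str) -> float:
    """Sørensen-Dice coefficient over sets of character bigrams (an assumption)."""
    bigrams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    bigrams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not bigrams_a and not bigrams_b:
        # Strings shorter than two characters have no bigrams.
        return 1.0 if a == b else 0.0
    return 2 * len(bigrams_a & bigrams_b) / (len(bigrams_a) + len(bigrams_b))


print(levenshtein("kitten", "sitting"))    # 3
print(dice_similarity("night", "nacht"))   # 0.25
```

The edit distance keeps only two rows of the dynamic-programming table, so memory use grows with the length of the second string rather than with the product of both lengths.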

Usage

Usage in Python

vtext requires Python 3.6+ and can be installed with,

pip install vtext

Below is a simple tokenization example,

>>> from vtext.tokenize import VTextTokenizer
>>> VTextTokenizer("en").tokenize("Flights can't depart after 2:00 pm.")
["Flights", "ca", "n't", "depart", "after", "2:00", "pm", "."]

For more details see the project documentation: vtext.io/doc/latest/index.html
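
For comparison with the output above, a plain regular-expression tokenizer can be sketched with Python's standard `re` module. The pattern is an assumption borrowed from scikit-learn's default `token_pattern`, not necessarily what vtext uses, and `regexp_tokenize` is a hypothetical helper name.

```python
import re
from typing import List

# Assumed pattern: runs of two or more word characters between word boundaries.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def regexp_tokenize(text: str) -> List[str]:
    """Return every non-overlapping match of TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)


tokens = regexp_tokenize("Flights can't depart after 2:00 pm.")
print(tokens)
# ['Flights', 'can', 'depart', 'after', '00', 'pm']
```

Such a pattern drops punctuation, splits the contraction at the apostrophe and breaks up the time expression, which is the kind of case that language-specific rules are meant to handle.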

Usage in Rust

Add the following to Cargo.toml,

[dependencies]
vtext = "0.2.0"

For more details see the Rust documentation: docs.rs/vtext

Benchmarks

Tokenization

The following benchmarks illustrate the tokenization accuracy (F1 score) on UD treebanks,

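| lang | dataset | regexp | spacy 2.1 | vtext |
|------|---------|--------|-----------|-------|
| en   | EWT     | 0.812  | 0.972     | 0.966 |
| en   | GUM     | 0.881  | 0.989     | 0.996 |
| de   | GSD     | 0.896  | 0.944     | 0.964 |
| fr   | Sequoia | 0.844  | 0.968     | 0.971 |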

and the English tokenization speed,

|                       | regexp | spacy 2.1 | vtext |
|-----------------------|--------|-----------|-------|
| Speed (10⁶ tokens/s)  | 3.1    | 0.14      | 2.1   |
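
This README does not spell out how the F1 scores are computed. One common approach, sketched below as an assumption, is to compare the character spans of predicted tokens against those of the reference tokens. The helper names `token_spans` and `tokenization_f1` are hypothetical.

```python
from typing import List, Set, Tuple


def token_spans(text: str, tokens: List[str]) -> Set[Tuple[int, int]]:
    """Locate each token in the text, left to right, as (start, end) offsets."""
    spans = set()
    pos = 0
    for tok in tokens:
        start = text.find(tok, pos)
        if start == -1:
            raise ValueError(f"token {tok!r} not found after offset {pos}")
        pos = start + len(tok)
        spans.add((start, pos))
    return spans


def tokenization_f1(text: str, predicted: List[str], gold: List[str]) -> float:
    """F1 over exact span matches between predicted and gold tokens."""
    pred_spans = token_spans(text, predicted)
    gold_spans = token_spans(text, gold)
    if not pred_spans or not gold_spans:
        return 0.0
    true_positives = len(pred_spans & gold_spans)
    # Equivalent to 2 * precision * recall / (precision + recall).
    return 2 * true_positives / (len(pred_spans) + len(gold_spans))


text = "Flights can't depart after 2:00 pm."
gold = ["Flights", "ca", "n't", "depart", "after", "2:00", "pm", "."]
predicted = ["Flights", "can", "depart", "after", "00", "pm"]
print(round(tokenization_f1(text, predicted, gold), 3))  # 0.571
```

Here four of the six predicted spans match four of the eight reference spans, giving 2 × 4 / (6 + 8) = 4/7.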

Text vectorization

Below are benchmarks for converting textual data to a sparse document-term matrix using the 20 newsgroups dataset, run on an Intel(R) Xeon(R) CPU E3-1270 v6 @ 3.80GHz,

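| Speed (MB/s)                  | scikit-learn 0.20.1 | vtext (n_jobs=1) | vtext (n_jobs=4) |
|-------------------------------|---------------------|------------------|------------------|
| CountVectorizer.fit           | 14                  | 104              | 225              |
| CountVectorizer.transform     | 14                  | 82               | 303              |
| CountVectorizer.fit_transform | 14                  | 70               | NA               |
| HashingVectorizer.transform   | 19                  | 89               | 309              |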

Note however that these two estimators in vtext currently support only a fraction of scikit-learn's functionality. See benchmarks/README.md for more details.
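
As an illustration of what a sparse document-term matrix looks like, the sketch below counts whitespace-separated tokens and stores them in compressed sparse row (CSR) form using only the standard library. It is not vtext's or scikit-learn's implementation; `count_vectorize` is a hypothetical helper, and lowercasing plus whitespace splitting is a simplifying assumption.

```python
from collections import Counter
from typing import Dict, List, Tuple


def count_vectorize(docs: List[str]) -> Tuple[Dict[str, int], List[int], List[int], List[int]]:
    """Build a vocabulary and the CSR arrays (indptr, indices, data) of token counts."""
    vocab: Dict[str, int] = {}
    indptr, indices, data = [0], [], []
    for doc in docs:
        # Map each token to its column index, assigning new indices on first sight.
        counts = Counter(vocab.setdefault(tok, len(vocab)) for tok in doc.lower().split())
        for col in sorted(counts):
            indices.append(col)       # column (vocabulary index) of a non-zero entry
            data.append(counts[col])  # how often that token occurs in this document
        indptr.append(len(indices))   # where the next row starts
    return vocab, indptr, indices, data


vocab, indptr, indices, data = count_vectorize(["the cat sat", "the cat saw the dog"])
print(vocab)    # {'the': 0, 'cat': 1, 'sat': 2, 'saw': 3, 'dog': 4}
print(indptr)   # [0, 3, 7]
print(indices)  # [0, 1, 2, 0, 1, 3, 4]
print(data)     # [1, 1, 1, 2, 1, 1, 1]
```

Row `i` of the matrix is described by `indices[indptr[i]:indptr[i + 1]]` and the matching slice of `data`, so only non-zero counts are stored. These three arrays are the same layout that `scipy.sparse.csr_matrix` accepts.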

License

vtext is released under the Apache License, Version 2.0.
