textvec

Text vectorization tool to outperform TFIDF for classification tasks

License: MIT

WHAT: Supervised text vectorization tool

Textvec is a text vectorization tool that aims to implement all the "classic" NLP text vectorization methods in Python. The main idea of the project is to show alternatives to TFIDF, an excellent method that is nonetheless heavily overused for supervised tasks. All interfaces are similar to scikit-learn, so you should be able to test the performance of these supervised methods with only a few changes to your code.

Textvec is compatible with: Python 2.7-3.7.


WHY: Comparison with TFIDF

As you can read in several articles [1, 2], supervised methods outperform unsupervised ones on almost every dataset. Yet most text classification examples on the internet ignore that fact.

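| Method   | IMDB_bin | RT_bin | Airlines Sentiment_bin | Airlines Sentiment_multiclass | 20news_multiclass |
|----------|----------|--------|------------------------|-------------------------------|-------------------|
| TF       | 0.8984   | 0.7571 | 0.9194                 | 0.8084                        | 0.8206            |
| TFIDF    | 0.9052   | 0.7717 | 0.9259                 | 0.8118                        | 0.8575            |
| TFPF     | 0.8813   | 0.7403 | 0.9212                 | NA                            | NA                |
| TFRF     | 0.8797   | 0.7412 | 0.9194                 | NA                            | NA                |
| TFICF    | 0.8984   | 0.7642 | 0.9199                 | 0.8125                        | 0.8292            |
| TFBINICF | 0.8984   | 0.7571 | 0.9194                 | NA                            | NA                |
| TFCHI2   | 0.8898   | 0.7398 | 0.9108                 | NA                            | NA                |
| TFGR     | 0.8850   | 0.7065 | 0.8956                 | NA                            | NA                |
| TFRRF    | 0.8879   | 0.7506 | 0.9194                 | NA                            | NA                |
| TFOR     | 0.9092   | 0.7806 | 0.9207                 | NA                            | NA                |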

Here is a comparison for binary classification on the IMDB sentiment dataset. Labels are sorted by accuracy score, and the heatmap shows the correlation between the different approaches. As you can see, some methods are good candidates for ensembling models or for performing feature selection.

[Figure: Binary comparison]
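The remark about correlation between approaches can be made concrete with a small sketch. It is not taken from the textvec benchmarks: the scores below are made-up toy numbers, the dictionary keys are only labels, and `pearson` is a hypothetical helper written for this illustration. The idea is that two models whose predictions are weakly correlated tend to make different mistakes, which is what makes them interesting to combine in an ensemble.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equally long score lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Toy positive-class scores for six test documents (made-up numbers,
# not results from any textvec benchmark).
scores = {
    "tfidf": [0.9, 0.8, 0.2, 0.1, 0.7, 0.3],
    "tficf": [0.8, 0.9, 0.3, 0.2, 0.6, 0.2],
    "tfor": [0.6, 0.2, 0.4, 0.1, 0.9, 0.7],
}

names = sorted(scores)

# Full correlation matrix, stored as a dict keyed by (row, column).
corr = {(a, b): pearson(scores[a], scores[b]) for a in names for b in names}

# Unordered pairs of distinct models; the least correlated pair is the
# first candidate to try in an ensemble.
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
least_correlated = min(pairs, key=lambda p: corr[p])
```

With real models, the lists would hold the predicted probabilities of each classifier on the same held-out documents, and the `corr` values are what a heatmap would display.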

For more dataset benchmarks (Rotten Tomatoes, airline sentiment), see Binary classification quality comparison.


Install:

To use the released package, install it from PyPI:

pip install textvec

To install from source:

git clone https://github.com/textvec/textvec
cd textvec
pip install .

HOW: Examples

The usage is similar to scikit-learn:

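from sklearn.feature_extraction.text import CountVectorizer
from textvec.vectorizers import TfBinIcfVectorizer

# train_data is assumed to expose the raw documents as train_data.text,
# and y to hold the matching class labels.
cvec = CountVectorizer().fit(train_data.text)

# The supervised vectorizer is fitted on the count matrix and the labels.
tficf_vec = TfBinIcfVectorizer(sublinear_tf=True)
tficf_vec.fit(cvec.transform(train_data.text), y)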

For more detailed examples, see Basic example and the other notebooks in Examples.

Currently implemented methods:

  • TfIcfVectorizer
  • TforVectorizer
  • TfgrVectorizer
  • TfigVectorizer
  • Tfchi2Vectorizer
  • TfrfVectorizer
  • TfrrfVectorizer
  • TfBinIcfVectorizer
  • TfpfVectorizer
  • SifVectorizer
  • TfbnsVectorizer

Most of these vectorization techniques are described in articles [1, 2, 3]. If you see a method with a wrong name or reference, please submit a fix!
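To show what "supervised" means for a weighting scheme, here is a minimal pure-Python sketch of term frequency weighted by inverse class frequency (ICF), the idea behind names such as TfIcfVectorizer. It is an illustration only, not textvec's implementation: the formula `log(1 + n_classes / class_freq)` is one common formulation and is an assumption here, and `fit_icf` and `transform` are hypothetical helper names. Unlike IDF, the weight depends on the labels: a term that occurs in only one class gets a higher weight than a term spread across all classes.

```python
import math
from collections import Counter, defaultdict


def fit_icf(docs, labels):
    """Learn {term: icf}, where icf = log(1 + n_classes / class_freq).

    class_freq is the number of distinct classes whose documents contain
    the term. The formula is an assumed, commonly used variant.
    """
    classes_per_term = defaultdict(set)
    for doc, label in zip(docs, labels):
        for term in set(doc.lower().split()):
            classes_per_term[term].add(label)
    n_classes = len(set(labels))
    return {
        term: math.log(1.0 + n_classes / len(classes))
        for term, classes in classes_per_term.items()
    }


def transform(docs, icf):
    """Weight raw term counts by ICF; terms unseen during fit are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append({t: c * icf[t] for t, c in counts.items() if t in icf})
    return rows


# Tiny made-up corpus: label 1 = positive, label 0 = negative.
docs = ["great movie great cast", "boring movie", "great fun", "boring plot"]
labels = [1, 0, 1, 0]

icf = fit_icf(docs, labels)
weighted = transform(docs, icf)
```

In this toy corpus "movie" appears in both classes while "great" appears only in positive documents, so "great" receives the larger ICF weight. The labels are needed at fit time only; `transform` works on unlabeled documents.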


TODO

  • Docs

REFERENCE
