pre

prenlp

Preprocessing Library for Natural Language Processing

Showing:

Popularity

Downloads/wk

0

GitHub Stars

139

Maintenance

Last Commit

2yrs ago

Contributors

1

Package

Dependencies

5

License

Categories

Readme

PreNLP

PyPI License GitHub stars GitHub forks

Preprocessing Library for Natural Language Processing

Installation

Requirements

  • Python >= 3.6
  • Mecab morphological analyzer for Korean
    sh scripts/install_mecab.sh
    # Only for Mac OS users, run the code below before run install_mecab.sh script.
    # export MACOSX_DEPLOYMENT_TARGET=10.10
    # CFLAGS='-stdlib=libc++' pip install konlpy
    
  • C++ Build tools for fastText

With pip

prenlp can be installed using pip as follows:

pip install prenlp

Usage

Data

Dataset Loading

Popular datasets for NLP tasks are provided in prenlp. All datasets is stored in /.data directory.

  • Sentiment Analysis: IMDb, NSMC
  • Language Modeling: WikiText-2, WikiText-103, WikiText-ko, NamuWiki-ko
DatasetLanguageArticlesSentencesTokensVocabSize
WikiText-2English720-2,551,84333,27813.3MB
WikiText-103English28,595-103,690,236267,735517.4MB
WikiText-koKorean477,9462,333,930131,184,780662,949667MB
NamuWiki-koKorean661,03216,288,639715,535,7781,130,0083.3GB
WikiText-ko+NamuWiki-koKorean1,138,97818,622,569846,720,5581,360,5383.95GB

General use cases are as follows:

WikiText-2 / WikiText-103
>>> wikitext2 = prenlp.data.WikiText2()
>>> len(wikitext2)
3
>>> train, valid, test = prenlp.data.WikiText2()
>>> train[0]
'= Valkyria Chronicles III ='
IMDB
>>> imdb_train, imdb_test = prenlp.data.IMDB()
>>> imdb_train[0]
["Minor Spoilers<br /><br />Alison Parker (Cristina Raines) is a successful top model, living with the lawyer Michael Lerman (Chris Sarandon) in his apartment. She tried to commit ...", 'pos']

Normalization

Frequently used normalization functions for text pre-processing are provided in prenlp.

url, HTML tag, emoticon, email, phone number, etc.

General use cases are as follows:

>>> from prenlp.data import Normalizer
>>> normalizer = Normalizer(url_repl='[URL]', tag_repl='[TAG]', emoji_repl='[EMOJI]', email_repl='[EMAIL]', tel_repl='[TEL]', image_repl='[IMG]')

>>> normalizer.normalize('Visit this link for more details: https://github.com/')
'Visit this link for more details: [URL]'

>>> normalizer.normalize('Use HTML with the desired attributes: <img src="cat.jpg" height="100" />')
'Use HTML with the desired attributes: [TAG]'

>>> normalizer.normalize('Hello 🤩, I love you 💓 !')
'Hello [EMOJI], I love you [EMOJI] !'

>>> normalizer.normalize('Contact me at lyeoni.g@gmail.com')
'Contact me at [EMAIL]'

>>> normalizer.normalize('Call +82 10-1234-5678')
'Call [TEL]'

>>> normalizer.normalize('Download our logo image, logo123.png, with transparent background.')
'Download our logo image, [IMG], with transparent background.'

Tokenizer

Frequently used (subword) tokenizers for text pre-processing are provided in prenlp.

SentencePiece, NLTKMosesTokenizer, Mecab

SentencePiece

>>> from prenlp.tokenizer import SentencePiece
>>> SentencePiece.train(input='corpus.txt', model_prefix='sentencepiece', vocab_size=10000)
>>> tokenizer = SentencePiece.load('sentencepiece.model')
>>> tokenizer('Time is the most valuable thing a man can spend.')
['▁Time', '▁is', '▁the', '▁most', '▁valuable', '▁thing', '▁a', '▁man', '▁can', '▁spend', '.']
>>> tokenizer.tokenize('Time is the most valuable thing a man can spend.')
['▁Time', '▁is', '▁the', '▁most', '▁valuable', '▁thing', '▁a', '▁man', '▁can', '▁spend', '.']
>>> tokenizer.detokenize(['▁Time', '▁is', '▁the', '▁most', '▁valuable', '▁thing', '▁a', '▁man', '▁can', '▁spend', '.'])
Time is the most valuable thing a man can spend.

Moses tokenizer

>>> from prenlp.tokenizer import NLTKMosesTokenizer
>>> tokenizer = NLTKMosesTokenizer()
>>> tokenizer('Time is the most valuable thing a man can spend.')
['Time', 'is', 'the', 'most', 'valuable', 'thing', 'a', 'man', 'can', 'spend', '.']

Comparisons with tokenizers on IMDb

Below figure shows the classification accuracy from various tokenizer.

Comparisons with tokenizers on NSMC (Korean IMDb)

Below figure shows the classification accuracy from various tokenizer.

Author

Rate & Review

Great Documentation0
Easy to Use0
Performant0
Highly Customizable0
Bleeding Edge0
Responsive Maintainers0
Poor Documentation0
Hard to Use0
Slow0
Buggy0
Abandoned0
Unwelcoming Community0
100