prenlp

Preprocessing Library for Natural Language Processing

by Allen Lee

Version 0.0.13

PreNLP

Installation

Requirements

  • Python >= 3.6
  • Mecab morphological analyzer for Korean
    sh scripts/install_mecab.sh
    # macOS users only: run the commands below before running the install_mecab.sh script.
    # export MACOSX_DEPLOYMENT_TARGET=10.10
    # CFLAGS='-stdlib=libc++' pip install konlpy
    
  • C++ Build tools for fastText

With pip

prenlp can be installed using pip as follows:

pip install prenlp
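
As a quick sanity check after installation, the sketch below reports whether the interpreter meets the Python >= 3.6 requirement and whether a package named prenlp is importable. The helper name check_environment is hypothetical and is not part of prenlp; it uses only the standard library.

```python
import importlib.util
import sys


def check_environment(package="prenlp", min_python=(3, 6)):
    """Report whether the Python version and the package look usable.

    Hypothetical helper for illustration; not part of prenlp.
    """
    return {
        # sys.version_info compares element-wise against a plain tuple.
        "python_ok": sys.version_info >= min_python,
        # find_spec returns None when a top-level package cannot be found.
        "package_installed": importlib.util.find_spec(package) is not None,
    }


print(check_environment())
```

This does not verify the Mecab or C++ build-tool requirements, which have to be checked separately.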

Usage

Data

Dataset Loading

Popular datasets for NLP tasks are provided in prenlp. All datasets are stored in the /.data directory.

  • Sentiment Analysis: IMDb, NSMC
  • Language Modeling: WikiText-2, WikiText-103, WikiText-ko, NamuWiki-ko

Dataset                  Language  Articles   Sentences   Tokens       Vocab      Size
WikiText-2               English   720        -           2,551,843    33,278     13.3MB
WikiText-103             English   28,595     -           103,690,236  267,735    517.4MB
WikiText-ko              Korean    477,946    2,333,930   131,184,780  662,949    667MB
NamuWiki-ko              Korean    661,032    16,288,639  715,535,778  1,130,008  3.3GB
WikiText-ko+NamuWiki-ko  Korean    1,138,978  18,622,569  846,720,558  1,360,538  3.95GB
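
To make the Articles, Tokens, and Vocab columns concrete, here is a minimal sketch of how such counts could be computed over a list of text lines. It rests on two assumptions that the source does not confirm: tokens are whitespace-separated, and an article starts at a top-level heading of the form '= Title ='. The function name corpus_stats and the toy lines are hypothetical, and the published figures above may have been produced differently.

```python
import re

# Assumption: a top-level heading looks like '= Title =', while
# sub-headings look like '= = Section = ='.
ARTICLE_HEADING = re.compile(r"^= [^=].* =$")


def corpus_stats(lines):
    """Count articles, whitespace tokens, and distinct tokens in `lines`."""
    articles = 0
    tokens = 0
    vocab = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if ARTICLE_HEADING.match(line):
            articles += 1
        words = line.split()
        tokens += len(words)
        vocab.update(words)
    return {"articles": articles, "tokens": tokens, "vocab": len(vocab)}


# Hypothetical toy corpus, not taken from any real dataset.
toy_lines = [
    "= Valkyria Chronicles III =",
    "",
    "The game is a tactical role playing game .",
    "= = Gameplay = =",
    "The game is played in turns .",
]
print(corpus_stats(toy_lines))
```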

General use cases are as follows:

WikiText-2 / WikiText-103
>>> import prenlp
>>> wikitext2 = prenlp.data.WikiText2()
>>> len(wikitext2)
3
>>> train, valid, test = prenlp.data.WikiText2()
>>> train[0]
'= Valkyria Chronicles III ='

IMDB
>>> imdb_train, imdb_test = prenlp.data.IMDB()
>>> imdb_train[0]
["Minor Spoilers<br /><br />Alison Parker (Cristina Raines) is a successful top model, living with the lawyer Michael Lerman (Chris Sarandon) in his apartment. She tried to commit ...", 'pos']
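
Since each IMDB sample shown above is a [text, label] pair, a common next step is to separate texts from labels and inspect the label balance. The sketch below does this on hypothetical toy samples of the same shape; the helper name split_samples and the 'neg' label value are assumptions, as only 'pos' appears in the example output.

```python
from collections import Counter


def split_samples(samples):
    """Split [text, label] pairs into parallel lists of texts and labels."""
    texts = [text for text, _ in samples]
    labels = [label for _, label in samples]
    return texts, labels


# Hypothetical toy samples in the same [text, label] shape as above.
toy_samples = [
    ["A moving film.", "pos"],
    ["Dull and far too long.", "neg"],
    ["Great cast.", "pos"],
]

texts, labels = split_samples(toy_samples)
label_counts = Counter(labels)
print(label_counts)
```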

Normalization

Frequently used normalization functions for text pre-processing are provided in prenlp.

Supported patterns include URLs, HTML tags, emoticons, email addresses, phone numbers, and image filenames.

General use cases are as follows:

>>> from prenlp.data import Normalizer
>>> normalizer = Normalizer(url_repl='[URL]', tag_repl='[TAG]', emoji_repl='[EMOJI]', email_repl='[EMAIL]', tel_repl='[TEL]', image_repl='[IMG]')

>>> normalizer.normalize('Visit this link for more details: https://github.com/')
'Visit this link for more details: [URL]'

>>> normalizer.normalize('Use HTML with the desired attributes: <img src="cat.jpg" height="100" />')
'Use HTML with the desired attributes: [TAG]'

>>> normalizer.normalize('Hello 🤩, I love you 💓 !')
'Hello [EMOJI], I love you [EMOJI] !'

>>> normalizer.normalize('Contact me at lyeoni.g@gmail.com')
'Contact me at [EMAIL]'

>>> normalizer.normalize('Call +82 10-1234-5678')
'Call [TEL]'

>>> normalizer.normalize('Download our logo image, logo123.png, with transparent background.')
'Download our logo image, [IMG], with transparent background.'
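
The general idea behind this kind of normalization can be sketched with plain regular expressions: match a pattern and substitute a placeholder token. The version below covers only URLs and email addresses with deliberately simple patterns. It is an illustration, not prenlp's implementation, and the name simple_normalize is hypothetical.

```python
import re

# Deliberately simple patterns for illustration; real-world URL and email
# matching has many more edge cases (for example, trailing punctuation
# directly after a URL would be swallowed by \S+).
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def simple_normalize(text, url_repl="[URL]", email_repl="[EMAIL]"):
    """Replace URLs and email addresses with placeholder tokens."""
    text = URL_RE.sub(url_repl, text)
    text = EMAIL_RE.sub(email_repl, text)
    return text


print(simple_normalize("Visit this link for more details: https://github.com/"))
print(simple_normalize("Contact me at lyeoni.g@gmail.com"))
```

Applying the URL pattern before the email pattern keeps an address embedded in a URL from being replaced twice.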

Tokenizer

Frequently used (subword) tokenizers for text pre-processing are provided in prenlp.

SentencePiece, NLTKMosesTokenizer, Mecab

SentencePiece

>>> from prenlp.tokenizer import SentencePiece
>>> SentencePiece.train(input='corpus.txt', model_prefix='sentencepiece', vocab_size=10000)
>>> tokenizer = SentencePiece.load('sentencepiece.model')
>>> tokenizer('Time is the most valuable thing a man can spend.')
['▁Time', '▁is', '▁the', '▁most', '▁valuable', '▁thing', '▁a', '▁man', '▁can', '▁spend', '.']
>>> tokenizer.tokenize('Time is the most valuable thing a man can spend.')
['▁Time', '▁is', '▁the', '▁most', '▁valuable', '▁thing', '▁a', '▁man', '▁can', '▁spend', '.']
>>> tokenizer.detokenize(['▁Time', '▁is', '▁the', '▁most', '▁valuable', '▁thing', '▁a', '▁man', '▁can', '▁spend', '.'])
'Time is the most valuable thing a man can spend.'
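
The '▁' character (U+2581) in the pieces above marks where a word begins, which is what makes the tokenization reversible. A minimal sketch of that reversal, assuming the default SentencePiece convention and using a hypothetical helper name join_pieces:

```python
WORD_BOUNDARY = "\u2581"  # the '▁' marker that prefixes word-initial pieces


def join_pieces(pieces):
    """Concatenate pieces and turn boundary markers back into spaces."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()


pieces = ["▁Time", "▁is", "▁the", "▁most", "▁valuable", "▁thing",
          "▁a", "▁man", "▁can", "▁spend", "."]
print(join_pieces(pieces))
```

In practice tokenizer.detokenize should be preferred, since it relies on the trained model rather than on this string convention.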

Moses tokenizer

>>> from prenlp.tokenizer import NLTKMosesTokenizer
>>> tokenizer = NLTKMosesTokenizer()
>>> tokenizer('Time is the most valuable thing a man can spend.')
['Time', 'is', 'the', 'most', 'valuable', 'thing', 'a', 'man', 'can', 'spend', '.']
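
For comparison, a rough word-and-punctuation split can be written with a single regular expression. This is only an approximation for simple sentences and is not the Moses algorithm, which also handles contractions, abbreviations, and escaping. The name simple_tokenize is hypothetical.

```python
import re

# Runs of word characters, or any single character that is neither a word
# character nor whitespace (i.e. punctuation and symbols).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word tokens and standalone punctuation tokens."""
    return TOKEN_RE.findall(text)


print(simple_tokenize("Time is the most valuable thing a man can spend."))
```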

Comparison of tokenizers on IMDb

The figure below shows the classification accuracy for various tokenizers.

Comparison of tokenizers on NSMC (Korean IMDb)

The figure below shows the classification accuracy for various tokenizers.

Author
