
tiny-tokenizer

pip install tiny-tokenizer

19 Versions

3.4.0

2 years ago

3.3.0

2 years ago

3.2.0

2 years ago

3.1.0

3 years ago
  • Use Poetry for development #53
  • Support Janome, a pure-Python morphological analyzer #57 (see the sketch below)
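
Janome ships as a pure-Python package, so it works without installing MeCab's C library. A minimal sketch, assuming the same WordTokenizer interface shown in the other examples on this page:

from tiny_tokenizer import WordTokenizer

# Janome is pure Python: no external analyzer binary is required.
tokenizer = WordTokenizer("Janome")
print(tokenizer.tokenize("我輩は猫である"))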

3.0.2

3 years ago
  • Support the system dictionary in MeCab #42
  • Support a custom model in KyTea #49 (both sketched below)
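
Both features are configured through the WordTokenizer constructor. A minimal sketch; the keyword names system_dictionary_path and model_path are assumptions borrowed from later releases, so check the signature of your installed version:

from tiny_tokenizer import WordTokenizer

# Keyword names below are assumptions; verify against the 3.0.2 API.
mecab = WordTokenizer("MeCab", system_dictionary_path="/path/to/dic")  # #42
kytea = WordTokenizer("KyTea", model_path="/path/to/model.knm")        # #49

print(mecab.tokenize("我輩は猫である"))
print(kytea.tokenize("我輩は猫である"))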

3.0.1

3 years ago
  • #41 Add whitespace tokenizer
In [1]: from tiny_tokenizer import WordTokenizer
In [2]: tk = WordTokenizer("whitespace")
In [3]: tk.tokenize("わたし は 猫")
Out[3]: [わたし, は, 猫]

3.0.0

3 years ago
  • #29, #37 Support detailed Token attributes (thanks @chie8842 @jun-harashima @ysak-y); see the sketch after this list
  • #35 Add extras_require (thanks @chie8842)
  • #39 Support Python 3.5
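
With a backend that provides part-of-speech information, each returned Token carries more than its surface string. A minimal sketch; the with_postag flag and the exact attribute names are assumptions inferred from the PR titles:

from tiny_tokenizer import WordTokenizer

# with_postag and the attribute names are assumptions; adjust if the
# 3.0.0 API differs.
tokenizer = WordTokenizer("MeCab", with_postag=True)
for token in tokenizer.tokenize("我輩は猫である"):
    print(token.surface, token.postag)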

2.1.0

3 years ago
from tiny_tokenizer import SentenceTokenizer
from tiny_tokenizer import WordTokenizer


if __name__ == "__main__":
    sentence_tokenizer = SentenceTokenizer()
    # mode selects the Sudachi splitting mode ("A", "B", or "C"); see
    # https://github.com/WorksApplications/SudachiPy#as-a-python-package
    tokenizer = WordTokenizer(tokenizer="Sudachi", mode="A")

    sentence = "我輩は猫である."
    print("input: ", sentence)

    print(sentence_tokenizer.tokenize(sentence))  # sentence splitting
    print(tokenizer.tokenize(sentence))           # word tokenization

2.0.0

3 years ago

This release breaks backward compatibility.

It introduces the Token class (see the sketch below).
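
A minimal sketch of the migration this implies, assuming tokenize now returns Token objects whose string form is exposed as a surface attribute (as in the 3.x examples above):

from tiny_tokenizer import WordTokenizer

tokenizer = WordTokenizer("MeCab")
tokens = tokenizer.tokenize("我輩は猫である")

# Pre-2.0.0 callers got plain strings; with the Token class they should
# read token.surface instead. (surface is an assumption from the 3.x API.)
print([token.surface for token in tokens])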


1.3.1

3 years ago

tiny_tokenizer can now be installed without any word-tokenizer backends (see below).
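
In practice this means the base package installs with no analyzer dependencies; the backend named in the second line is illustrative, install whichever you need:

pip install tiny-tokenizer   # core only, no word tokenizers
pip install janome           # then add an analyzer backend of your choice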


1.3.0

3 years ago

1.2.1

3 years ago

1.2.0

3 years ago

Support character- and subword-level tokenization (see the sketch below).
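
A minimal sketch of both granularities; the tokenizer names ("Character", "Sentencepiece") and the model_path keyword are assumptions carried back from the later 3.x API:

from tiny_tokenizer import WordTokenizer

# Names and keywords below are assumptions; check the 1.2.0 API.
char_tok = WordTokenizer("Character")
print(char_tok.tokenize("我輩は猫である"))  # one token per character

subword_tok = WordTokenizer("Sentencepiece", model_path="model.spm")
print(subword_tok.tokenize("我輩は猫である"))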


1.1.0

4 years ago
  • Add a Dockerfile
  • Add docstrings
  • Update the example

1.0.4

4 years ago

1.0.3

4 years ago

1.0.2

4 years ago

1.0.1

4 years ago

1.0

4 years ago
