Natural Language Toolkit for Bahasa Malaysia
============================================
.. raw:: html

    <p align="center">
        <a href="#readme">
            <img alt="logo" width="40%" src="">
        </a>
    </p>
    <p align="center">
        <a href=""><img alt="Pypi version" src=""></a>
        <a href=""><img alt="Python3 version" src=""></a>
        <a href=""><img alt="MIT License" src=""></a>
        <a href=""><img alt="Documentation" src=""></a>
        <a href=""><img alt="total stats" src=""></a>
        <a href=""><img alt="download stats / month" src=""></a>
        <a href=""><img alt="discord" src=""></a>
    </p>

Malaya is a Natural Language Toolkit library for Bahasa Malaysia, powered by Tensorflow deep learning.


Proper documentation is available at

Installing from the PyPI
------------------------

CPU version::

    $ pip install malaya

GPU version::

    $ pip install malaya[gpu]

Only Python 3.6.0 and above and Tensorflow 1.15.0 and above are supported.

We recommend using a virtual environment for development. All examples are tested on Tensorflow versions 1.15.4, 2.4.1 and 2.5.
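The virtual environment setup can be sketched as follows (the directory name ``malaya-env`` is arbitrary, and the install commands mirror the ones above):

```shell
# Create an isolated environment for Malaya
# (assumes python3 is on PATH; "malaya-env" is an arbitrary name)
python3 -m venv malaya-env
. malaya-env/bin/activate

# Then install inside the environment, e.g.:
#   pip install malaya        # CPU
#   pip install malaya[gpu]   # GPU
python -c "import sys; print(sys.prefix)"
```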

Features
--------

  • Augmentation, augment any text using dictionary of synonym, Wordvector or Transformer-Bahasa.
  • Constituency Parsing, breaking a text into sub-phrases using finetuned Transformer-Bahasa.
  • Coreference Resolution, finding all expressions that refer to the same entity in a text using Dependency Parsing models.
  • Dependency Parsing, extracting a dependency parse of a sentence using finetuned Transformer-Bahasa.
  • Emotion Analysis, detect and recognize 6 different emotions of texts using finetuned Transformer-Bahasa.
  • Entities Recognition, locate and classify named entities mentioned in text using finetuned Transformer-Bahasa.
  • Generator, generate any texts given a context using T5-Bahasa, GPT2-Bahasa or Transformer-Bahasa.
  • Keyword Extraction, provide RAKE, TextRank and Attention Mechanism hybrid with Transformer-Bahasa.
  • Knowledge Graph, generate Knowledge Graph using T5-Bahasa or parse from Dependency Parsing models.
  • Language Detection, using fastText and a sparse deep learning model to classify Malay (formal and social media), Indonesian (formal and social media), Rojak language and Manglish.
  • Normalizer, combining local Malaysian NLP research with Transformer-Bahasa to normalize any Bahasa text.
  • Num2Word, convert from numbers to cardinal or ordinal representation.
  • Paraphrase, provide Abstractive Paraphrase using T5-Bahasa and Transformer-Bahasa.
  • Part-of-Speech Recognition, grammatical tagging of words in a text using finetuned Transformer-Bahasa.
  • Question Answer, reading comprehension using finetuned Transformer-Bahasa.
  • Relevancy Analysis, detect and recognize relevancy of texts using finetuned Transformer-Bahasa.
  • Sentiment Analysis, detect and recognize polarity of texts using finetuned Transformer-Bahasa.
  • Text Similarity, provide an interface for lexical similarity and deep semantic similarity using finetuned Transformer-Bahasa.
  • Spell Correction, combining local Malaysian NLP research with Transformer-Bahasa to auto-correct any Bahasa words, plus NeuSpell using T5-Bahasa.
  • Stemmer, using a BPE LSTM Seq2Seq model with attention to do state-of-the-art Bahasa stemming.
  • Subjectivity Analysis, detect and recognize self-opinion polarity of texts using finetuned Transformer-Bahasa.
  • Kesalahan Tatabahasa (grammatical errors), fix grammatical errors using TransformerTag-Bahasa.
  • Summarization, provide abstractive summarization using T5-Bahasa and an extractive interface using Transformer-Bahasa, skip-thought and Doc2Vec.
  • Topic Modelling, provide Transformer-Bahasa, LDA2Vec, LDA, NMF and LSA interface for easy topic modelling with topics visualization.
  • Toxicity Analysis, detect and recognize 27 different toxicity patterns of texts using finetuned Transformer-Bahasa.
  • Transformer, provide an easy interface to load Malaya pretrained language models.
  • Translation, provide Neural Machine Translation using Transformer for EN to MS and MS to EN.
  • Word2Num, convert from cardinal or ordinal representation to numbers.
  • Word2Vec, provide pretrained bahasa wikipedia and bahasa news Word2Vec, with easy interface and visualization.
  • Zero-shot classification, provide Zero-shot classification interface using Transformer-Bahasa to recognize texts without any labeled training data.
  • Hybrid 8-bit Quantization, provide hybrid 8-bit quantization for all models to reduce inference time up to 2x and model size up to 4x.
  • Longer Sequences Transformer, provide BigBird + Pegasus for longer sequences in Abstractive Summarization, Neural Machine Translation and Relevancy Analysis.
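To give a flavor of the Num2Word and Word2Num features above, here is a minimal pure-Python sketch for Malay cardinal numbers from 0 to 99. This is not Malaya's implementation; the library handles a much wider range (ordinals, large numbers, and so on).

```python
# Minimal illustration of the Num2Word / Word2Num idea.
# NOT Malaya's implementation; only covers Malay cardinals 0-99.

UNITS = ["kosong", "satu", "dua", "tiga", "empat", "lima",
         "enam", "tujuh", "lapan", "sembilan"]

def num2word(n: int) -> str:
    """Convert an integer in [0, 99] to its Malay cardinal form."""
    if n < 10:
        return UNITS[n]
    if n == 10:
        return "sepuluh"
    if n == 11:
        return "sebelas"
    if n < 20:                      # 12-19: "dua belas", ..., "sembilan belas"
        return UNITS[n - 10] + " belas"
    tens, rest = divmod(n, 10)      # 20-99: "<tens> puluh [<units>]"
    word = UNITS[tens] + " puluh"
    return word if rest == 0 else word + " " + UNITS[rest]

# Word2Num as the inverse lookup, built from num2word
WORD2NUM = {num2word(i): i for i in range(100)}

def word2num(words: str) -> int:
    return WORD2NUM[words.strip().lower()]

print(num2word(21))                 # -> dua puluh satu
print(word2num("sembilan belas"))   # -> 19
```

A round trip ``word2num(num2word(n)) == n`` holds for every number in range, which is the basic contract the two library features share.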

Pretrained Models
-----------------

Malaya also releases Bahasa pretrained models; simply check `Malaya/pretrained-model <>`_.


If you use our software for research, please cite::

    @misc{Malaya, Natural-Language-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow,
      author = {Husein, Zolkepli},
      title = {Malaya},
      year = {2018},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{}}
    }


Thanks to `KeyReply <>`_ for sponsoring private cloud to train Malaya models; without it, this library will collapse entirely.

.. raw:: html

    <a href="#readme">
        <img alt="logo" width="20%" src="">
    </a>

Also, thanks to `Tensorflow Research Cloud <>`_ for free TPU access.

.. raw:: html

    <a href="">
        <img alt="logo" width="20%" src="">
    </a>


Thank you for contributing to this library, it really helps a lot. Feel free to contact me with any suggestions, and contributions of any kind are welcome, not just code!

.. raw:: html

    <a href="#readme">
        <img alt="logo" width="30%" src="">
    </a>



