A Python module for English lemmatization and inflection.
LemmInflect uses a dictionary approach to lemmatize English words and inflect them into forms specified by a user-supplied Universal Dependencies or Penn Treebank tag. The library handles out-of-vocabulary (OOV) words by using neural networks to classify word forms and choose the appropriate morphing rules.
The system acts as a standalone module or as an extension to the spaCy NLP system.
The dictionary and morphology rules are derived from the NIH's SPECIALIST Lexicon, which contains an extensive set of information on English word forms.
A simpler, inflection-only system is available as pyInflect. LemmInflect was created to address some of the shortcomings of that project and add features, such as...
For the latest documentation, see ReadTheDocs.
The accuracy of LemmInflect and several other popular NLP utilities was tested using the Automatically Generated Inflection Database (AGID) as a baseline. The AGID has an extensive list of lemmas and their corresponding inflections. Each inflection was lemmatized by the test software and then compared to the original value in the corpus. The test included 119,194 different inflected words.
| Package          | Verb  | Noun  | ADJ/ADV | Overall | Speed   |
|------------------|-------|-------|---------|---------|---------|
| LemmInflect      | 96.1% | 95.4% | 93.9%   | 95.6%   | 42.0 µs |
| CLiPS/pattern.en | 93.6% | 91.1% | 0.0%    | n/a     | 3.0 µs  |
| Stanford CoreNLP | 87.6% | 93.1% | 0.0%    | n/a     | n/a     |
| spaCy            | 79.4% | 88.9% | 60.5%   | 84.7%   | 5.0 µs  |
| NLTK             | 53.3% | 52.2% | 53.3%   | 52.6%   | 13.0 µs |
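The scoring procedure described above (lemmatize each inflection, then compare the result with the original lemma) can be sketched roughly as follows. This is a minimal illustration and not the project's actual test harness: `score_lemmatizer`, `naive_lemmatizer`, and the four-entry `sample` are hypothetical names and data invented for this example, and the real AGID file format is not shown.

```python
from collections import defaultdict

def score_lemmatizer(lemmatize, entries):
    """Return per-category and overall accuracy for a lemmatizer.

    entries: iterable of (lemma, inflection, category) triples.
    lemmatize: callable taking (word, category) and returning a lemma string.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lemma, inflection, category in entries:
        total[category] += 1
        # A hit means the tool recovered the original lemma exactly.
        if lemmatize(inflection, category) == lemma:
            correct[category] += 1
    scores = {cat: correct[cat] / total[cat] for cat in total}
    scores["Overall"] = sum(correct.values()) / sum(total.values())
    return scores

# Hypothetical stand-in for the software under test: strips a final "s".
def naive_lemmatizer(word, category):
    return word[:-1] if word.endswith("s") else word

# Hypothetical sample standing in for the AGID data.
sample = [
    ("watch", "watches", "Verb"),
    ("run", "runs", "Verb"),
    ("cat", "cats", "Noun"),
    ("mouse", "mice", "Noun"),
]

scores = score_lemmatizer(naive_lemmatizer, sample)
```

Any lemmatizer can be scored this way by wrapping it in a callable with the same `(word, category)` signature.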
The only external requirement to run LemmInflect is numpy, which is used for the matrix math that drives the neural nets. These nets are relatively small and don't require significant CPU power to run.
To install, run:
pip3 install lemminflect
The project was built and tested under Python 3 on Ubuntu but should run on Linux, Windows, macOS, and similar systems. It is untested under Python 2 but may function in that environment with minimal or no changes.
The code base also includes library functions and scripts to create the various data files and neural nets. This includes such things as...
None of these are required for run-time operation. However, if you want to modify the system, see the documentation for more info.
To lemmatize a word, use the method getLemma(). This takes a word and a Universal Dependencies tag and returns a tuple of possible lemma spellings. The dictionary system is used first, and if no lemma is found, the rules system is employed.
> from lemminflect import getLemma
> getLemma('watches', upos='VERB')
('watch',)
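The "dictionary first, rules second" order can be pictured with a toy example. This is a sketch of the general idea only, not LemmInflect's implementation: `LEMMA_DICT`, `SUFFIX_RULES`, and `toy_get_lemma` are hypothetical, the entries are invented, and the real library selects its rules with a neural network rather than a fixed suffix list.

```python
# Hypothetical dictionary: (inflected form, UPOS tag) -> known lemma spellings.
LEMMA_DICT = {
    ("ran", "VERB"): ("run",),
    ("mice", "NOUN"): ("mouse",),
}

# Hypothetical suffix rules, tried in order: (suffix to strip, replacement).
SUFFIX_RULES = {
    "VERB": [("ies", "y"), ("ches", "ch"), ("ing", ""), ("ed", ""), ("s", "")],
    "NOUN": [("ies", "y"), ("ches", "ch"), ("s", "")],
}

def toy_get_lemma(word, upos):
    """Look the word up first; fall back to suffix rules if it is unknown."""
    found = LEMMA_DICT.get((word, upos))
    if found is not None:
        return found
    for suffix, replacement in SUFFIX_RULES.get(upos, []):
        # Require a non-empty stem so a word equal to a suffix is left alone.
        if word.endswith(suffix) and len(word) > len(suffix):
            return (word[: -len(suffix)] + replacement,)
    return (word,)  # no rule applied: return the word unchanged
```

Like getLemma(), the toy function returns a tuple, which leaves room for more than one spelling per lemma.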
To inflect words, use the method getInflection. This takes a lemma and a Penn Treebank tag and returns a tuple of the specific inflection(s) associated with that tag. As with lemmatization, the dictionary is used first and then inflection rules are applied if needed.
> from lemminflect import getInflection
> getInflection('watch', tag='VBD')
('watched',)
> getInflection('xxwatch', tag='VBD')
('xxwatched',)
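To show why an out-of-vocabulary lemma such as 'xxwatch' can still be inflected, here is a toy version of tag-driven fallback rules. It is an assumption-laden sketch, not the library's code: `INFLECTION_DICT` and `toy_get_inflection` are hypothetical, the rules are deliberately incomplete (for example, they mishandle lemmas ending in "y"), and only a few Penn Treebank tags are covered.

```python
# Hypothetical irregular-form dictionary: (lemma, Penn Treebank tag) -> forms.
INFLECTION_DICT = {
    ("run", "VBD"): ("ran",),
    ("mouse", "NNS"): ("mice",),
}

def toy_get_inflection(lemma, tag):
    """Dictionary lookup first, then regular rules keyed on the tag."""
    found = INFLECTION_DICT.get((lemma, tag))
    if found is not None:
        return found
    if tag in ("VBD", "VBN"):
        # Past tense / past participle: add "d" after a final "e", else "ed".
        return (lemma + ("d" if lemma.endswith("e") else "ed"),)
    if tag in ("VBZ", "NNS"):
        # Third-person singular / plural noun: "es" after a sibilant ending.
        sibilant = lemma.endswith(("s", "x", "z", "ch", "sh"))
        return (lemma + ("es" if sibilant else "s"),)
    # Unhandled tags return the lemma unchanged in this sketch.
    return (lemma,)
```

Because the rules look only at the tag and the word's ending, they apply equally to known and unknown words.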
To use as an extension, you need spaCy version 2.0 or later. Versions 1.9 and earlier do not support the extension methods used here.
To set up the extension, first import lemminflect. This will create new lemma and inflect methods for each spaCy Token. The methods operate similarly to the methods described above, except that they return a string containing the most common spelling rather than a tuple.
> import spacy
> import lemminflect
> nlp = spacy.load('en_core_web_sm')
> doc = nlp('I am testing this example.')
> doc[2]._.lemma()
test
> doc[4]._.inflect('NNS')
examples
If you find a bug, please report it on the GitHub issues list. However, be aware that when it comes to returning the correct inflection, a number of different types of issues can arise. Some of these are not readily fixable. Issues with inflected forms include...
One common issue is that some forms of the verb "be" are not completely specified by the treebank tag. For instance, be/VBD inflects to either "was" or "were" and be/VBP inflects to either "am" or "are". In order to disambiguate these forms, other words in the sentence need to be inspected. At this time, LemmInflect doesn't include this functionality.
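Callers who need a single form can apply their own disambiguation. The sketch below shows one possible approach, assuming the caller already knows the subject of the verb. `choose_be_form`, `BE_FORMS`, and `PLURAL_OR_YOU` are hypothetical names, not part of LemmInflect, and the heuristic only handles pronoun subjects.

```python
# Forms of "be" that a Penn Treebank tag alone leaves ambiguous.
BE_FORMS = {
    "VBD": {"singular": "was", "plural": "were"},
    "VBP": {"first_singular": "am", "other": "are"},
}

# Pronoun subjects that take "were" in the past tense.
PLURAL_OR_YOU = {"you", "we", "they"}

def choose_be_form(tag, subject):
    """Pick a form of "be" for the given tag based on the subject word.

    Hypothetical helper: treats any subject other than "you", "we", "they"
    as singular, which is wrong for plural nouns such as "dogs".
    """
    subject = subject.lower()
    if tag == "VBD":
        key = "plural" if subject in PLURAL_OR_YOU else "singular"
        return BE_FORMS["VBD"][key]
    if tag == "VBP":
        # VBP is non-third-person present, so only "I" selects "am".
        key = "first_singular" if subject == "i" else "other"
        return BE_FORMS["VBP"][key]
    raise ValueError(f"tag {tag!r} is not ambiguous for 'be' in this sketch")
```

A fuller solution would find the subject with a dependency parse and check its number, which is the kind of sentence-level inspection described above.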