TextDescriptives

A Python library for calculating a large variety of statistics from text(s) using spaCy v.3 pipeline components and extensions. TextDescriptives can be used to calculate several descriptive statistics, readability metrics, and metrics related to dependency distance.

🔧 Installation

pip install textdescriptives
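The usage examples below rely on spaCy's small English model, which is installed separately from TextDescriptives. If it is not already available, it can be downloaded with spaCy's own CLI (any spaCy pipeline that includes the components you need will work just as well):

python -m spacy download en_core_web_sm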

📰 News

  • TextDescriptives has been completely re-implemented using spaCy v.3.0. The stanza implementation can be found in the stanza_version branch and will no longer be maintained.
  • Check out the brand new documentation here!

👩‍💻 Usage

Import the library and add the component to your pipeline using the string name of the "textdescriptives" component factory:

import spacy
import textdescriptives as td
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives") 
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")

# access some of the values
doc._.readability
doc._.token_length
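Each of these extension attributes returns a plain dictionary, so individual metrics can be picked out directly. A minimal sketch, assuming the dictionary keys match the column names shown in the extract_df output below:

# the extension attributes are dicts mapping metric names to values
readability = doc._.readability
print(readability["flesch_reading_ease"])  # assumed key name, matching the extract_df column

# mean, median, and std of token length
print(doc._.token_length)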

TextDescriptives includes convenience functions for extracting metrics to a Pandas DataFrame or a dictionary.

td.extract_df(doc)
# td.extract_dict(doc)
Output for the single Doc (extract_df returns these metrics as the columns of a one-row DataFrame; shown here with one metric per row):

| Metric | Value |
| --- | --- |
| text | The world (...) |
| token_length_mean | 3.28571 |
| token_length_median | 3 |
| token_length_std | 1.54127 |
| sentence_length_mean | 7 |
| sentence_length_median | 6 |
| sentence_length_std | 3.09839 |
| syllables_per_token_mean | 1.08571 |
| syllables_per_token_median | 1 |
| syllables_per_token_std | 0.368117 |
| n_tokens | 35 |
| n_unique_tokens | 23 |
| proportion_unique_tokens | 0.657143 |
| n_characters | 121 |
| n_sentences | 5 |
| flesch_reading_ease | 107.879 |
| flesch_kincaid_grade | -0.0485714 |
| smog | 5.68392 |
| gunning_fog | 3.94286 |
| automated_readability_index | -2.45429 |
| coleman_liau_index | -0.708571 |
| lix | 12.7143 |
| rix | 0.4 |
| dependency_distance_mean | 1.69524 |
| dependency_distance_std | 0.422282 |
| prop_adjacent_dependency_relation_mean | 0.44381 |
| prop_adjacent_dependency_relation_std | 0.0863679 |
| pos_prop_DT | 0.097561 |
| pos_prop_NN | 0.121951 |
| pos_prop_VBZ | 0.0487805 |
| pos_prop_VBN | 0.0487805 |
| pos_prop_. | 0.121951 |
| pos_prop_PRP | 0.170732 |
| pos_prop_VBP | 0.121951 |
| pos_prop_IN | 0.121951 |
| pos_prop_RB | 0.0731707 |
| pos_prop_VBD | 0.0243902 |
| pos_prop_, | 0.0243902 |
| pos_prop_WP | 0.0243902 |

Set which group(s) of metrics to extract using the metrics parameter (one or more of readability, dependency_distance, descriptive_stats, and pos_stats; defaults to all).
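For example (a sketch: passing a list assumes that "one or more" means metrics accepts either a single string, as shown further down, or a list of group names):

# only extract readability metrics and descriptive statistics
td.extract_df(doc, metrics=["readability", "descriptive_stats"])

# a single group can also be passed as a string
td.extract_df(doc, metrics="readability")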

If extract_df is called on an object created with nlp.pipe, it will format the output with one row per document and a column for each metric. Similarly, extract_dict will have a key for each metric and a list of values (one per doc).

docs = nlp.pipe(['The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.',
            'He felt that his whole life was some kind of dream and he sometimes wondered whose it was and whether they were enjoying it.'])

td.extract_df(docs, metrics="dependency_distance")
| | text | dependency_distance_mean | dependency_distance_std | prop_adjacent_dependency_relation_mean | prop_adjacent_dependency_relation_std |
| --- | --- | --- | --- | --- | --- |
| 0 | The world (...) | 1.69524 | 0.422282 | 0.44381 | 0.0863679 |
| 1 | He felt (...) | 2.56 | 0 | 0.44 | 0 |

The text column can be excluded by setting include_text to False.
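Combining the two parameters from above, for example:

# dependency distance metrics only, without the raw text column
# (nlp.pipe returns a generator, so re-create `docs` if it has already been consumed)
td.extract_df(docs, metrics="dependency_distance", include_text=False)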

Using specific components

The specific components (descriptive_stats, readability, dependency_distance and pos_stats) can be loaded individually. This can be helpful if you're only interested in e.g. readability metrics or descriptive statistics and don't want to run the dependency parser or part-of-speech tagger.

nlp = spacy.blank("da")
nlp.add_pipe("descriptive_stats")
docs = nlp.pipe(['Da jeg var atten, tog jeg patent på ild. Det skulle senere vise sig at blive en meget indbringende forretning',
            "Spis skovsneglen, Mulle. Du vil jo gerne være med i hulen, ikk'?"])

# extract_df is clever enough to only extract metrics that are in the Doc
td.extract_df(docs, include_text = False)
Output (one column per Doc; extract_df returns these metrics as the columns of a two-row DataFrame):

| Metric | 0 | 1 |
| --- | --- | --- |
| token_length_mean | 4.4 | 4 |
| token_length_median | 3 | 3.5 |
| token_length_std | 2.59615 | 2.44949 |
| sentence_length_mean | 10 | 6 |
| sentence_length_median | 10 | 6 |
| sentence_length_std | 1 | 3 |
| syllables_per_token_mean | 1.65 | 1.58333 |
| syllables_per_token_median | 1 | 1 |
| syllables_per_token_std | 0.852936 | 0.862007 |
| n_tokens | 20 | 12 |
| n_unique_tokens | 19 | 12 |
| proportion_unique_tokens | 0.95 | 1 |
| n_characters | 90 | 53 |
| n_sentences | 2 | 2 |

Available attributes

The table below shows the metrics included in TextDescriptives and their attributes on spaCy's Doc, Span, and Token objects. For more information, see the docs.

| Attribute | Component | Description |
| --- | --- | --- |
| Doc._.token_length | descriptive_stats | Dict containing mean, median, and std of token length. |
| Doc._.sentence_length | descriptive_stats | Dict containing mean, median, and std of sentence length. |
| Doc._.syllables | descriptive_stats | Dict containing mean, median, and std of the number of syllables per token. |
| Doc._.counts | descriptive_stats | Dict containing the number of tokens, number of unique tokens, proportion of unique tokens, and number of characters in the Doc. |
| Doc._.pos_proportions | pos_stats | Dict of {pos_prop_POSTAG: proportion of all tokens tagged with POSTAG}. Does not create a key if no tokens in the document fit the POSTAG. |
| Doc._.readability | readability | Dict containing Flesch Reading Ease, Flesch-Kincaid Grade, SMOG, Gunning-Fog, Automated Readability Index, Coleman-Liau Index, LIX, and RIX readability metrics for the Doc. |
| Doc._.dependency_distance | dependency_distance | Dict containing the mean and standard deviation of the dependency distance and of the proportion of adjacent dependency relations in the Doc. |
| Span._.token_length | descriptive_stats | Dict containing mean, median, and std of token length in the span. |
| Span._.counts | descriptive_stats | Dict containing the number of tokens, number of unique tokens, proportion of unique tokens, and number of characters in the span. |
| Span._.pos_proportions | pos_stats | Dict of {pos_prop_POSTAG: proportion of all tokens tagged with POSTAG}. Does not create a key if no tokens in the span fit the POSTAG. |
| Span._.dependency_distance | dependency_distance | Dict containing the mean dependency distance and proportion of adjacent dependency relations in the span. |
| Token._.dependency_distance | dependency_distance | Dict containing the dependency distance and whether the head word is adjacent for the Token. |
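The Span and Token attributes work the same way as the Doc attributes. A minimal sketch of accessing them, assuming the full "textdescriptives" pipeline from the usage example above:

import spacy
import textdescriptives  # importing registers the pipeline components

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives")
doc = nlp("The world is changed. I feel it in the water.")

# Span-level metrics: slice the Doc and read the same extension attributes
span = doc[0:4]
print(span._.token_length)          # mean, median, and std of token length in the span
print(span._.dependency_distance)   # mean dependency distance within the span

# Token-level metrics: dependency distance for a single token
token = doc[1]
print(token._.dependency_distance)  # distance to the head and whether the head is adjacent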

Authors

Developed by Lasse Hansen (@HLasse) at the Center for Humanities Computing Aarhus

Collaborators:
