
textpack

Group thousands of similar spreadsheet or database text entries in seconds


What is this?

TextPack efficiently groups similar values in large (or small) datasets. Under the hood, it builds a document-term matrix of n-grams assigned TF-IDF scores, then uses matrix multiplication to calculate the cosine similarity between those values. For a technical explanation, I wrote a blog post.
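To make the idea concrete, here is a minimal pure-Python sketch of n-gram cosine similarity. It is an illustration of the concept only, not TextPack's implementation: it uses raw n-gram counts rather than TF-IDF weights, and compares one pair of strings at a time instead of multiplying a whole matrix.

```python
from collections import Counter
import math

def ngrams(text, n=3):
    """Split a string into overlapping character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors."""
    va, vb = Counter(ngrams(a)), Counter(ngrams(b))
    dot = sum(va[g] * vb[g] for g in set(va) & set(vb))
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

cosine_similarity("John F. Doe", "Doe, John F")  # high: many shared n-grams
cosine_similarity("John F. Doe", "Luke Whyte")   # low: no shared n-grams
```

Because n-grams ignore word order, "John F. Doe" and "Doe, John F" share most of their n-grams and score well above the default 0.75-style thresholds, while unrelated names score near zero.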

Why do I care?

If you're an analyst, journalist, data scientist, or similar and have ever had a spreadsheet, SQL table, or JSON string filled with inconsistent inputs like this:

row  fullname
1    John F. Doe
2    Esquivel, Mara
3    Doe, John F
4    Whyte, Luke
5    Doe, John Francis

And you want to perform some kind of analysis – perhaps in a Pivot Table or a Group By statement – but are hindered by the deviations in spelling and formatting, you can use TextPack to comb thousands of cells in seconds and create a third column like this:

row  fullname           name_groups
1    John F. Doe        Doe John F
2    Esquivel, Mara     Esquivel Mara
3    Doe, John F        Doe John F
4    Whyte, Luke        Whyte Luke
5    Doe, John Francis  Doe John F

We can then group by name_groups and perform our analysis.

You can also group across multiple columns. For instance, given the following:

row  make    model
1    Toyota  Camry
2    toyta   camry DXV
3    Ford    F-150
4    Toyota  Tundra
5    Honda   Accord

You can group across make and model to create:

row  make    model      car_groups
1    Toyota  Camry      toyotacamry
2    toyta   camry DXV  toyotacamry
3    Ford    F-150      fordf150
4    Toyota  Tundra     toyotatundra
5    Honda   Accord     hondaaccord

How do I use it?

Installation

pip install textpack

Import module

from textpack import tp

Instantiate TextPack

tp.TextPack(df, columns_to_group, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3)

Class parameters:

  • df (required): A Pandas DataFrame containing the dataset to group
  • columns_to_group (required): A list or string matching the column header(s) you'd like to parse and group
  • match_threshold (optional): A floating-point number between 0 and 1 representing the cosine similarity threshold used to determine whether two strings should be grouped. The closer the threshold is to 1, the more similar two strings must be to count as a match.
  • ngram_remove (optional): A regular expression used to filter characters out of your strings when building n-grams.
  • ngram_length (optional): The length of the n-grams. This can be tuned in tandem with match_threshold to find the sweet spot for grouping your dataset. If TextPack is running slowly, it's usually a sign to consider raising the n-gram length.
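The interplay between ngram_remove and ngram_length can be sketched in a few lines. This is a simplified stand-in for what those two parameters control, not TextPack's internal code: the regex strips punctuation before the string is cut into overlapping character n-grams.

```python
import re

def build_ngrams(text, ngram_remove=r'[,-./]', ngram_length=3):
    """Strip characters matched by ngram_remove, then split the
    result into overlapping character n-grams of ngram_length."""
    cleaned = re.sub(ngram_remove, '', text.lower())
    return [cleaned[i:i + ngram_length]
            for i in range(len(cleaned) - ngram_length + 1)]

build_ngrams('Doe, John F.')
# ['doe', 'oe ', 'e j', ' jo', 'joh', 'ohn', 'hn ', 'n f']
```

Note that the default pattern `[,-./]` is a character class covering the range `,` through `.` plus `/`, so commas, hyphens, periods, and slashes are all removed before n-grams are built; that is why "Doe, John F." and "Doe John F" produce identical n-grams.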

TextPack can also be instantiated using the following helpers, each of which is a thin wrapper that converts a data format to a Pandas DataFrame and passes it to TextPack. They all require a file path and columns_to_group, and take the same optional parameters as calling TextPack directly.

tp.read_csv(csv_path, columns_to_group, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3)
tp.read_excel(excel_path, columns_to_group, sheet_name=None, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3)
tp.read_json(json_path, columns_to_group, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3)

Run TextPack and group values

TextPack objects have the following public properties:

  • df: The dataframe used internally by TextPack – manipulate as you see fit
  • group_lookup: A Python dictionary built by build_group_lookup and then used by add_grouped_column_to_data to lookup each value that has a group. It looks like this:
{ 
    'John F. Doe': 'Doe John F',
    'Doe, John F': 'Doe John F',
    'Doe, John Francis': 'Doe John F'
}
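To show how such a lookup turns a messy column into a grouped one, here is a hedged pure-Python sketch (TextPack itself does this with vectorized DataFrame operations). The assumption that ungrouped values fall back to themselves mirrors the behavior described above, where only values that have a group appear in group_lookup.

```python
group_lookup = {
    'John F. Doe': 'Doe John F',
    'Doe, John F': 'Doe John F',
    'Doe, John Francis': 'Doe John F',
}

fullnames = ['John F. Doe', 'Esquivel, Mara', 'Doe, John F']

# Map each value to its group; values without a group keep their
# original spelling.
name_groups = [group_lookup.get(name, name) for name in fullnames]
# ['Doe John F', 'Esquivel, Mara', 'Doe John F']
```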

TextPack objects also have the following public methods:

  • build_group_lookup(): Runs the cosine similarity analysis and builds group_lookup.

  • add_grouped_column_to_data(column_name='Group'): Uses vectorization to map values to groups via group_lookup and add the new column to the DataFrame. The column header can be set via column_name.

  • set_match_threshold(match_threshold): Modify the match threshold internally.

  • set_ngram_remove(ngram_remove): Modify the n-gram regex filter internally.

  • set_ngram_length(ngram_length): Modify the n-gram length internally.

  • run(column_name='Group'): A helper function that calls build_group_lookup followed by add_grouped_column_to_data.

Export our grouped dataset

  • export_json(export_path): Export the grouped dataset as a JSON file

  • export_csv(export_path): Export the grouped dataset as a CSV file

A simple example

from textpack import tp

cars = tp.read_csv('./cars.csv', ['make', 'model'], match_threshold=0.8, ngram_length=5)

cars.run()

cars.export_csv('./cars-grouped.csv')

Troubleshooting

I'm getting a Memory Error!

Some users have triggered memory errors when parsing large datasets. This StackOverflow post has proved useful.

How does it work?

As mentioned above, under the hood we build a document-term matrix of n-grams assigned TF-IDF scores. We then use matrix multiplication to quickly calculate the cosine similarity between those values.
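The matrix-multiplication trick can be sketched as follows: once every row vector is L2-normalized, the product of the matrix with its own transpose yields every pairwise cosine similarity in one pass. This is a conceptual illustration with plain Python lists; the real implementation operates on a large sparse TF-IDF matrix for speed.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def pairwise_cosine(matrix):
    """With L2-normalized rows, M @ M.T is the full matrix of
    pairwise cosine similarities."""
    rows = [normalize(r) for r in matrix]
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows]
            for r1 in rows]

sims = pairwise_cosine([[1.0, 0.0, 1.0],   # row 0
                        [1.0, 0.0, 0.9],   # row 1: nearly identical to row 0
                        [0.0, 1.0, 0.0]])  # row 2: orthogonal to both
# Diagonal entries are 1.0 (each row vs. itself); sims[0][1] is close
# to 1.0, while sims[0][2] is 0.0.
```

Computing all similarities as one matrix product is what lets TextPack compare thousands of cells in seconds, instead of looping over every pair of strings.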

I wrote this blog post to explain how TextPack works behind the scenes. Check it out!
