A dataclass container with multi-indexing and bulk operations. It provides the typing benefits and ergonomics of dataclasses with the efficiency of Pandas DataFrames.
The container follows data-oriented design: it optimises the memory layout of the stored data, which gives fast bulk operations and a smaller memory footprint for large collections. Bulk operations are provided by Pandas, which has a rich set of vectorised methods for both numerical and string data types.
Multi-indexing provides the ability to use multiple fields as keys to index the records. This suits bidirectional and inverse dictionaries, where a record needs to be looked up by any one of its keys.
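To make the idea of multi-key lookup concrete, here is a minimal pure-Python sketch. It is not how dataclassframe is implemented; `MultiKeyIndex`, `Country` and the `get` method are hypothetical names used only for illustration, and the sketch assumes each key field holds unique values.

```python
from dataclasses import dataclass
from typing import Any, Dict, Generic, List, Sequence, TypeVar

T = TypeVar("T")


class MultiKeyIndex(Generic[T]):
    """Hypothetical helper: look up records by any one of several key fields."""

    def __init__(self, records: Sequence[T], keys: Sequence[str]) -> None:
        self._records: List[T] = list(records)
        # One dictionary per key field, mapping field value -> row position.
        self._positions: Dict[str, Dict[Any, int]] = {key: {} for key in keys}
        for pos, record in enumerate(self._records):
            for key in keys:
                value = getattr(record, key)
                if value in self._positions[key]:
                    raise ValueError(f"duplicate value {value!r} for key {key!r}")
                self._positions[key][value] = pos

    def get(self, key: str, value: Any) -> T:
        """Return the record whose field `key` equals `value`."""
        return self._records[self._positions[key][value]]


@dataclass(frozen=True)
class Country:
    name: str
    code: str


countries = MultiKeyIndex(
    [Country("France", "FR"), Country("Japan", "JP")],
    keys=["name", "code"],
)

# Forward and inverse lookups return the same stored record.
assert countries.get("name", "Japan") is countries.get("code", "JP")
```

Keeping one position map per key field means every key resolves to the same stored record, so there is no second copy of the data to keep in sync, which is the usual pain point of maintaining a separate inverse dictionary by hand.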
A DataClassFrame provides good ergonomics for production code as columns are immutable and columns/data types are well defined by the dataclasses. This makes it easier for users to understand the "shape" of the data in large projects and refactor when necessary.
Get the latest version from PyPI using pip:
pip install dataclassframe
| Container | Positional indexing | Key indexing | Multi-key indexing | Data-oriented design | Column-wise operations | Type hints | Use in prod |
A container data-type for dataclasses...
from dataclasses import dataclass
from dataclassframe import DataClassFrame

# The record type: each field becomes a column.
@dataclass
class ExampleDC:
    field1: str
    field2: int

records = [
    ExampleDC('a', 1),
    ExampleDC('b', 2),
    ExampleDC('c', 3),
]

# Both fields are used as index keys.
dcf = DataClassFrame(
    record_class=ExampleDC,
    data=records,
    index=['field1', 'field2']
)
Which acts like an ordered dictionary with multi-indexing...
# Obtain record `ExampleDC('b', 2)`
row_idx = dcf.iat[1]    # Using positional index (zero-based)
row_f1 = dcf.at['b']    # Using index of `field1`
row_f2 = dcf.at[:, 2]   # Using index of `field2`
assert row_idx == row_f1 == row_f2
With bulk operations on the columns...
assert dcf.cols.field2.sum() == 6
Works nicely with Python 3 type hints...
dcf: DataClassFrame[ExampleDC]
dcf.iat[1]: ExampleDC
It's no secret that under the hood DataClassFrames use Pandas DataFrames to store data. Where possible, the data is converted to Pandas Series, which in turn use NumPy arrays. When the user accesses a record, the data is converted back into the dataclass provided at initialisation.
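The round trip described above (records in, columns stored, records out) can be sketched in pure Python. This is an illustration only, not the library's code: `ColumnStore` and its methods are hypothetical names, and plain lists stand in for the Pandas Series that DataClassFrame actually uses.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Generic, List, Sequence, Type, TypeVar

T = TypeVar("T")


class ColumnStore(Generic[T]):
    """Hypothetical column-wise store; lists stand in for Pandas Series."""

    def __init__(self, record_class: Type[T], data: Sequence[T]) -> None:
        self._record_class = record_class
        self._names = [f.name for f in fields(record_class)]
        # Split each record into per-field columns (a struct-of-arrays layout).
        self._columns: Dict[str, List[Any]] = {
            name: [getattr(record, name) for record in data]
            for name in self._names
        }

    def column(self, name: str) -> List[Any]:
        """Return a copy of one column, so callers cannot mutate the store."""
        return list(self._columns[name])

    def record(self, position: int) -> T:
        """Rebuild the dataclass instance stored at `position`."""
        values = {name: self._columns[name][position] for name in self._names}
        return self._record_class(**values)


@dataclass
class ExampleDC:
    field1: str
    field2: int


store = ColumnStore(
    ExampleDC,
    [ExampleDC('a', 1), ExampleDC('b', 2), ExampleDC('c', 3)],
)

# Bulk operation over a whole column.
assert sum(store.column("field2")) == 6
# Accessing a record converts the column values back into the dataclass.
assert store.record(1) == ExampleDC('b', 2)
```

Because the set of columns is fixed by the dataclass fields at construction time, the "shape" of the store can be read directly from the record class.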
Pandas provides many advantages over a simple list of dataclasses or similar, such as a smaller memory footprint and fast vectorised operations. However, using Pandas DataFrames directly in production code is considered an anti-pattern by the author and others. Specifically, DataFrames are column-wise mutable, so it is difficult to determine at code-time which columns a DataFrame contains, i.e. its shape. DataFrames also do not provide any type-hinting benefits.
All notable changes to this project will be documented here.
© Josh Levy-Kramer 2020. dataclassframe is released under the MIT license.