This Python module provides code for training popular clustering models on large datasets. We focus on Bayesian nonparametric models based on the Dirichlet process, but also provide parametric counterparts.
bnpy supports the latest online learning algorithms as well as standard offline methods. Our aim is to provide an inference platform that makes it easy for researchers and practitioners to compare models and algorithms.
FiniteMixtureModel: fixed number of clusters
DPMixtureModel: infinite number of clusters, via the Dirichlet process
Topic models (aka admixture models)
FiniteTopicModel: fixed number of topics. This is latent Dirichlet allocation (LDA).
HDPTopicModel: infinite number of topics, via the hierarchical Dirichlet process
Hidden Markov models (HMMs)
FiniteHMM: Markov sequence model with a fixed number of states
HDPHMM: Markov sequence model with an infinite number of states
ZeroMeanGauss: Zero-mean, full-covariance
These are all variants of variational inference, a family of optimization algorithms. We plan to eventually support sampling methods (Markov chain Monte Carlo) too.
You can find many examples of bnpy in action in our curated Example Gallery.
These same demos are also directly available as Python scripts inside the examples/ folder of the project GitHub repository.
You can use bnpy from a command line/terminal, or from within Python. Both options require specifying a dataset, an allocation model, an observation model (likelihood), and an algorithm. Optional keyword arguments with reasonable defaults allow control of specific model hyperparameters, algorithm parameters, etc.
Below, we show how to call bnpy to train an 8-component Gaussian mixture model on a default toy dataset stored in a .csv file on disk. In both cases, log information is printed to stdout, and all learned model parameters are saved to disk.
$ python -m bnpy.Run /path/to/my_dataset.csv FiniteMixtureModel Gauss EM --K 8 --output_path /tmp/my_dataset/results/
import bnpy
bnpy.run('/path/to/dataset.csv', 'FiniteMixtureModel', 'Gauss', 'EM', K=8, output_path='/tmp/my_dataset/results/')
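The two calls above share the same shape: four required positional values (dataset, allocation model, observation model, algorithm) followed by optional keyword settings. As a minimal sketch of that correspondence, the helper below turns Python-style arguments into the equivalent command-line argument list. Note that `build_bnpy_command` is a hypothetical name used only for illustration; it is not part of bnpy, and it assumes every keyword maps to a `--name value` pair as in the example above.

```python
import shlex


def build_bnpy_command(dataset_path, alloc_model, obs_model, alg_name, **kwargs):
    """Assemble an argument list for `python -m bnpy.Run`.

    Hypothetical helper for illustration only; not provided by bnpy.
    Assumes each keyword argument becomes a `--name value` pair.
    """
    # The four required positional arguments come first, in this order.
    cmd = ["python", "-m", "bnpy.Run",
           dataset_path, alloc_model, obs_model, alg_name]
    # Optional keyword arguments follow as `--name value` pairs.
    for name, value in kwargs.items():
        cmd += ["--" + name, str(value)]
    return cmd


cmd = build_bnpy_command(
    "/path/to/my_dataset.csv", "FiniteMixtureModel", "Gauss", "EM",
    K=8, output_path="/tmp/my_dataset/results/")

# Print the command as a single shell-safe string.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` on a machine where bnpy is installed.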
Train a Dirichlet-process Gaussian mixture model (DP-GMM) via the full-dataset variational algorithm (aka "VB" for variational Bayes).
python -m bnpy.Run /path/to/dataset.csv DPMixtureModel Gauss VB --K 8
Train a DP-GMM via the memoized variational algorithm, with birth and merge moves, and with the data divided into 10 batches.
python -m bnpy.Run /path/to/dataset.csv DPMixtureModel Gauss memoVB --K 8 --nBatch 10 --moves birth,merge
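To give a feel for what dividing the data into batches buys, here is a rough, self-contained sketch of the memoization idea: cache each batch's summary statistics, and when a batch is revisited, swap its stale contribution out of the whole-dataset totals for the fresh one. This is not bnpy's implementation. It assumes 1-D data and uses hard nearest-mean assignments in place of variational responsibilities, and all function names (`split_into_batches`, `batch_stats`, `memoized_pass`) are hypothetical.

```python
def split_into_batches(data, n_batch):
    """Divide data into n_batch roughly equal contiguous batches."""
    size, extra = divmod(len(data), n_batch)
    batches, start = [], 0
    for b in range(n_batch):
        stop = start + size + (1 if b < extra else 0)
        batches.append(data[start:stop])
        start = stop
    return batches


def batch_stats(batch, means):
    """Per-cluster [count, sum] for one batch (hard nearest-mean assignment)."""
    stats = [[0, 0.0] for _ in means]
    for x in batch:
        k = min(range(len(means)), key=lambda j: abs(x - means[j]))
        stats[k][0] += 1
        stats[k][1] += x
    return stats


def memoized_pass(batches, means, memo, total):
    """Visit every batch once, refreshing cached stats and running totals."""
    for b, batch in enumerate(batches):
        new = batch_stats(batch, means)
        for k in range(len(means)):
            # Replace the stale contribution of batch b with the fresh one.
            total[k][0] += new[k][0] - memo[b][k][0]
            total[k][1] += new[k][1] - memo[b][k][1]
        memo[b] = new
        # Update the means after each batch from the running totals.
        means = [total[k][1] / total[k][0] if total[k][0] > 0 else means[k]
                 for k in range(len(means))]
    return means


data = [0.1, 0.2, 0.0, 5.1, 4.9, 5.0, 0.3, 5.2, -0.1, 4.8]
batches = split_into_batches(data, 5)
means = [0.0, 1.0]
memo = [[[0, 0.0] for _ in means] for _ in batches]   # cached per-batch stats
total = [[0, 0.0] for _ in means]                     # sum of cached stats
for lap in range(3):
    means = memoized_pass(batches, means, memo, total)
print(means)
```

Because the totals always equal the sum of the cached per-batch statistics, each update uses information from every batch visited so far while touching only one batch at a time.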
# print help message for required arguments
python -m bnpy.Run --help
# print help message for specific keyword options for Gaussian mixture models
python -m bnpy.Run /path/to/dataset.csv FiniteMixtureModel Gauss EM --kwhelp
To use bnpy for the first time, follow the documentation's Installation Instructions.
Assistant Professor (Aug. 2018 - present)
Tufts University, Dept. of Computer Science
University of California, Irvine
Our NIPS 2015 paper describes inference algorithms that can add or remove clusters for the sticky HDP-HMM.
Our AISTATS 2015 paper describes our algorithms for HDP topic models.
Our NIPS 2013 paper introduced the memoized variational inference algorithm and applied it to Dirichlet process mixture models.
Our short paper from a workshop at NIPS 2014 describes the vision for bnpy as a general purpose inference engine.
Primarily, we intend bnpy to be a platform for researchers. By gathering many learning algorithms and popular models in one convenient, modular repository, we hope to make it easier to compare and contrast approaches. We also hope that the modular organization of bnpy enables researchers to try out new modeling ideas without reinventing the wheel.