Data embedding API based on the Variational Autoencoder (VAE), originally proposed by Kingma and Welling (https://arxiv.org/abs/1312.6114).
This tool, implemented in TensorFlow 1.x, is designed to work similarly to familiar dimensionality reduction methods such as scikit-learn's t-SNE or UMAP, but also to go beyond their capabilities in some notable ways, making full use of the VAE as a generative model.
While I decided to call the tool itself CompressionVAE, or CVAE for short, I mainly chose this to give it a unique name. In practice, it is based on a standard VAE, with the (optional) addition of Inverse Autoregressive Flow (IAF) layers to allow for more flexible posterior distributions. For details on the IAF layers, I refer you to the original paper: https://arxiv.org/pdf/1606.04934.pdf.
CompressionVAE has several advantages over common manifold learning methods like t-SNE and UMAP: it learns a deterministic embedding function that can be applied to new, previously unseen data; its decoder makes the embedding reversible, turning it into a generative model; and it scales to datasets that do not fit into memory.
CompressionVAE is distributed through PyPI under the name
cvae (https://pypi.org/project/cvae/). To install the latest version, simply run
pip install cvae
Alternatively, to locally install CompressionVAE, clone this repository and run the following command from the CompressionVAE root directory.
pip install -e .
To use CVAE to learn an embedding function, we first need to import the cvae library.
from cvae import cvae
When creating a CompressionVAE object for a new model, it needs to be provided with a training dataset. For small datasets that fit in memory, we can directly follow the sklearn convention. Let's look at this case first and take MNIST as an example.
First, load the MNIST data. (Note: this example requires scikit-learn, which is not installed with CVAE. You might have to install it first by running
pip install scikit-learn.)
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, cache=True)
X = mnist.data
Now we can create a CompressionVAE object/model based on this data. The minimal code to do this is
embedder = cvae.CompressionVAE(X)
By default, this creates a model with a two-dimensional latent space, splits the data X randomly into 90% train and 10% validation data, applies feature normalization, and tries to match the model architecture to the input and latent feature dimensions. It also saves the model in a temporary directory which gets overwritten the next time you create a new CVAE object there.
We will look at customising all this later, but for now let's move on to training.
Once a CVAE object is initialised and associated with data, we can train the embedder using its
train method. This works similarly to t-SNE's or UMAP's fit method.
In the simplest case, we just run
embedder.train()
This will train the model, applying automatic learning rate scheduling based on the validation data loss, and stop either when the model converges or after 50k training steps. We can also stop the training process early through a KeyboardInterrupt (ctrl-c or 'interrupt kernel' in Jupyter notebook). The model will be saved at this point.
It is also possible to stop training and then re-start with different parameters (see more details below).
One note/warning: at the moment, the model can be quite sensitive to initialization (in some rare cases even giving NaN losses). Re-initializing and re-training the model can improve the results if a training run did not give satisfactory results.
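If a run fails in this way, it is usually enough to create a fresh model object and train again. A minimal sketch, assuming the default temporary logdir so the previous model simply gets overwritten:
embedder = cvae.CompressionVAE(X)  # re-initializes the weights in the temp directory
embedder.train()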
Once we have a trained model (well, technically even before training, but the results would be random), we can use CVAE to compress data, embedding it into the latent space.
To do this, we use CVAE's embed method.
To embed the entire MNIST data:
z = embedder.embed(X)
But note that, unlike with t-SNE or UMAP, this data does not have to be the same as the training data. It can be new and previously unseen data.
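For example, if we had held back part of the data, we could embed it without retraining. A sketch, where the slice simply stands in for any array with the same feature dimension as the training data:
X_new = X[:1000]  # hypothetical previously unseen examples
z_new = embedder.embed(X_new)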
For two-dimensional latent spaces, CVAE comes with a built-in visualization method,
visualize. It provides a two-dimensional plot of the embeddings, including class information if available.
To visualize the MNIST embeddings and color them by their respective class, we can run
embedder.visualize(z, labels=[int(label) for label in mnist.target])
We could also pass the string labels
mnist.target directly to
labels, but in that case they would not necessarily be ordered from 0 to 9.
Optionally, if we pass
labels as a list of integers like above, we can also pass the
categories parameter, a list of strings associating names with the labels. In the case of MNIST this is irrelevant since the label and class names are the same.
By default, the
visualize method simply displays the plot. By setting the
filename parameter we can alternatively save the plot to a file.
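Putting these pieces together, a call that uses all three parameters might look like the following (the file name 'mnist_embeddings.png' is just an example):
embedder.visualize(z,
                   labels=[int(label) for label in mnist.target],
                   categories=[str(i) for i in range(10)],
                   filename='mnist_embeddings.png')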
Finally, we can use CVAE as a generative model, generating data by decoding arbitrary latent vectors using the decode method.
If we simply want to 'undo' our MNIST embedding and try to re-create the input data, we can run our embeddings
z through the decoder:
X_reconstructed = embedder.decode(z)
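For a visual sanity check of the reconstruction quality, we can compare an input digit to its reconstruction, for example with matplotlib (not installed with CVAE; this sketch assumes the usual 28x28 MNIST image shape):
import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2)
axes[0].imshow(np.asarray(X)[0].reshape(28, 28), cmap='gray')  # original digit
axes[0].set_title('original')
axes[1].imshow(X_reconstructed[0].reshape(28, 28), cmap='gray')  # decoded digit
axes[1].set_title('reconstruction')
plt.show()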
As a more interesting example, we can use this for data interpolation. Let's say we want to create the data that's halfway between the first and the second MNIST datapoint (a '5' and a '0' respectively). We can achieve this with the following code
import numpy as np
# Combine the two examples and add batch dimension.
z_interp = np.expand_dims(0.5*z[0] + 0.5*z[1], axis=0)
# Decode the new latent vector.
X_interp = embedder.decode(z_interp)
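The same idea extends to a full interpolation path. Since decode accepts a batch of latent vectors, as in the reconstruction example above, we can decode several intermediate points at once (a sketch):
# Nine evenly spaced interpolation steps between the first two digits.
alphas = np.linspace(0.0, 1.0, num=9)
z_path = np.stack([(1 - a) * z[0] + a * z[1] for a in alphas])
X_path = embedder.decode(z_path)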
In the case of image data, such as MNIST, CVAE also has a method that allows us to quickly visualize the latent space as seen through the decoder. To plot a 20 by 20 grid of reconstructed images, spanning the latent space region [-4, 4] in both x and y, we can run
embedder.visualize_latent_grid(xy_range=(-4.0, 4.0), grid_size=20, shape=(28, 28))
The example above shows the simplest usage of CVAE. However, if desired, a user can take much more control over the system and customize the model and the training process.
In the previous example we initialised a CompressionVAE with default parameters. If we want more control, we might want to initialise it the following way:
embedder = cvae.CompressionVAE(X,
                               train_valid_split=0.99,
                               dim_latent=16,
                               iaf_flow_length=10,
                               cells_encoder=[512, 256, 128],
                               initializer='lecun_normal',
                               batch_size=32,
                               batch_size_test=128,
                               logdir='~/mnist_16d',
                               feature_normalization=False,
                               tb_logging=True)
train_valid_split controls the random splitting into train and test data. Here 99% of X is used for training, and only 1% is reserved for validation.
Alternatively, to get more control over the data the user can also provide
X_valid as an input. In this case
train_valid_split is ignored and the model uses
X for training and
X_valid for validation.
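For example, to use a fixed, deterministic split instead of a random one (the 60000/10000 split here is just the conventional MNIST train/test split):
X_train, X_valid = X[:60000], X[60000:]
embedder = cvae.CompressionVAE(X_train, X_valid=X_valid)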
dim_latent specifies the dimensionality of the latent space.
iaf_flow_length controls how many IAF flow layers the model has.
cells_encoder determines the number, as well as the size, of the encoder's fully connected layers. In the case above, we have three layers with 512, 256, and 128 units respectively. The decoder uses the mirrored version of this.
If this parameter is not set, CVAE creates a two layer network with sizes adjusted to the input dimension and latent dimension. The logic behind this is very handwavy and arbitrary for now, and I generally recommend setting this manually.
initializer controls how the model weights are initialized (for example, 'lecun_normal' as used above; check the class definition for the available options).
batch_size and batch_size_test determine the batch sizes used for training and testing respectively.
logdir specifies the path to the model, and also acts as the model name. The default,
'temp', gets overwritten every time it is used, but other model names can be used to save and restore models for later use or even to continue training.
feature_normalization tells CVAE whether it should internally apply feature normalization (zero mean, unit variance, based on the training data) or not. If True, the normalisation factors are stored with the model and get applied to any future data.
tb_logging determines whether the model writes summaries for TensorBoard during the training process.
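With tb_logging enabled, the training curves can then be monitored by pointing TensorBoard (installed alongside TensorFlow) at the model directory, e.g.
tensorboard --logdir ~/mnist_16d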
In the simple example we called the
train method without any parameter. A more advanced call might look like
embedder.train(learning_rate=1e-4, num_steps=2000, dropout_keep_prob=0.6, test_every=50, lr_scheduling=False)
learning_rate sets the initial learning rate of training.
num_steps controls the maximum number of training steps before stopping.
dropout_keep_prob determines the keep probability for dropout in the fully connected layers.
test_every sets the frequency of test steps.
lr_scheduling controls whether learning rate scheduling is applied. If
False, training continues at the initial learning rate until
num_steps is reached.
For more arguments/details, for example controlling the details of the learning rate scheduler and the convergence criteria, check the method definition.
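As an example of re-starting training with different parameters, one might first run a short exploratory phase at a fixed learning rate, then continue with scheduling enabled (a sketch; the specific values are arbitrary):
embedder.train(learning_rate=1e-3, num_steps=5000, lr_scheduling=False)
embedder.train(learning_rate=1e-4, num_steps=50000)  # picks up from the saved model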
As an alternative to providing the input data
X as a single numpy array, as done with t-SNE and UMAP, CVAE also allows for much larger datasets that do not fit into a single array.
To prepare such a dataset, create a new directory, e.g.
'~/my_dataset', and save the training data as individual npy files per example in this directory.
(Note: the data can also be saved in nested sub-directories, for example one directory per category. CVAE will look through the entire directory tree for npy files.)
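As a sketch, such a directory could be created from an existing array like this (the file names are arbitrary; CVAE only looks for npy files):
import os
import numpy as np

dataset_dir = os.path.expanduser('~/my_dataset')
os.makedirs(dataset_dir, exist_ok=True)
# Save each training example as its own npy file.
for i, example in enumerate(np.asarray(X)):
    np.save(os.path.join(dataset_dir, 'example_{}.npy'.format(i)), example)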
When initialising a model based on this kind of data, pass the root directory of the dataset as X:
embedder = cvae.CompressionVAE(X='~/my_dataset')
Initialising will take slightly longer than if
X is passed as an array, even for the same number of data points. But this method scales in principle to arbitrarily large datasets, and only loads one batch at a time during training.
If a CompressionVAE object is initialized with
logdir='temp' it always starts from a new untrained model, possibly overwriting any previous temp model.
However, if a different
logdir is chosen, the model persists and can be reloaded.
If CompressionVAE is initialized with a
logdir that already exists and contains parameter and checkpoint files of a previous model, it attempts to restore that model's checkpoints.
In this case, any model specific input parameter (e.g.
cells_encoder) is ignored in favor of the original model's parameters.
A restored model can be used straight away to embed or generate data, but it is also possible to continue the training process, picking up from the most recent checkpoint.
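For example, re-creating the object from the earlier advanced example restores that model (assuming '~/mnist_16d' contains the saved parameter and checkpoint files):
embedder = cvae.CompressionVAE(X, logdir='~/mnist_16d')
z = embedder.embed(X)  # usable immediately
embedder.train()       # optionally continue training from the checkpoint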