CTGAN is a collection of Deep Learning based synthetic data generators for single table data, which are able to learn from real data and generate synthetic data with high fidelity.
Important Links | |
---|---|
💻 Website | Check out the SDV Website for more information about our overall synthetic data ecosystem. |
📙 Blog | A deeper look at open source, synthetic data creation and evaluation. |
📖 Documentation | Quickstarts, User and Development Guides, and API Reference. |
:octocat: Repository | The link to the GitHub Repository of this library. |
📜 License | This library is published under the MIT License. |
⌨️ Development Status | This software is in its Pre-Alpha stage. |
Slack | Join our Slack Workspace for announcements and discussions. |
Currently, this library implements the CTGAN and TVAE models described in the Modeling Tabular data using Conditional GAN paper, presented at the 2019 NeurIPS conference.
⚠️ If you're just getting started with synthetic data, we recommend installing the SDV library which provides user-friendly APIs for accessing CTGAN. ⚠️
The SDV library provides wrappers for preprocessing your data as well as additional usability features like constraints. See the SDV documentation to get started.
Alternatively, you can also install and use CTGAN directly, as a standalone library:
Using `pip`:
pip install ctgan
Using `conda`:
conda install -c pytorch -c conda-forge ctgan
When using the CTGAN library directly, you may need to manually preprocess your data into the correct format, for example:
In this example we load the Adult Census Dataset* which is a built-in demo dataset. We use CTGAN to learn from the real data and then generate some synthetic data.
from ctgan import CTGAN
from ctgan import load_demo

real_data = load_demo()

# Names of the columns that are discrete
discrete_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]

ctgan = CTGAN(epochs=10)
ctgan.fit(real_data, discrete_columns)

# Create synthetic data
synthetic_data = ctgan.sample(1000)
*For more information about the dataset see: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
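When CTGAN is used without SDV, the preprocessing mentioned above is left to the caller. The sketch below shows one possible approach with pandas. `prepare_for_ctgan` is a hypothetical helper, not part of the CTGAN API, and the rules it applies (continuous columns as floats, discrete columns as strings, no missing values) are assumptions about what "the correct format" means, so check them against the documentation before relying on them.

```python
import pandas as pd


def prepare_for_ctgan(data, discrete_columns):
    """Return a cleaned copy of `data` (hypothetical helper, not a CTGAN API)."""
    # Assumption: rows with missing values are dropped rather than imputed.
    cleaned = data.dropna().reset_index(drop=True)

    for column in cleaned.columns:
        if column in discrete_columns:
            # Assumption: discrete columns are passed as strings.
            cleaned[column] = cleaned[column].astype(str)
        else:
            # Assumption: every other column is continuous and cast to float.
            cleaned[column] = cleaned[column].astype(float)

    return cleaned


# Small made-up table that borrows column names from the demo dataset.
raw = pd.DataFrame({
    'age': [39, 50, None, 28],
    'hours-per-week': [40, 13, 40, 40],
    'sex': ['Male', 'Male', 'Female', 'Female'],
})
prepared = prepare_for_ctgan(raw, discrete_columns=['sex'])
print(prepared.dtypes)
```

The cleaned table can then be passed to `fit` together with the same list of discrete column names.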
Join our Slack channel to discuss more about CTGAN and synthetic data. If you find a bug or have a feature request, you can also open an issue on our GitHub.
Interested in contributing to CTGAN? Read our Contribution Guide to get started.
If you use CTGAN, please cite the following work:
Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular data using Conditional GAN. NeurIPS, 2019.
@inproceedings{ctgan,
  title={Modeling Tabular data using Conditional GAN},
  author={Xu, Lei and Skoularidou, Maria and Cuesta-Infante, Alfredo and Veeramachaneni, Kalyan},
  booktitle={Advances in Neural Information Processing Systems},
  year={2019}
}
Please note that these projects are external to the SDV Ecosystem. They are not affiliated with or maintained by DataCebo.
The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:
Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.
This release renames the models in CTGAN. `CTGANSynthesizer` is now called `CTGAN` and `TVAESynthesizer` is now called `TVAE`.
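Code that has to run against releases from both before and after the rename can look up whichever class name is present. This is a minimal sketch: `resolve_model_classes` is a hypothetical helper, and the stand-in namespaces only imitate what the `ctgan` module exposes so that the example runs without the package installed.

```python
from types import SimpleNamespace


def resolve_model_classes(module):
    """Return (ctgan_class, tvae_class) under either naming scheme."""
    # Prefer the new names and fall back to the pre-rename ones.
    ctgan_class = getattr(module, 'CTGAN', None) or getattr(module, 'CTGANSynthesizer')
    tvae_class = getattr(module, 'TVAE', None) or getattr(module, 'TVAESynthesizer')
    return ctgan_class, tvae_class


# Stand-ins for the real package, used only to keep the sketch runnable.
old_style = SimpleNamespace(CTGANSynthesizer='old-ctgan', TVAESynthesizer='old-tvae')
new_style = SimpleNamespace(CTGAN='new-ctgan', TVAE='new-tvae')

print(resolve_model_classes(old_style))
print(resolve_model_classes(new_style))
```

With the real package installed, the module itself would be passed instead: `import ctgan` followed by `resolve_model_classes(ctgan)`.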
This release updates CTGAN to use the latest version of RDT. It also includes performance and robustness updates to the data transformer.
This release fixes a bug with the decoder instantiation, and also allows users to set a random state for the model fitting and sampling.
This release adds support for Python 3.9, updates dependencies to ensure compatibility with the rest of the SDV ecosystem, and upgrades to the latest RDT release.
- `CTGAN` code - Issue #158 by @ori-katz100 and @fealho

Dependency upgrades to ensure compatibility with the rest of the SDV ecosystem.
This release fixes the way the loss function of the TVAE model is computed. In addition, the default value of `discriminator_decay` has been changed to a more suitable value, and some improvements to the tests have been added.
- `TVAE`: loss function - Issue #143 by @fealho and @DingfanChen
- `discriminator_decay` to 1e-6 - Pull request #145 by @fealho

This release exposes all the hyperparameters which the user may find useful for both `CTGAN` and `TVAE`. Also, `TVAE` can now be fitted on datasets that are shorter than the batch size and drops the last batch only if the data size is not divisible by the batch size.
- `TVAE`: Adapt `batch_size` to data size - Issue #135 by @fealho and @csala
- `ValueError` from `validate_discre_columns` with `uniqueCombinationConstraint` - Issue #133 by @fealho and @MLjungg

Maintenance release to upgrade dependencies to ensure compatibility with the rest of the SDV libraries. Also adds a validation on the CTGAN `condition_column` and `condition_value` inputs.
In this release we add a new TVAE model which was presented in the original CTGAN paper. It also exposes more hyperparameters and moves epochs and log_frequency from fit to the constructor.
A new `verbose` argument has been added to optionally disable unnecessary printing, and a new hyperparameter called `discriminator_steps` has been added to CTGAN to control the number of optimization steps performed in the discriminator for each generator epoch.
The code has also been reorganized and cleaned up for better readability and interpretability.
Special thanks to @Baukebrenninkmeijer @fealho @leix28 @csala for the contributions!
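To make the role of `discriminator_steps` concrete, the sketch below counts optimizer updates in a GAN training schedule where the discriminator is updated several times before each generator update. It only illustrates the schedule, under the assumption that the loop is organized per batch step. It does not reproduce CTGAN's training code, and `count_updates` is a hypothetical helper.

```python
def count_updates(n_rows, batch_size, epochs, discriminator_steps=1):
    """Count discriminator and generator updates for a simple GAN schedule."""
    # Assumption: at least one step per epoch, even for very small datasets.
    steps_per_epoch = max(n_rows // batch_size, 1)
    discriminator_updates = 0
    generator_updates = 0

    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            # The discriminator is optimized `discriminator_steps` times ...
            for _ in range(discriminator_steps):
                discriminator_updates += 1
            # ... before the generator is optimized once.
            generator_updates += 1

    return discriminator_updates, generator_updates


print(count_updates(n_rows=1000, batch_size=500, epochs=10, discriminator_steps=5))
# -> (100, 20)
```

Raising `discriminator_steps` therefore multiplies the discriminator's share of the training work while leaving the number of generator updates unchanged.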
- `log_frequency` to `__init__` - Issue #102 by @fealho

In this release we introduce several minor improvements to make CTGAN more versatile and properly support new types of data, such as categorical NaN values, as well as conditional sampling and features to save and load models.
Additionally, the dependency ranges and Python versions have been updated to support up-to-date runtimes.
Many thanks @fealho @leix28 @csala @oregonpillow and @lurosenb for working on making this release possible!
Minor version including changes to ensure the logs are properly printed and the option to disable the log transformation to the discrete column frequencies.
Special thanks to @kevinykuo for the contributions!
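The effect of the log transformation on discrete column frequencies can be illustrated with plain arithmetic. The sketch below turns the category counts of one discrete column into sampling probabilities, with and without a log transform. The `log(count + 1)` form and the `category_probabilities` helper are assumptions made for illustration, not a copy of CTGAN's internals.

```python
import math


def category_probabilities(counts, log_frequency=True):
    """Turn category counts into sampling probabilities."""
    if log_frequency:
        # Damp the counts so that rare categories are sampled more often.
        weights = {name: math.log(count + 1) for name, count in counts.items()}
    else:
        weights = dict(counts)

    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Made-up counts for a single discrete column.
counts = {'Private': 9000, 'Self-emp': 900, 'Never-worked': 9}
with_log = category_probabilities(counts, log_frequency=True)
without_log = category_probabilities(counts, log_frequency=False)
print(with_log)
print(without_log)
```

With the transform enabled, the rare category receives a much larger share of the probability mass than its raw frequency would give it; disabling it keeps the probabilities proportional to the observed counts.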
Reorganization of the project structure with a new Python API, new Command Line Interface and increased data format support.
First Release - NeurIPS 2019 Version.