The Synthetic Data SDK - An open-source Python toolkit for high-fidelity, privacy-safe synthetic data

TL;DR: pip install 'mostlyai[local]' → Synthetic Data for ALL

Data is the lifeblood of the digital era, capturing an ever-growing amount of knowledge. Yet while our future is increasingly shaped by smart products and services built on top of data, more often than not only a few people ever get the opportunity to access such data at scale.

Our mission at MOSTLY AI is to enable every business and every community to provide privacy-safe data access to all of their members. Today we are taking a big step forward by announcing the immediate availability of the all-new Synthetic Data SDK (https://github.com/mostly-ai/mostlyai), a state-of-the-art, easy-to-use Python package for differentially private data synthesis. Anyone can now install and run the SDK locally to train synthetic data generators that capture the vast knowledge of their existing data assets while protecting the privacy of each and every data subject. These generators can then be shared broadly and safely, enabling people and algorithms alike to gain insights from data at scale.

Getting Started

Getting started is as easy as executing

pip install 'mostlyai[local]'

within your Python environment. Once installed, you can launch an SDK instance in local mode and train your first synthetic data generator with a few lines of code:

# 1) Initialize SDK in local mode
from mostlyai.sdk import MostlyAI
mostly = MostlyAI(local=True)

# 2) Load some original data
import pandas as pd
repo_url = 'https://github.com/mostly-ai/public-demo-data/raw/dev'
original_df = pd.read_csv(repo_url + '/census/census.csv.gz').sample(n=1_000)

# 3) Train a synthetic data generator
g = mostly.train(name='census', data=original_df)

# 4) Probe the generator for synthetic samples
synthetic_df = mostly.probe(g, size=10)
display(synthetic_df)  # display() is built into Jupyter/IPython; use print(synthetic_df) in a plain script

# 5) Export the quality assurance reports
g.reports()
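
The quality assurance reports exported in step 5 summarize how closely the synthetic data tracks the original. For intuition about what a fidelity metric can measure, the sketch below compares the category frequencies of one column in two samples using the total variation distance. It is a hand-rolled illustration, not the SDK's own metric implementation: the function name `total_variation_distance` and the toy values are assumptions made up for this example.

```python
from collections import Counter


def total_variation_distance(original, synthetic):
    """Half the summed absolute difference between two empirical category distributions.

    Hypothetical helper for illustration; not part of the SDK.
    """
    orig_freq = Counter(original)
    syn_freq = Counter(synthetic)
    categories = set(orig_freq) | set(syn_freq)
    return 0.5 * sum(
        abs(orig_freq[c] / len(original) - syn_freq[c] / len(synthetic))
        for c in categories
    )


# Made-up values standing in for one categorical column of each dataset
original_col = ["a", "a", "b", "c"]
synthetic_col = ["a", "b", "b", "c"]

tvd = total_variation_distance(original_col, synthetic_col)
print(f"total variation distance: {tvd:.2f}")  # 0 means identical frequencies, 1 means disjoint
```

A distance near 0 means the synthetic column reproduces the original frequencies closely. The SDK's reports cover far more than this single univariate check.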

Key Features of the Synthetic Data SDK

A wide range of configuration options is available for controlling both the training and the generation of synthetic samples. Key features of the SDK include:

Broad Data Support: Mixed data types (categorical, numerical, geospatial, text, etc.), single/multi-table & time-series.
Multiple Model Types: TabularARGN (SOTA tabular), fine-tuned HuggingFace models, and efficient LSTM for text.
Advanced Training Options: CPU/GPU support, differential privacy, and real-time progress monitoring.
Automated Quality Assurance: Built-in fidelity/privacy metrics and in-depth HTML reports for visual analysis.
Flexible Sampling: Up-sample to any volume, generate data conditionally, rebalance segments, impute context-aware values, ensure fairness, and control outcomes via temperature.
Seamless Integration: Connect directly to external databases and cloud storage, and build on a fully permissive open-source license.
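
To give a feel for what controlling outcomes via temperature means, here is a minimal sketch of how a sampling temperature typically reshapes a categorical distribution in generative models. It is a generic illustration under stated assumptions, not the SDK's internal code: the helper `apply_temperature`, the category labels, and the probabilities are all made up for this example.

```python
import random


def apply_temperature(probs, temperature):
    """Rescale each probability to p ** (1 / T), then renormalize to sum to 1.

    Hypothetical helper for illustration; not part of the SDK.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    weights = [p ** (1.0 / temperature) for p in probs]
    total = sum(weights)
    return [w / total for w in weights]


categories = ["Private", "Self-employed", "Government"]  # made-up labels
learned_probs = [0.7, 0.2, 0.1]  # made-up model probabilities

sharper = apply_temperature(learned_probs, 0.5)    # T < 1: favors the most likely category
unchanged = apply_temperature(learned_probs, 1.0)  # T = 1: leaves the distribution as-is
flatter = apply_temperature(learned_probs, 2.0)    # T > 1: moves toward uniform

rng = random.Random(42)  # fixed seed so the draw is reproducible
samples = rng.choices(categories, weights=sharper, k=5)
print(samples)
```

Lower temperatures yield more conservative, typical samples, while higher temperatures yield more diverse ones. Which setting is appropriate depends on the use case.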

For full details, please check out the SDK package documentation (https://mostly-ai.github.io/mostlyai/). It contains a complete API reference and schema reference, as well as a variety of usage examples and synthetic data tutorials. The tutorials cover advanced topics ranging from fine-tuning language models to AI explainability, and many more.

Full Platform Interoperability

One key aspect of the SDK is that all locally trained generators are fully compatible with the MOSTLY AI Platform and can thus be transferred to it seamlessly for further dissemination and exploration. Once a generator is imported, the intuitive platform interface allows anyone, regardless of technical background, to use the AI-powered Assistant to generate and analyze synthetic data as they see fit.

Help Us with Your Feedback

We’ll keep adding more usage examples and synthetic data tutorials over the coming weeks. Meanwhile, we’d love to see what you build with the SDK! Whether you’re sharing or receiving data, our toolkit is here to help everyone create a better, privacy-safe future. Share your feedback, and let’s shape the future of data together.

→ https://github.com/mostly-ai/mostlyai