Synthetic Data SDK ✨


This Open Source SDK allows you to create, browse, and manage the key resources of the MOSTLY AI Data Intelligence Platform from your local environment: Generators, Synthetic Datasets, and Connectors.


Overview

The Synthetic Data SDK is an Open Source Python toolkit for creating high-fidelity, privacy-safe Synthetic Data.

LOCAL mode trains and generates synthetic data locally on your own compute resources.
CLIENT mode connects to a MOSTLY AI Data Intelligence Platform for training & generating synthetic data.
Generators that were trained locally can easily be imported to the Platform for sharing.
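The choice between the two modes can be made at start-up. The following is a minimal sketch, not the SDK's own logic: the helper `sdk_init_kwargs` is hypothetical, the environment variable names `MOSTLY_API_KEY` and `MOSTLY_BASE_URL` are assumptions, and the keyword arguments `local`, `api_key`, and `base_url` are assumed to match the `MostlyAI` constructor. Check the package documentation before relying on any of them.

```python
import os
from typing import Mapping, Optional


def sdk_init_kwargs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return keyword arguments for MostlyAI().

    CLIENT mode is selected when an API key is present, LOCAL mode otherwise.
    """
    env = os.environ if env is None else env
    api_key = env.get("MOSTLY_API_KEY")  # assumed variable name
    if api_key:
        kwargs = {"api_key": api_key}
        base_url = env.get("MOSTLY_BASE_URL")  # assumed variable name
        if base_url:
            kwargs["base_url"] = base_url
        return kwargs
    # No API key available: fall back to training and generating locally.
    return {"local": True}


# Usage (requires the mostlyai package; keyword names are assumptions):
# from mostlyai.sdk import MostlyAI
# mostly = MostlyAI(**sdk_init_kwargs())
```

Keeping the decision in a small pure function means the same notebook or script can run unchanged on a laptop in LOCAL mode and against a Platform in CLIENT mode.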

The SDK allows you to programmatically create, browse and manage three key resources:

Generators - Train a synthetic data generator on your existing tabular or language data
Synthetic Datasets - Use a generator to create any number of synthetic samples
Connectors - Connect to any data source in your organization, for reading and writing data

Simple Installation

pip install mostlyai


Quick Start

To use the SDK with the MOSTLY AI Data Intelligence Platform, obtain your personal API key from your account settings page and adjust the following code snippet accordingly.

To run the SDK locally instead, initialize it in local mode, for example with `MostlyAI(local=True)`.

# install or upgrade the SDK (the leading "!" is Jupyter notebook syntax)
!pip install -U mostlyai

# initialize the SDK
from mostlyai.sdk import MostlyAI
mostly = MostlyAI()

# train a generator
g = mostly.train(data="/path/to/data")

# inspect generator quality
g.reports(display=True)

# generate any number of new privacy-safe samples
mostly.probe(g, size=1_000_000)

# generate new synthetic samples conditioned on the seed values you provide
mostly.probe(g, seed=[{'age': 65, 'gender': 'male'}])

# export and share your generator
g.export_to_file()

Key Features

Broad Data Support

Mixed-type data (categorical, numerical, geospatial, text, etc.)

Single-table, multi-table, and time-series

Multiple Model Types

TabularARGN for SOTA tabular performance

Fine-tune HuggingFace-based language models

Efficient LSTM for text synthesis from scratch

Advanced Training Options

GPU/CPU support

Differential Privacy

Progress Monitoring

Automated Quality Assurance

Quality metrics for fidelity and privacy

In-depth HTML reports for visual analysis

Flexible Sampling

Up-sample to any data volume

Conditional generation on any column

Re-balance underrepresented segments

Context-aware data imputation

Statistical fairness controls

Rule-adherence via temperature
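Conditional generation and re-balancing can both be driven by the seed passed to `mostly.probe`, as in the Quick Start. The sketch below shows one possible way to prepare such a seed; it is not necessarily how the SDK implements re-balancing. The helper `build_seed` is hypothetical and uses only the standard library. It produces a list of records in which a chosen column follows the shares you request.

```python
from typing import Dict, List


def build_seed(size: int, column: str, shares: Dict[str, float]) -> List[dict]:
    """Build `size` seed records whose `column` values follow the given shares.

    Shares are relative weights; they do not need to sum to 1.
    """
    if size < 0:
        raise ValueError("size must be non-negative")
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("shares must sum to a positive number")
    # Largest-remainder allocation, so that the counts add up to exactly `size`.
    exact = {value: size * weight / total for value, weight in shares.items()}
    counts = {value: int(x) for value, x in exact.items()}
    leftover = size - sum(counts.values())
    by_remainder = sorted(exact, key=lambda v: exact[v] - counts[v], reverse=True)
    for value in by_remainder[:leftover]:
        counts[value] += 1
    return [{column: value} for value, n in counts.items() for _ in range(n)]


# Usage (assumes a trained generator `g` and an initialized `mostly` client):
# seed = build_seed(1_000, "gender", {"male": 0.5, "female": 0.5})
# mostly.probe(g, seed=seed)
```

Because the seed fixes only the listed column, the generator is left to fill in all remaining columns for each record.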

Seamless Integration

Connect to external data sources (databases, cloud storage)

Fully permissive open-source license

Synthetic Data SDK ✨ Tutorials

Our documentation comes with a full set of Tutorials that show various ways of working with the Synthetic Data SDK ✨. Simply clone the tutorial repository to your own environment and run the tutorials locally via Jupyter Lab.
