Synthetic Data SDK ✨

This open-source SDK lets you locally create, browse, and manage the three key resources of the MOSTLY AI Platform: Generators, Synthetic Datasets, and Connectors.

Overview

The Synthetic Data SDK is an Open Source Python toolkit for creating high-fidelity, privacy-safe Synthetic Data.

  • LOCAL mode trains and generates synthetic data locally on your own compute resources.
  • CLIENT mode connects to a MOSTLY AI Platform for training & generating synthetic data.
  • Generators trained locally can be easily imported to the Platform for sharing.
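
The choice between the two modes can be made at start-up. Below is a minimal sketch, not the SDK's own mechanism: `sdk_init_kwargs` is a hypothetical helper, the environment variable names `MOSTLY_API_KEY` and `MOSTLY_BASE_URL` are assumptions, and the keyword names `local`, `api_key`, and `base_url` are assumed to match the `MostlyAI` constructor. Check all of them against the SDK reference before relying on this.

```python
import os


def sdk_init_kwargs(env=None):
    """Hypothetical helper: pick keyword arguments for MostlyAI(...).

    If an API key is present, return CLIENT-mode arguments;
    otherwise fall back to LOCAL mode.
    """
    env = os.environ if env is None else env
    api_key = env.get("MOSTLY_API_KEY")  # assumed variable name
    if api_key:
        # CLIENT mode: training and generation run on a MOSTLY AI Platform
        kwargs = {"api_key": api_key}
        base_url = env.get("MOSTLY_BASE_URL")  # assumed variable name
        if base_url:
            kwargs["base_url"] = base_url
        return kwargs
    # LOCAL mode: training and generation run on your own compute
    return {"local": True}


# Usage (requires `pip install mostlyai`):
# from mostlyai.sdk import MostlyAI
# mostly = MostlyAI(**sdk_init_kwargs())
```

Keeping the decision in one small function means the rest of a script can stay identical in both modes.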

The SDK allows you to programmatically create, browse and manage three key resources:

  1. Generators - Train a synthetic data generator on your existing tabular or language data
  2. Synthetic Datasets - Use a generator to create any number of synthetic samples
  3. Connectors - Connect to any data source in your organization, for reading and writing data

Simple Installation

pip install mostlyai

Quick Start

To use the SDK with the MOSTLY AI Platform, obtain your personal API key from your account settings page and adjust the following code snippet accordingly.

To run the SDK locally, initialize it in local mode.
# install or upgrade the SDK (notebook syntax; in a terminal, drop the leading "!")
!pip install -U mostlyai

# initialize the SDK in local mode
from mostlyai.sdk import MostlyAI
mostly = MostlyAI(local=True)

# train a generator
g = mostly.train(data="/path/to/data")

# inspect generator quality
g.reports(display=True)

# generate any number of new privacy-safe samples
mostly.probe(g, size=1_000_000)

# generate synthetic samples conditioned on seed values of your choice
mostly.probe(g, seed=[{'age': 65, 'gender': 'male'}])

# export and share your generator
g.export_to_file()
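
The `seed` argument in the snippet above is a list of dictionaries, one per row to condition on. As a rough sketch, a larger seed covering every combination of a few column values could be assembled as follows. `build_seed` is a hypothetical helper, not part of the SDK, and the column names simply mirror the ones used above.

```python
from itertools import product


def build_seed(**column_values):
    """Hypothetical helper: one seed row per combination of the given values."""
    columns = list(column_values)
    return [
        dict(zip(columns, combo))
        for combo in product(*column_values.values())
    ]


# 3 ages x 2 genders = 6 seed rows, in the same shape as the seed above
seed = build_seed(age=[25, 45, 65], gender=["male", "female"])

# Pass it to the SDK as before (requires a trained generator `g`):
# mostly.probe(g, seed=seed)
```

Building the seed programmatically keeps the conditioning values in one place, which helps when the same segments need to be generated repeatedly.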

Key Features

Broad Data Support
  • Mixed-type data (categorical, numerical, geospatial, text, etc.)
  • Single-table, multi-table, and time-series

Multiple Model Types
  • TabularARGN for SOTA tabular performance
  • Fine-tune HuggingFace-based language models
  • Efficient LSTM for text synthesis from scratch

Advanced Training Options
  • GPU/CPU support
  • Differential Privacy
  • Progress Monitoring

Automated Quality Assurance
  • Quality metrics for fidelity and privacy
  • In-depth HTML reports for visual analysis

Flexible Sampling
  • Up-sample to any data volume
  • Conditional generation by any columns
  • Re-balance underrepresented segments
  • Context-aware data imputation
  • Statistical fairness controls
  • Rule-adherence via temperature

Seamless Integration
  • Connect to external data sources (DBs, cloud storage)
  • Fully permissive open-source license

Synthetic Data SDK ✨ in our Tutorials

Our documentation comes with a full set of tutorials that show various ways of working with the Synthetic Data SDK. Simply clone the tutorial repository to your own environment and run the tutorials locally via JupyterLab.