🚀 MOSTLY AI releases World’s First Industry-Grade Open-Source Toolkit for Synthetic Data

Synthetic Data SDK ✨

The official Python SDK for MOSTLY AI. This toolkit allows you to programmatically create, browse, and manage the 3 key resources of the MOSTLY AI Platform: Generators, Synthetic Datasets, and Connectors.

Overview

The Synthetic Data SDK is a Python toolkit for high-fidelity, privacy-safe Synthetic Data.

  • LOCAL mode trains and generates synthetic data locally on your own compute resources.
  • CLIENT mode connects to a remote MOSTLY AI platform for training & generating synthetic data there.
  • Generators that were trained locally can easily be imported to a platform for further sharing.
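
As a rough illustration of choosing between the two modes, the sketch below picks constructor arguments from the environment. The `local`, `api_key` and `base_url` arguments and the `MOSTLYAI_API_KEY` / `MOSTLYAI_BASE_URL` variable names are taken from the Quick Start; the helper names `client_kwargs` and `make_client`, and the policy of falling back to LOCAL mode when no key is set, are assumptions made for this example and not part of the SDK.

```python
import os


def client_kwargs(env=None):
    """Pick MostlyAI constructor arguments from environment variables.

    Hypothetical helper: returns {'local': True} when no API key is set,
    otherwise the api_key/base_url pair for CLIENT mode.
    """
    env = os.environ if env is None else env
    api_key = env.get("MOSTLYAI_API_KEY")
    if not api_key:
        # no credentials available: fall back to LOCAL mode (assumed policy)
        return {"local": True}
    # credentials found: use CLIENT mode against the configured platform
    return {
        "api_key": api_key,
        "base_url": env.get("MOSTLYAI_BASE_URL", "https://app.mostly.ai"),
    }


def make_client(env=None):
    """Create the SDK client; requires the mostlyai package to be installed."""
    # imported lazily so client_kwargs() works without the SDK installed
    from mostlyai.sdk import MostlyAI

    return MostlyAI(**client_kwargs(env))
```

Calling `make_client()` would then yield a client in whichever mode the environment implies. Keeping the decision in a plain dictionary makes it easy to inspect or log before any connection is attempted.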

The SDK allows you to programmatically create, browse and manage 3 key resources:

  1. Generators - Train a synthetic data generator on your existing tabular or language data assets
  2. Synthetic Datasets - Use a generator to create as many synthetic samples as you need
  3. Connectors - Connect to any data source within your organization, for reading and writing data

| Intent | Primitive | API Reference |
| --- | --- | --- |
| Train a Generator on tabular or language data | g = mostly.train(config) | mostly.train |
| Generate any number of synthetic data records | sd = mostly.generate(g, config) | mostly.generate |
| Live probe the generator on demand | df = mostly.probe(g, config) | mostly.probe |
| Connect to any data source within your org | c = mostly.connect(config) | mostly.connect |

Simple Installation

The latest release of mostlyai can be installed via pip. Use [local] if you want to generate synthetic data locally in your environment and [local-gpu] to enable local GPU support.
pip install mostlyai
pip install 'mostlyai[local]'      # quotes keep shells such as zsh from expanding the brackets
pip install 'mostlyai[local-gpu]'
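
To confirm that the installation succeeded, one option is to query the package metadata from Python. This is a minimal sketch using only the standard library; the helper name `installed_version` is made up for this example, and it only reports whether the `mostlyai` distribution is present, not which extras were installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="mostlyai"):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version()
print(f"mostlyai {found}" if found else "mostlyai is not installed")
```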

Quick Start

If you want to use the SDK with the MOSTLY AI Platform, please obtain your personal API key from your account settings page, and adjust the following code snippet.

To run the SDK locally instead, install the [local] extra and initialize the SDK in local mode.
#!pip install mostlyai
import pandas as pd
from mostlyai.sdk import MostlyAI

# initialize SDK in local mode
# mostly = MostlyAI(local=True)
# or initialize SDK in client mode
mostly = MostlyAI(
    api_key='INSERT_YOUR_API_KEY',   # or set env var `MOSTLYAI_API_KEY`
    base_url='https://app.mostly.ai' # or set env var `MOSTLYAI_BASE_URL`
)

# train a generator on original data
df_original = pd.read_csv('https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz')
g = mostly.train(name='census', data=df_original)  # shorthand syntax for 1-table config

# live probe the generator for synthetic samples
df_samples = mostly.probe(g, size=10)

# generate a synthetic dataset
sd = mostly.generate(g, size=10_000)

# download the synthetic dataset
df_synthetic = sd.data()
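
The shorthand `mostly.train(name=..., data=...)` call above stands in for a full configuration. The sketch below shows what an explicit single-table configuration might look like as a plain dictionary. The helper `single_table_config` is hypothetical, and the field names (`tables`, `tabular_model_configuration`, `max_training_time`) are assumptions that should be checked against the `mostly.train` API reference before use.

```python
def single_table_config(name, data, max_training_time=None):
    """Build an explicit generator config for a single table.

    `data` is passed through untouched, so it can be whatever the SDK
    accepts for a table (for example a pandas DataFrame).
    """
    table = {"name": name, "data": data}
    if max_training_time is not None:
        # assumed option: cap the training time of the tabular model
        table["tabular_model_configuration"] = {
            "max_training_time": max_training_time
        }
    # one generator holding one table; field names are assumptions
    return {"name": name, "tables": [table]}
```

Under these assumptions, `g = mostly.train(config=single_table_config('census', df_original))` would be the explicit counterpart of the shorthand call. Spelling out the dictionary becomes useful once per-table options or several tables are involved.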

Key Features

  • Broad Data Support
      • Mixed-type data (categorical, numerical, geospatial, text, etc.)
      • Single-table, multi-table, and time-series
  • Multiple Model Types
      • TabularARGN for SOTA tabular performance
      • Fine-tune HuggingFace-based language models
      • Efficient LSTM for text synthesis from scratch
  • Advanced Training Options
      • GPU/CPU support
      • Differential Privacy
      • Progress Monitoring
  • Automated Quality Assurance
      • Quality metrics for fidelity and privacy
      • In-depth HTML reports for visual analysis
  • Flexible Sampling
      • Up-sample to any data volume
      • Conditional generation by any columns
      • Re-balance underrepresented segments
      • Context-aware data imputation
      • Statistical fairness controls
      • Rule-adherence via temperature
  • Seamless Integration
      • Connect to external data sources (DBs, cloud storage)
      • Fully permissive open-source license

Synthetic Data SDK ✨ in our Tutorials

Our documentation comes with a full set of Tutorials that show various ways of working with the Synthetic Data SDK ✨. Simply clone the tutorial repository to your own environment and run it locally via Jupyter Lab.