The Synthetic Data SDK is a Python toolkit for generating high-fidelity, privacy-safe synthetic data.
The SDK lets you programmatically create, browse, and manage three key resources: generators, synthetic datasets, and connectors. Its main primitives are:
Intent | Primitive | API Reference |
---|---|---|
Train a Generator on tabular or language data | `g = mostly.train(config)` | `mostly.train` |
Generate any number of synthetic data records | `sd = mostly.generate(g, config)` | `mostly.generate` |
Live probe the generator on demand | `df = mostly.probe(g, config)` | `mostly.probe` |
Connect to any data source within your org | `c = mostly.connect(config)` | `mostly.connect` |
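
Each primitive in the table takes a `config`. As a rough sketch of what a single-table training config might look like, the snippet below builds a plain dictionary. The key names (`name`, `tables`, `data`) and the helper `single_table_config` are assumptions made for illustration, not something confirmed here; check the `mostly.train` API reference for the authoritative schema.

```python
def single_table_config(name, data):
    """Hypothetical helper: wrap one table in a generator config dict.

    The keys 'name', 'tables' and 'data' are assumed for illustration;
    consult the API reference for the actual schema.
    """
    return {
        "name": name,  # name of the generator
        "tables": [
            {"name": name, "data": data},  # one entry per table
        ],
    }


# Stand-in rows; in practice `data` would be a pandas DataFrame.
rows = [{"age": 39, "education": "Bachelors"}]
config = single_table_config("census", rows)

# With an initialized SDK, the config would then be passed on, e.g.:
# g = mostly.train(config=config)
```

Keeping the config as a plain dictionary makes it easy to extend to several tables by appending further entries to the `tables` list.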
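# base package (client mode)
pip install mostlyai
# adds local mode; the quotes keep shells such as zsh from expanding the brackets
pip install 'mostlyai[local]'
# adds local mode with GPU support
pip install 'mostlyai[local-gpu]'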
# !pip install mostlyai  # uncomment to install from within a notebook
import pandas as pd
from mostlyai.sdk import MostlyAI
# initialize SDK in local mode
# mostly = MostlyAI(local=True)
# or initialize SDK in client mode
mostly = MostlyAI(
api_key='INSERT_YOUR_API_KEY', # or set env var `MOSTLYAI_API_KEY`
base_url='https://app.mostly.ai' # or set env var `MOSTLYAI_BASE_URL`
)
# train a generator on original data
df_original = pd.read_csv('https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz')
g = mostly.train(name='census', data=df_original) # shorthand syntax for 1-table config
# live probe the generator for synthetic samples
df_samples = mostly.probe(g, size=10)
# generate a synthetic dataset
sd = mostly.generate(g, size=10_000)
# download the synthetic dataset
df_synthetic = sd.data()
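
The client-mode initialization above notes that the API key and base URL can also come from the environment variables `MOSTLYAI_API_KEY` and `MOSTLYAI_BASE_URL`. A minimal sketch of resolving them explicitly, so that a missing key fails early with a clear message, could look like the following. The helper `resolve_credentials` is hypothetical and not part of the SDK, and the fallback base URL is simply the one used in the example above.

```python
import os


def resolve_credentials(api_key=None, base_url=None, env=os.environ):
    """Hypothetical helper (not part of the SDK): pick up credentials
    from explicit arguments first, then from environment variables."""
    api_key = api_key or env.get("MOSTLYAI_API_KEY")
    # Fallback URL taken from the example above; an assumption, not a documented default.
    base_url = base_url or env.get("MOSTLYAI_BASE_URL", "https://app.mostly.ai")
    if not api_key:
        raise ValueError("No API key given; set MOSTLYAI_API_KEY or pass api_key")
    return {"api_key": api_key, "base_url": base_url}


# Demonstration with an explicit mapping instead of the real environment.
creds = resolve_credentials(env={"MOSTLYAI_API_KEY": "example-key"})

# With the SDK installed, the result could be unpacked into the client:
# mostly = MostlyAI(**resolve_credentials())
```

Passing the environment as a parameter keeps the helper easy to test without touching real credentials.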