🚀 MOSTLY AI releases World’s First Industry-Grade Open-Source Toolkit for Synthetic Data

Synthetic Data SDK ✨

The official Python SDK for MOSTLY AI. This toolkit lets you programmatically create, browse, and manage the three key resources of the MOSTLY AI Platform: Generators, Synthetic Datasets, and Connectors.

Simple Installation

The latest release of mostlyai can be installed via pip. Use the [local] extra if you want to generate synthetic data locally in your environment, and the [local-gpu] extra to enable local GPU support.
pip install mostlyai
pip install 'mostlyai[local]'      # quoted so shells like zsh do not expand the brackets
pip install 'mostlyai[local-gpu]'
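
To confirm which release ended up in the active environment, a small check along these lines may help. It is a minimal sketch that relies only on the Python standard library; the helper name `installed_version` is made up for this example and is not part of the SDK.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# `installed_version` is a hypothetical helper for illustration only
found = installed_version("mostlyai")
print(f"mostlyai {found}" if found else "mostlyai is not installed")
```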

Quick Start

If you want to use the SDK with the MOSTLY AI Platform, please obtain your personal API key from your account settings page, and adjust the following code snippet.

To run the SDK locally instead, initialize it in local mode by using the commented-out `MostlyAI(local=True)` line in the snippet.
#!pip install mostlyai
import pandas as pd
from mostlyai.sdk import MostlyAI

# initialize SDK in local mode
# mostly = MostlyAI(local=True)
# or initialize SDK in client mode
mostly = MostlyAI(
    api_key='INSERT_YOUR_API_KEY',   # or set env var `MOSTLYAI_API_KEY`
    base_url='https://app.mostly.ai' # or set env var `MOSTLYAI_BASE_URL`
)

# train a generator on original data
df_original = pd.read_csv('https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz')
g = mostly.train(name='census', data=df_original)  # shorthand syntax for 1-table config

# live probe the generator for synthetic samples
df_samples = mostly.probe(g, size=10)

# generate a synthetic dataset
sd = mostly.generate(g, size=10_000)

# download the synthetic dataset
df_synthetic = sd.data()
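
One way to avoid editing the snippet by hand is to pick the mode from the environment. The following is a rough sketch, not an SDK feature: the helpers `sdk_kwargs` and `connect` are hypothetical, and the fallback to local mode when no API key is set is an assumption of this example. Only the constructor arguments and environment variable names shown in the Quick Start are used.

```python
import os


def sdk_kwargs(env=None):
    """Choose MostlyAI constructor arguments from environment variables.

    Hypothetical helper: uses client mode when MOSTLYAI_API_KEY is set,
    and falls back to local mode otherwise.
    """
    env = os.environ if env is None else env
    api_key = env.get("MOSTLYAI_API_KEY")
    if api_key:
        return {
            "api_key": api_key,
            "base_url": env.get("MOSTLYAI_BASE_URL", "https://app.mostly.ai"),
        }
    return {"local": True}


def connect(env=None):
    """Create the SDK client; the import is deferred until it is needed."""
    from mostlyai.sdk import MostlyAI

    return MostlyAI(**sdk_kwargs(env))
```

Calling `mostly = connect()` would then yield either a client-mode or a local-mode instance, and the rest of the Quick Start can stay unchanged.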

Highlights

High Fidelity: powered by TabularARGN, the SDK creates synthetic data on par with state-of-the-art (SOTA) models.
Privacy by Design: TabularARGN only considers privacy-preserving value ranges for sampling and has built-in privacy protection features. It can also be trained via DP-SGD to obtain differential privacy guarantees.
Compute Efficiency: with training speeds up to 100x faster, TabularARGN scales effectively, even for large and complex datasets.
Sampling Flexibility: supports advanced sampling capabilities, including conditional generation, missing value imputation, fairness adjustments, and control of sampling probabilities via temperature adjustments.
Data Versatility: accommodates the heterogeneity of real-world tabular datasets, including: Multi-variate, mixed-type data (categorical, numerical, date-time, geo-spatial); Multi-sequence datasets with varying sequence lengths and varying time intervals.
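
To make the temperature adjustment mentioned under Sampling Flexibility more concrete, here is a generic illustration of how a temperature reshapes a categorical distribution. It is a conceptual sketch in plain Python, not TabularARGN's actual implementation; the function `apply_temperature` and the example probabilities are invented for this purpose.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a categorical distribution as p_i ** (1 / T), then renormalize.

    T < 1 sharpens the distribution, T > 1 flattens it, T = 1 leaves it unchanged.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not any(p > 0 for p in probs):
        raise ValueError("at least one probability must be positive")
    # work in log space; zero-probability categories stay at zero
    logits = [math.log(p) / temperature if p > 0 else -math.inf for p in probs]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


# illustrative probabilities for three categories of one column
probs = [0.7, 0.2, 0.1]
sharper = apply_temperature(probs, 0.5)  # favors the most likely category
flatter = apply_temperature(probs, 2.0)  # moves toward a uniform distribution
```

Lower temperatures make samples more conservative, while higher temperatures increase diversity by giving rarer values more weight.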

Synthetic Data SDK ✨ in our Tutorials

Our documentation comes with a full set of Tutorials that show various ways of working with the Synthetic Data SDK ✨. Simply clone the tutorial repository to your own environment and run the tutorials locally via Jupyter Lab.
Colab Tutorials