Access to high-quality data is essential for organizations to accelerate AI and analytics initiatives. However, privacy concerns, compliance constraints, and slow data access often stand in the way. Privacy-preserving synthetic data removes these roadblocks, providing on-demand, high-fidelity datasets without compromising sensitive information.

Powered by MOSTLY AI, the Synthetic Data SDK enables you to generate synthetic data directly within your Databricks environment. With it, teams can securely democratize data access, accelerate AI model development, and streamline testing workflows—all while safeguarding privacy.

Getting Started

Getting started with the Synthetic Data SDK is seamless. With just a few lines of code in your Databricks notebook, you can train and generate privacy-safe synthetic datasets.

1. Install the SDK

Run the following command in your Databricks notebook to install the open-source SDK and necessary packages:

%pip install -U "mostlyai[local]"
dbutils.library.restartPython()

2. Initialize the SDK in Local Mode

Start an SDK instance in local mode, ensuring that all computations happen securely within your environment:

from mostlyai.sdk import MostlyAI

# 1) Initialize the SDK in local mode
mostly = MostlyAI(local=True)

3. Load Data and Write It to a Delta Table (Optional)

Since Delta Lake is the standard storage format in Databricks, we first load the demo data from GitHub and write it to a Delta table in Unity Catalog.

# 2) Load your original data (public demo data)
import pandas as pd

trn_df = pd.read_csv('https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz')

# Convert Pandas DataFrame to Spark DataFrame
spark_df = spark.createDataFrame(trn_df)

# Write to a managed Delta table in Unity Catalog
spark_df.write.format("delta").mode("overwrite").saveAsTable("main.default.census_data")

4. Read from the Delta Table in Unity Catalog

Now, we read the data from the Delta Table we wrote out to Unity Catalog in the previous step so it can be used with the SDK.

# 3) Read data from the Delta table and convert it to a
# pandas DataFrame, which is what the SDK trains on
trn_df = spark.read.table("main.default.census_data").toPandas()

5. Train a Synthetic Data Generator from a Unity Catalog Table

A Synthetic Data Generator is a trained generative model plus its metadata. With a single line of code you can train a Generator: the model learns generalizable patterns and structures from the original data you provide, while ensuring privacy:

# 4) Train a synthetic data generator
g = mostly.train(name='census', data=trn_df)

6. Generate Synthetic Data

With your Generator trained, you can now generate synthetic datasets that mirror the statistical properties of the original data while ensuring privacy.

# 5) Generate a full privacy-preserving synthetic dataset
sd = mostly.generate(g)
syn_df = sd.data()
syn_df
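As a quick, illustrative sanity check (not a substitute for the SDK's built-in privacy mechanisms), you can verify that no original rows were copied verbatim into the synthetic output. The small toy DataFrames below stand in for trn_df and syn_df:

```python
import pandas as pd

def exact_match_rate(original: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Fraction of synthetic rows that appear verbatim in the original data."""
    # Inner-join on all shared columns; de-duplicate the original first
    # so each synthetic row is counted at most once.
    merged = synthetic.merge(original.drop_duplicates(), how="inner")
    return len(merged) / len(synthetic)

# Toy data standing in for trn_df / syn_df
orig = pd.DataFrame({"age": [25, 38, 47], "income": [30_000, 52_000, 61_000]})
syn = pd.DataFrame({"age": [26, 38, 48], "income": [31_000, 52_000, 60_000]})
print(f"exact match rate: {exact_match_rate(orig, syn):.0%}")
```

A low exact-match rate is a necessary but not sufficient condition for privacy; the SDK's training process is designed to prevent memorization in the first place.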

7. Write Synthetic Data to Unity Catalog

Once the synthetic data is generated, we save it back to a Delta table in Unity Catalog, making it easily accessible for downstream analysis, sharing, and AI workloads.

# 6) Write synthetic data to a Delta table in Unity Catalog
syn_spark_df = spark.createDataFrame(syn_df)
syn_spark_df.write.format("delta").mode("overwrite").saveAsTable("main.default.synthetic_census_data")


# Verify the synthetic data is written correctly
display(spark.read.table("main.default.synthetic_census_data"))
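Before sharing the table downstream, it can also help to confirm that the synthetic data preserves basic statistics of the original. A minimal sketch using pandas (the toy DataFrames and column names are illustrative stand-ins for trn_df and syn_df):

```python
import pandas as pd

def compare_means(original: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Compare per-column means of the numeric columns in two DataFrames."""
    num_cols = original.select_dtypes("number").columns
    report = pd.DataFrame({
        "original_mean": original[num_cols].mean(),
        "synthetic_mean": synthetic[num_cols].mean(),
    })
    report["abs_diff"] = (report["original_mean"] - report["synthetic_mean"]).abs()
    return report

# Illustrative toy data standing in for trn_df / syn_df
orig = pd.DataFrame({"age": [25, 38, 47, 52], "hours": [40, 45, 38, 50]})
syn = pd.DataFrame({"age": [27, 36, 49, 50], "hours": [41, 44, 39, 48]})
print(compare_means(orig, syn))
```

Small differences in per-column means are expected; large ones suggest the Generator needs more data or longer training.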

Conclusion

The Synthetic Data SDK, powered by MOSTLY AI, empowers organizations to unlock the full potential of privacy-preserving synthetic data directly within Databricks. By enabling fast, secure, and scalable data generation, it accelerates AI model development, democratizes data access, and ensures compliance with privacy regulations—all without compromising data quality. 

This is just the beginning. There’s much more to explore as synthetic data continues to evolve within Databricks. Stay tuned for future updates, deeper integrations, and expanded capabilities that will make working with data even more seamless and powerful.

Ready to get started? Explore the SDK and detailed documentation on our GitHub repository. Transform the way you work with data today!

See it in action