Fueled by the exponential growth in external data and the use of AI for innovation, Databricks has introduced Clean Rooms to enable organizations to collaborate securely on sensitive data while maintaining privacy. Participants can share and join data, run complex workloads in any language, and retain full control over where and how that data is used - without exposing raw information.

The open-source Synthetic Data SDK from MOSTLY AI amplifies the power of Clean Rooms. Differentially private synthetic data accelerates data access, enables dynamic data generation, provides granular data views, and allows cross-industry insights to be shared safely and flexibly beyond the Clean Room.

In this blog, we’ll explore three ways synthetic data extends the power of Databricks Clean Rooms.

1. Merge Sensitive Data, Share a Synthetic Version

TL;DR: Synthetic data allows participants to create, download, and share a privacy-safe, unified dataset, enabling independent parties to analyze cross-industry insights beyond the Clean Room.

Organizations often hold complementary customer data but cannot share it directly due to privacy regulations. Clean Rooms provide a secure space for merging and analyzing these datasets, but the data always remains inside the Clean Room. By leveraging the Synthetic Data SDK, Databricks customers can generate a synthetic version of the merged dataset and allow both parties to download and independently analyze the new unified insights - all while maintaining strict privacy controls.

The Workflow:

  1. Both parties share their real datasets in a Databricks Clean Room.
  2. Merge datasets on a common ID (e.g. hashed emails) without exposing granular data.
  3. Train a Generator on the combined dataset inside the Clean Room.
  4. Generate a unified synthetic dataset that retains statistical properties while protecting privacy.
  5. Both parties independently download and analyze the granular unified synthetic dataset.

The Code to train your Generator and create the merged synthetic dataset inside a Clean Room:

# install SDK, restart python environment, import libraries 
!pip install -U "mostlyai[local]"
dbutils.library.restartPython() 

import pandas as pd
from mostlyai.sdk import MostlyAI

# initialize the SDK in local or client mode 
mostly = MostlyAI(local=True)

# define cleanroom table paths 
bank_table = "creator.default.bank_a_customer_dataset"
insurance_table = "collaborator.default.insurance_b_customer_dataset"

# load datasets into DFs 
bank_df = spark.read.table(bank_table)
insurance_df = spark.read.table(insurance_table)

# join on common ID 
cleanroom_joined_df = bank_df.join(insurance_df, "ID", "inner")

# set configuration with `ID` as the primary key
config = {
    "name": "cleanroom_synthetic_generator",
    "tables": [
        {
            "name": "customer_data",
            # the SDK trains on a pandas DataFrame, so convert the Spark DF
            "data": cleanroom_joined_df.toPandas(),
            "primary_key": "ID"
        }
    ]
}

# train the generator using the config
generator = mostly.train(config=config, wait=True)

# generate a merged synthetic dataset and retrieve it as a DataFrame
sd = mostly.generate(generator)
synthetic_df = sd.data()

View your merged synthetic dataset on a granular, row-by-row basis directly in the Clean Room. Alternatively, create an output table: notebooks run in a clean room can temporarily save their output to an output catalog in your Unity Catalog metastore, where you can make the data available to team members who can't run the notebooks themselves - as sketched below.
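
A minimal sketch of that last step, using synthetic_df from the code above and a hypothetical output catalog path (substitute the output catalog configured for your clean room):

# hypothetical output location - replace with your clean room's output catalog
output_table = "cleanroom_output.default.unified_synthetic_customers"

# convert the synthetic pandas DataFrame to Spark and persist it as a table
synthetic_spark_df = spark.createDataFrame(synthetic_df)
synthetic_spark_df.write.mode("overwrite").saveAsTable(output_table)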

Here's a full end-to-end video demonstrating how this works:

2. Share Synthetic Data Instead of Sensitive Data

TL;DR: Synthetic data is a fast, safe, and granular alternative to sharing your sensitive data assets.

Databricks Clean Rooms enable secure data collaboration; however, approvals to upload and share sensitive data can take weeks or even months. Once approved, participants can run mutually approved analyses on the data, but their direct access is often limited to a metadata-level view - such as column names, data types, and schema descriptions. While this provides valuable context, deeper insights come from interacting directly with granular, row-level data. By leveraging the open-source Synthetic Data SDK, Databricks customers can stack Privacy-Enhancing Technologies, generating differentially private, high-fidelity synthetic data directly within their local Databricks workspace. Because synthetic data eliminates privacy risks at the source, approvals move faster, enabling analysis at the same granularity as raw data - without the long wait.

The Workflow:

  1. Train a model (Generator) on real data to learn patterns and distributions.
  2. Generate Synthetic Data that preserves statistical properties while ensuring subject-level privacy.
  3. Share the Synthetic Dataset inside the Databricks Clean Room.
  4. Run analyses on the privacy-safe dataset, aided by a granular, row-level view of the data.

The Code to create your synthetic dataset:

# 1) install SDK, restart python environment, import libraries 
!pip install -U "mostlyai[local]"
dbutils.library.restartPython() 

import pandas as pd
from mostlyai.sdk import MostlyAI

# 2) initialize the SDK in local or client mode 
mostly = MostlyAI(local=True)

# 3) read the table into a Spark DF and convert it to Pandas 
spark_df_census = spark.table("jscatalog_productiondata.default.us_census_income")
census = spark_df_census.toPandas()

# 4) train a synthetic data generator 
g = mostly.train(name='census', data=census)

# 5) generate a privacy preserving synthetic dataset
sd = mostly.generate(g, size=1000)
syn_df = sd.data()
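
# (illustrative) sanity check that marginal distributions are preserved;
# 'income' is an assumed column name in this census dataset
print(census['income'].value_counts(normalize=True))
print(syn_df['income'].value_counts(normalize=True))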

# 6) convert to Spark DF and write to Unity Catalog  
syn_df_spark = spark.createDataFrame(syn_df)
syn_df_table_path = "jscatalog_syntheticdata.default.us_census_income_synth"
syn_df_spark.write.mode("overwrite").saveAsTable(syn_df_table_path)
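
Because this section leans on differential privacy, it's worth noting that privacy settings can be passed explicitly at training time. The sketch below is hedged, not a verified recipe: the differential_privacy block and its parameter names (max_epsilon, delta) are assumptions to verify against the Synthetic Data SDK documentation for your version.

# assumption: the training config accepts a differential privacy block;
# confirm the exact keys in the mostlyai SDK documentation
dp_config = {
    "name": "census_dp",
    "tables": [
        {
            "name": "census",
            "data": census,
            "tabular_model_configuration": {
                "differential_privacy": {
                    "max_epsilon": 5.0,  # assumed name for the privacy budget
                    "delta": 1e-5        # assumed name for the DP delta
                }
            }
        }
    ]
}
g_dp = mostly.train(config=dp_config, wait=True)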

Here's a full end-to-end video demonstrating how this works:

3. Share a Model, Not Data

TL;DR: A model is more dynamic and flexible than data (real or synthetic), allowing collaborators to generate exactly the data they need - on demand, inside the Clean Room.

As an alternative to sharing fixed datasets, organizations can share a pre-trained, privacy-preserving generator inside the Clean Room. A generator consists of a GenAI model along with metadata. Collaborators can then flexibly generate synthetic data as needed, adjusting volume, distributions, or conditions - ensuring the data aligns perfectly with their use case, all while maintaining strict privacy controls.

The Workflow:

  1. Train a Generator on real data to capture relationships and distributions.
  2. Share the Generator inside the Databricks Clean Room.
  3. Allow collaborators to dynamically create synthetic data on demand, adjusting volume, features, and distributions to fit their needs.

The Code to save your Generator to a Unity Catalog Volume:

# repeat steps 1-4 from above

# export the generator to the Unity Catalog volume 
generator_file_path = '/Volumes/jscatalog_productiondata/default/generator/us_census_generator.zip'
g.export_to_file(file_path=generator_file_path)

The Code to call your Generator inside a Clean Room:

# import the generator from the Unity Catalog volume 
generator_file_path = '/Volumes/jscatalog_productiondata/default/generator/us_census_generator.zip'
g_imported = mostly.generators.import_from_file(file_path=generator_file_path)

# generate a full privacy preserving synthetic dataset
sd = mostly.generate(g_imported, size=1000)
syn_df = sd.data()
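
Because collaborators hold the generator itself, they are not tied to a single fixed sample. A brief sketch of adjusting volume and steering distributions - the seed-based conditional generation and the 'sex' column are assumptions to check against your SDK version and generator schema:

# adjust volume: draw a larger sample from the same generator
sd_large = mostly.generate(g_imported, size=100_000)

# assumption: seed-based conditional generation is supported, and 'sex'
# is a column in the census schema; the seed steers the output segment
import pandas as pd
seed_df = pd.DataFrame({'sex': ['Female'] * 5_000})
sd_female = mostly.generate(g_imported, seed=seed_df)
syn_female_df = sd_female.data()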

Here's a full end-to-end video demonstrating how this works:

Conclusion

As external data and AI-driven innovation continue to grow exponentially, organizations across all industries are looking for effective ways to collaborate with their partners in a privacy-safe way. Databricks Clean Rooms answer that need, enabling secure collaboration on sensitive data under strict privacy controls.

The next frontier in secure collaboration comes from stacking Privacy-Enhancing Technologies (PETs). Differentially private synthetic data amplifies the power of Clean Rooms - accelerating data access, offering granular data views, enabling dynamic data generation, and allowing cross-industry insights to be shared safely and flexibly beyond the Clean Room.

The combination of Clean Rooms and Differentially Private Synthetic Data represents the future of secure data collaboration in highly regulated industries.