Fueled by the exponential growth in external data and the use of AI for innovation, Databricks has introduced Clean Rooms to enable organizations to collaborate securely on sensitive data while maintaining privacy. Participants can share and join data, run complex workloads in any language, and retain full control over where and how that data is used - without exposing raw information.

The open-source Synthetic Data SDK from MOSTLY AI amplifies the power of Clean Rooms. Differentially private synthetic data accelerates data access, enables dynamic data generation, provides granular data views, and allows cross-industry insights to be shared safely and flexibly beyond the Clean Room.

In this blog, we’ll explore three ways synthetic data extends the power of Databricks Clean Rooms.

1. Merge Sensitive Data, Share a Synthetic Version

TL;DR: Synthetic data allows participants to create, download, and share a privacy-safe, unified dataset, enabling independent parties to analyze cross-industry insights beyond the Clean Room.

Organizations often hold complementary customer data but cannot share it directly due to privacy regulations. Clean Rooms provide a secure space for merging and analyzing these datasets, but the data always remains inside the Clean Room. By leveraging the Synthetic Data SDK, Databricks customers can generate a synthetic version of the merged dataset and allow both parties to download and independently analyze the new unified insights - all while maintaining strict privacy controls.

The Workflow:

  1. Both parties share their real datasets in a Databricks Clean Room.
  2. Merge datasets on a common ID (e.g. hashed emails) without exposing granular data.
  3. Train a Generator on the combined dataset inside the Clean Room.
  4. Generate a unified synthetic dataset that retains statistical properties while protecting privacy.
  5. Both parties independently download and analyze the granular unified synthetic dataset.

The Code to train your Generator and create the merged synthetic dataset inside a Clean Room:

# install SDK, restart python environment, import libraries 
!pip install -U "mostlyai[local]"
dbutils.library.restartPython() 

import pandas as pd
from mostlyai.sdk import MostlyAI

# initialize the SDK in local or client mode 
mostly = MostlyAI(local=True)

# define cleanroom table paths 
bank_table = "creator.default.bank_a_customer_dataset"
insurance_table = "collaborator.default.insurance_b_customer_dataset"

# load datasets into DFs 
bank_df = spark.read.table(bank_table)
insurance_df = spark.read.table(insurance_table)

# join on common ID 
cleanroom_joined_df = bank_df.join(insurance_df, "ID", "inner")

# set configuration with `ID` as the primary key
config = {
    "name": "cleanroom_synthetic_generator",
    "tables": [
        {
            "name": "customer_data",
            # the SDK trains on a pandas DataFrame, so convert the Spark DF
            "data": cleanroom_joined_df.toPandas(),
            "primary_key": "ID"
        }
    ]
}

# train the generator using the config
generator = mostly.train(config=config, wait=True)

# generate a merged synthetic dataset and retrieve it as a DataFrame
sd = mostly.generate(generator)
synthetic_df = sd.data()

View your merged synthetic dataset on a granular, row-by-row basis directly in the Clean Room. Alternatively, create an output table: notebooks run in a clean room can temporarily save their output to an output catalog in your Unity Catalog metastore, where you can make the data available to team members who can't run the notebooks themselves - as sketched below.
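
A minimal sketch of that last step, using synthetic_df from the code above and a hypothetical output catalog path (substitute the output catalog configured for your clean room):

# hypothetical output location - replace with your clean room's output catalog
output_table = "cleanroom_output.default.unified_synthetic_customers"

# convert the synthetic pandas DataFrame to Spark and persist it as a table
synthetic_spark_df = spark.createDataFrame(synthetic_df)
synthetic_spark_df.write.mode("overwrite").saveAsTable(output_table)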

Here's a full end-to-end video demonstrating how this works:

2. Share Synthetic Data Instead of Sensitive Data

TL;DR: Synthetic data is a fast, safe, and granular alternative to sharing your sensitive data assets.

Databricks Clean Rooms enable secure data collaboration; however, approvals to upload and share sensitive data can take weeks or even months. Once approved, participants can run mutually approved analyses on the data, but their direct access is often limited to a metadata-level view - such as column names, data types, and schema descriptions. While this provides valuable context, deeper insights come from interacting directly with granular, row-level data. By leveraging the open-source Synthetic Data SDK, Databricks customers can stack Privacy-Enhancing Technologies, generating differentially private, high-fidelity synthetic data directly within their local Databricks workspace. Because synthetic data eliminates privacy risks at the source, approvals move faster, enabling analysis at the same granularity as raw data - without the long wait.

The Workflow:

  1. Train a model (Generator) on real data to learn patterns and distributions.
  2. Generate Synthetic Data that preserves statistical properties while ensuring subject-level privacy.
  3. Share the Synthetic Dataset inside the Databricks Clean Room.
  4. Run analyses on the privacy-safe dataset, aided by a granular, row-level view of the data.

The Code to create your synthetic dataset:

# 1) install SDK, restart python environment, import libraries 
!pip install -U "mostlyai[local]"
dbutils.library.restartPython() 

import pandas as pd
from mostlyai.sdk import MostlyAI

# 2) initialize the SDK in local or client mode 
mostly = MostlyAI(local=True)

# 3) read the table into a Spark DF and convert it to Pandas 
spark_df_census = spark.table("jscatalog_productiondata.default.us_census_income")
census = spark_df_census.toPandas()

# 4) train a synthetic data generator 
g = mostly.train(name='census', data=census)

# 5) generate a privacy preserving synthetic dataset
sd = mostly.generate(g, size=1000)
syn_df = sd.data()
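
# (illustrative) sanity check that marginal distributions are preserved;
# 'income' is an assumed column name in this census dataset
print(census['income'].value_counts(normalize=True))
print(syn_df['income'].value_counts(normalize=True))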

# 6) convert to Spark DF and write to Unity Catalog  
syn_df_spark = spark.createDataFrame(syn_df)
syn_df_table_path = "jscatalog_syntheticdata.default.us_census_income_synth"
syn_df_spark.write.mode("overwrite").saveAsTable(syn_df_table_path)
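
Because this section leans on differential privacy, it's worth noting that privacy settings can be passed explicitly at training time. The sketch below is hedged, not a verified recipe: the differential_privacy block and its parameter names (max_epsilon, delta) are assumptions to verify against the Synthetic Data SDK documentation for your version.

# assumption: the training config accepts a differential privacy block;
# confirm the exact keys in the mostlyai SDK documentation
dp_config = {
    "name": "census_dp",
    "tables": [
        {
            "name": "census",
            "data": census,
            "tabular_model_configuration": {
                "differential_privacy": {
                    "max_epsilon": 5.0,  # assumed name for the privacy budget
                    "delta": 1e-5        # assumed name for the DP delta
                }
            }
        }
    ]
}
g_dp = mostly.train(config=dp_config, wait=True)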

Here's a full end-to-end video demonstrating how this works:

3. Share a Model, Not Data

TL;DR: A model is more dynamic and flexible than data (real or synthetic), allowing collaborators to generate exactly the data they need - on demand, inside the Clean Room.

As an alternative to sharing fixed datasets, organizations can share a pre-trained, privacy-preserving generator inside the Clean Room. A generator consists of a GenAI model along with metadata. Collaborators can then flexibly generate synthetic data as needed, adjusting volume, distributions, or conditions - ensuring the data aligns perfectly with their use case, all while maintaining strict privacy controls.

The Workflow:

  1. Train a Generator on real data to capture relationships and distributions.
  2. Share the Generator inside the Databricks Clean Room.
  3. Allow collaborators to dynamically create synthetic data on demand, adjusting volume, features, and distributions to fit their needs.

The Code to save your Generator to a Unity Catalog Volume:

# repeat steps 1-4 from above

# export the generator to the Unity Catalog volume 
generator_file_path = '/Volumes/jscatalog_productiondata/default/generator/us_census_generator.zip'
g.export_to_file(file_path=generator_file_path)

The Code to call your Generator inside a Clean Room:

# import the generator from the Unity Catalog volume 
generator_file_path = '/Volumes/jscatalog_productiondata/default/generator/us_census_generator.zip'
g_imported = mostly.generators.import_from_file(file_path=generator_file_path)

# generate a full privacy preserving synthetic dataset
sd = mostly.generate(g_imported, size=1000)
syn_df = sd.data()
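
Because collaborators hold the generator itself, they are not tied to a single fixed sample. A brief sketch of adjusting volume and steering distributions - the seed-based conditional generation and the 'sex' column are assumptions to check against your SDK version and generator schema:

# adjust volume: draw a larger sample from the same generator
sd_large = mostly.generate(g_imported, size=100_000)

# assumption: seed-based conditional generation is supported, and 'sex'
# is a column in the census schema; the seed steers the output segment
import pandas as pd
seed_df = pd.DataFrame({'sex': ['Female'] * 5_000})
sd_female = mostly.generate(g_imported, seed=seed_df)
syn_female_df = sd_female.data()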

Here's a full end-to-end video demonstrating how this works:

Conclusion

As external data and AI-driven innovation continue to grow exponentially, organizations across all industries are looking for effective ways to collaborate with their partners in a privacy-safe way. Databricks Clean Rooms answer that need, enabling secure collaboration on sensitive data under strict privacy controls.

The next frontier in secure collaboration comes from stacking Privacy-Enhancing Technologies (PETs). Differentially private synthetic data amplifies the power of Clean Rooms - accelerating data access, offering granular data views, enabling dynamic data generation, and allowing cross-industry insights to be shared safely and flexibly beyond the Clean Room.

The combination of Clean Rooms and Differentially Private Synthetic Data represents the future of secure data collaboration in highly regulated industries.