To truly unlock the power of AI, organizations need faster, more flexible access to high-quality data. But slow, restricted data access and centralized data teams often create bottlenecks. Privacy-preserving synthetic data eliminates these barriers, enabling intelligent, on-demand data access. With MOSTLY AI's open-source Synthetic Data SDK, organizations can generate high-fidelity synthetic data directly within their local AWS compute environment, including SageMaker Unified Studio. This empowers teams to democratize analytics, accelerate AI model development, and streamline software testing, all without relying on sensitive or restricted datasets.

Getting Started

Getting started is as easy as executing one line of code in your SageMaker notebook:

!pip install -U 'mostlyai[local]'
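If you want to confirm that the install succeeded before moving on, a quick check like the one below can help. This is a minimal sketch using only the Python standard library. The helper name `is_installed` is our own and not part of the SDK, and it assumes the package exposes a top-level `mostlyai` module, as the import in the next step suggests.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: the SDK's top-level module is named `mostlyai`.
print("mostlyai importable:", is_installed("mostlyai"))
```

If this prints `False`, re-running the install cell is a reasonable first thing to try.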

Once installed, you can launch an SDK instance in local mode:

# 1) Initialize the SDK in local mode
import pandas as pd
from mostlyai.sdk import MostlyAI
mostly = MostlyAI(local=True)

Then load your original data. In this example we're loading the popular adult census dataset:

# 2) Load your original data
trn_df = pd.read_csv('https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz')
trn_df.head()

And train your first synthetic data generator:

# 3) Train a synthetic data generator
g = mostly.train(name='census', data=trn_df)

Now you can probe your generator live, on demand. In this example we'll probe for 10 rows of synthetic census data:

# 4) Live probe the generator
df_samples = mostly.probe(g, size=10)
df_samples

And there you have it! High-fidelity privacy-preserving synthetic data generated directly in your SageMaker environment, with just a few lines of code!
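A natural next step is a quick sanity check that the synthetic rows look like the originals. The sketch below is one simple way to do that, not a feature of the SDK: it compares the relative frequency of each category in a single column. The helper names `category_shares` and `compare_column` are hypothetical, and the toy data is made up for illustration. Both helpers only need `data[column]` to return an iterable of values, so they should work with pandas DataFrames such as `trn_df` and `df_samples` as well as with plain dicts of lists.

```python
from collections import Counter


def category_shares(values):
    """Return the relative frequency of each distinct value."""
    values = list(values)
    if not values:
        return {}
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


def compare_column(original, synthetic, column):
    """Map each category to its (original share, synthetic share).

    Categories missing from one side get a share of 0.0 there.
    """
    orig = category_shares(original[column])
    syn = category_shares(synthetic[column])
    categories = sorted(set(orig) | set(syn), key=str)
    return {c: (orig.get(c, 0.0), syn.get(c, 0.0)) for c in categories}


# Toy data (made up) standing in for the original and synthetic tables.
toy_original = {"sex": ["Male", "Male", "Female", "Female"]}
toy_synthetic = {"sex": ["Male", "Female", "Female", "Female"]}
print(compare_column(toy_original, toy_synthetic, "sex"))
```

With the DataFrames from the steps above, the call would look like `compare_column(trn_df, df_samples, column)` for a categorical column of your choice. A 10-row probe is too small for the shares to be meaningful, so a larger `size` gives a fairer comparison.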

Video

Check out this short video where Julio walks through the entire flow:

Conclusion

In one of our previous blog posts, we demonstrated how to unlock data democratization and accelerate model development on Amazon SageMaker with privacy-preserving synthetic data. Now, with our open-source Synthetic Data SDK, organizations can take an even more frictionless approach to unlocking the power of AI in AWS, without the limitations of real-world data.