Unlock data democratization and accelerate model development on Amazon SageMaker with privacy preserving Synthetic Data

Data fuels machine learning (ML), but privacy constraints often delay access to sensitive real-world (original) datasets for weeks or months. MOSTLY AI bridges this gap by providing privacy-safe synthetic data in minutes, so ML developers can build and test models in development environments without accessing the sensitive data. This article describes how to use privacy-preserving synthetic data in Amazon SageMaker to democratize data access and speed up the ML lifecycle without compromising sensitive data.

The original data is synthesized in the Production environment, and the generated Synthetic Data is consumed in the Development environment.

Architecture Overview

Each environment runs in its own AWS account with an independent Amazon Virtual Private Cloud (VPC). MOSTLY AI runs inside a dedicated Amazon Elastic Kubernetes Service (EKS) cluster in the Production environment.

The workflow

Step 0: Enable access to original data
Step 1: Create a Generator in the production VPC
Step 2: Create Synthetic Data and write to the development environment
Step 3: Build a model with Synthetic Data in SageMaker Canvas
Step 4: Transition to Production with Original Data

Step 0: Enable access to original data

MOSTLY AI can be deployed on Amazon Elastic Kubernetes Service (EKS), allowing seamless integration with AWS services for secure, scalable synthetic data generation. Once deployed, create two connectors: (1) a source connector to read the original data, and (2) a destination connector to write the synthetic data.

The original data is hosted in an Amazon S3 bucket in the Production VPC.

Two connectors (one as source to the Production VPC and one as destination to the Development VPC) are created in MOSTLY AI.
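The split between a read-only source and a write-only destination can be expressed as two narrowly scoped IAM policies. The following is a minimal sketch, not MOSTLY AI's documented setup: the bucket names are hypothetical, and the exact permissions a connector needs may differ from the ones shown here.

```python
import json

# Hypothetical bucket names; replace them with the buckets in your own accounts.
SOURCE_BUCKET = "example-production-original-data"
DESTINATION_BUCKET = "example-development-synthetic-data"


def bucket_policy(bucket: str, object_actions: list[str]) -> dict:
    """Build an IAM policy document scoped to a single S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Object-level actions are granted on the keys inside the bucket.
                "Effect": "Allow",
                "Action": object_actions,
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


# The source connector may only read; the destination connector may only write.
source_policy = bucket_policy(SOURCE_BUCKET, ["s3:GetObject"])
destination_policy = bucket_policy(DESTINATION_BUCKET, ["s3:PutObject"])

print(json.dumps(source_policy, indent=2))
print(json.dumps(destination_policy, indent=2))
```

Because the two buckets live in different accounts, the destination bucket will typically also need a bucket policy that allows the writing principal. That part is omitted from this sketch.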

Step 1: Create a Generator in the production VPC

Use the MOSTLY AI source connector to create a Generator: a model trained on the original data, packaged together with its metadata.

The read-only connector to the Production VPC allows the user to use that data to create a Generator.

Step 2: Create Synthetic Data and write to the development environment

With the trained Generator, create a privacy-preserving synthetic dataset that mirrors the original data's structure and statistical properties. Use the destination connector to write the synthetic data directly to the Amazon S3 bucket in the development environment.

The connector to the Development VPC allows the user to write synthetic data for use and consumption by a wider audience.
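Before sharing a synthetic dataset, the data owner may want an independent sanity check that it mirrors the original's structure and distributions. The sketch below is one simple way to do that, not MOSTLY AI's own quality report: it compares column names and measures the total variation distance between the category shares of one column. The rows are invented toy values, and the column names are assumptions.

```python
from collections import Counter


def same_columns(original: list[dict], synthetic: list[dict]) -> bool:
    """Check that both datasets expose the same set of column names."""
    return set(original[0]) == set(synthetic[0])


def category_shares(rows: list[dict], column: str) -> dict:
    """Return the relative frequency of each value in one column."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def total_variation_distance(original: list[dict], synthetic: list[dict], column: str) -> float:
    """Half the summed absolute share differences: 0.0 is identical, 1.0 is disjoint."""
    p = category_shares(original, column)
    q = category_shares(synthetic, column)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in set(p) | set(q))


# Invented toy rows for illustration only.
original_rows = [
    {"education": "Bachelors", "income": ">50K"},
    {"education": "HS-grad", "income": "<=50K"},
    {"education": "HS-grad", "income": "<=50K"},
    {"education": "Masters", "income": ">50K"},
]
synthetic_rows = [
    {"education": "HS-grad", "income": "<=50K"},
    {"education": "Bachelors", "income": ">50K"},
    {"education": "HS-grad", "income": "<=50K"},
    {"education": "Masters", "income": "<=50K"},
]

print(same_columns(original_rows, synthetic_rows))
for column in ("education", "income"):
    print(column, total_variation_distance(original_rows, synthetic_rows, column))
```

This check needs the original data, so it belongs on the data owner's side of the boundary. Only the resulting distances, not the rows, need to leave the Production environment.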

Confirm that the synthetic data is accessible in the designated target bucket (SyntheticDatasets.Adult in our example) and ready for the ML developer to use in model testing and development. This approach drastically reduces the time needed to access quality data: only the data owner touches the original data, while the synthetic data can be shared with multiple data consumers because it preserves privacy and retains the statistical quality of the original data.

The synthetic data is available in an Amazon S3 bucket in the Development VPC.
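The confirmation step can also be scripted. The sketch below lists the keys under a prefix through the S3 ListObjectsV2 call, which boto3 exposes as `list_objects_v2`. The bucket name, prefix, and file name are hypothetical, and `OfflineS3Client` is only a stand-in so the example runs without AWS credentials.

```python
def list_synthetic_objects(s3_client, bucket: str, prefix: str = "") -> list[str]:
    """Return the object keys found under a prefix in the given bucket."""
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    # "Contents" is absent from the response when no keys match.
    return [obj["Key"] for obj in response.get("Contents", [])]


class OfflineS3Client:
    """Stand-in for boto3.client("s3"), holding one hypothetical object key."""

    def __init__(self, keys):
        self._keys = list(keys)

    def list_objects_v2(self, Bucket, Prefix=""):
        matches = [{"Key": key} for key in self._keys if key.startswith(Prefix)]
        if not matches:
            return {"KeyCount": 0}
        return {"KeyCount": len(matches), "Contents": matches}


client = OfflineS3Client(["adult/synthetic.csv"])
keys = list_synthetic_objects(client, "example-development-synthetic-data", "adult/")
print(keys if keys else "No synthetic data found yet")
```

Against a real account, pass `boto3.client("s3")` instead of the stand-in. A single call returns at most 1,000 keys, so larger prefixes would need pagination.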

Step 3: Build a model with Synthetic Data in SageMaker Canvas

With the synthetic dataset now in your development environment, it’s time to start model development in Amazon SageMaker Canvas, which enables you to build your own AI/ML models without having to write a single line of code.

With SageMaker Canvas and fast access to synthetic data in the development environment, teams can iterate more quickly on downstream ML tasks such as classification, prediction, or clustering, and train models safely and efficiently without risking individual privacy.

Begin by uploading the synthetic dataset from the development Amazon S3 bucket into SageMaker Canvas, where users can train and evaluate models on data that mirrors real patterns but remains privacy-safe.

The data users create downstream ML tasks in SageMaker Canvas.

The data user imports the synthetic data into SageMaker Canvas.

Synthetic data is privacy-preserving and can be freely shared and visualized.

In SageMaker Canvas, run a training cycle on the synthetic dataset. Choose between Quick or Standard builds, explore various algorithms or ensembles, and assess model performance using metrics like accuracy and recall. This step provides ML developers with a realistic preview of model behavior, enabling secure testing and optimization without requiring access to sensitive data.

The data user creates a basic ML model with Synthetic data in SageMaker Canvas.

Step 4: Transition to Production with Original Data

After completing model testing on synthetic data, the ML developer hands over the optimized model configuration to the data owner, who retrains the model on the original data in the Production environment. ML developers can thus iterate and refine models quickly in development, while the final production model benefits from the full depth of the original data.

Validation

Finally, to showcase the effectiveness of synthetic data, you can compare models trained on synthetic and original data side by side. The objective is for the model trained on synthetic data to perform comparably to one trained on the original dataset. If the performance metrics align, this indicates that MOSTLY AI's synthetic data provides a reliable foundation for model development and supports its utility for real-world applications.

The metrics of the models created with original and synthetic data are on par.
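A side-by-side comparison of this kind can be sketched in a few lines. The example below assumes both models were scored on the same holdout labels, computes accuracy and recall for each, and flags whether the gap stays within a tolerance. The labels, predictions, and the 0.05 tolerance are invented for illustration and are not results from this article.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the model predicted as positive."""
    positives = [(t, p) for t, p in zip(y_true, y_pred) if t == positive]
    if not positives:
        return 0.0
    return sum(p == positive for _, p in positives) / len(positives)


def compare_models(y_true, preds_original, preds_synthetic, tolerance=0.05):
    """Report each metric for both models and whether the gap is within tolerance."""
    report = {}
    for name, metric in (("accuracy", accuracy), ("recall", recall)):
        original_score = metric(y_true, preds_original)
        synthetic_score = metric(y_true, preds_synthetic)
        report[name] = {
            "original": original_score,
            "synthetic": synthetic_score,
            "on_par": abs(original_score - synthetic_score) <= tolerance,
        }
    return report


# Invented holdout labels and predictions, for illustration only.
holdout_labels = [1, 0, 1, 1, 0, 0, 1, 0]
preds_from_original_model = [1, 0, 1, 1, 0, 0, 0, 0]
preds_from_synthetic_model = [1, 0, 1, 1, 0, 1, 0, 0]

report = compare_models(holdout_labels, preds_from_original_model, preds_from_synthetic_model)
for metric_name, scores in report.items():
    print(metric_name, scores)
```

With only eight rows, a single differing prediction moves accuracy by 0.125, so a meaningful comparison needs a much larger holdout set and a tolerance chosen for the use case.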

Conclusion

The integration of MOSTLY AI's privacy-preserving synthetic data capabilities with Amazon SageMaker enables ML developers to access high-quality synthetic datasets in minutes rather than weeks or months, thereby dramatically accelerating the model development lifecycle. This use case demonstrates that synthetic data can match the performance of original data in downstream ML tasks. Furthermore, this approach democratizes access to valuable data resources within organizations, leading to more innovative solutions, as teams can experiment freely and iterate rapidly without being hindered by data access constraints or privacy concerns.

The MOSTLY AI Platform is available for purchase on the AWS Marketplace.