If you’d like to recreate this experiment yourself, follow along in the companion notebook.

Introduction

This blog is the first in a three-part series of experiments comparing the synthetic data generation capabilities of two leading platforms: Synthetic Data Vault (SDV) and MOSTLY AI. These experiments expand on an earlier comparison of the two platforms, which we published using a smaller dataset; the goal of this series is to highlight how these tools handle datasets containing hundreds of thousands or even millions of rows.

SDV is a source-available Python library developed by DataCebo and distributed under the Business Source License. It provides a framework for generating synthetic data that mimics real-world tabular data. SDV supports single-table, multi-table, and time-series data, and offers multiple modeling techniques, including traditional statistical methods and deep generative models such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). This experiment focuses on a single-table use case; we’ll explore these modeling approaches in more detail throughout the series.

The Synthetic Data SDK by MOSTLY AI is an open-source Python library for creating privacy-preserving synthetic datasets that retain the structure, statistical properties, and business logic of real data. MOSTLY AI uses TabularARGN, an advanced deep generative model, to simulate realistic data across structured formats, supporting a wide range of use cases such as AI training, software testing, data monetization, and regulatory compliance. Beyond the Synthetic Data SDK, MOSTLY AI offers a user-friendly platform, enterprise-grade APIs, automated privacy risk assessments, and built-in support for sensitive data handling, including Personally Identifiable Information (PII).

This experiment was performed on a MacBook Pro with an Apple M4 processor and 32GB of memory; different hardware is likely to produce different results.

Experiment Setup

For the single-table scenario, we’ll use the publicly available American Community Survey (ACS) median household income dataset with 1.4 million rows, available in the MOSTLY AI Public Datasets repository. This dataset contains 15 columns recording demographic information about the person described in each row. Generating high-fidelity synthetic data from large datasets often poses a challenge for less advanced tools, which tend to overfit or collapse rare categories. The combination of numerical and categorical data, together with its size, makes this dataset ideal for testing the capabilities of both frameworks.

A preview of the ACS median household income data used in this experiment

Before training, we’ll create an 80/20 training and testing split to ensure that the holdout data can be used to validate the generative performance of each framework on unseen data. This approach allows us to measure how well the synthetic data replicates patterns not just from the training data, but from the overall distribution. 
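A minimal sketch of this split using pandas and scikit-learn is shown below; the file name is illustrative, and the companion notebook contains the exact loading steps:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the ACS median household income dataset (file name is illustrative).
df = pd.read_csv("acs-income.csv")

# 80/20 split with a fixed seed. The holdout set is never shown to either
# generator and is used only to validate performance on unseen data.
train_df, holdout_df = train_test_split(df, test_size=0.2, random_state=42)
```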

Training and Generation

For a single-table use case like the one in this experiment, SDV recommends using the Gaussian Copula Synthesizer, while MOSTLY AI will use TabularARGN. By default, MOSTLY AI automatically detects a suitable model configuration for a given dataset to deliver high-quality synthetic data, and it also offers the ability to plug in any publicly available language model of your choice.

Both generators were trained on the same 1.1 million rows of training data. After training, each framework generated 1.4 million rows of synthetic data in roughly the same amount of time.
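For reference, here is a condensed sketch of how both generators can be driven. It follows the public APIs of the sdv and mostlyai packages as of recent 1.x releases, but argument names may differ across versions, so treat it as illustrative rather than a drop-in script:

```python
# --- SDV: Gaussian Copula Synthesizer ---
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(train_df)   # infer column types from the data

sdv_synthesizer = GaussianCopulaSynthesizer(metadata)
sdv_synthesizer.fit(train_df)              # train on the 1.1M training rows
sdv_synthetic = sdv_synthesizer.sample(num_rows=1_400_000)

# --- MOSTLY AI: TabularARGN via the Synthetic Data SDK ---
from mostlyai.sdk import MostlyAI

mostly = MostlyAI(local=True)              # run the SDK in local mode
generator = mostly.train(data=train_df, name="acs-income")
sd = mostly.generate(generator, size=1_400_000)
mostly_synthetic = sd.data()               # fetch results as a DataFrame
```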

The Synthetic Data SDK from MOSTLY AI provides a convenient status tracker directly in the notebook so you can follow job progress.

The Synthetic Data SDK by MOSTLY AI offers a number of tools to enhance the user experience

Quality Assessment and Comparison

To assess the quality of each synthetic dataset, we’ll use the Synthetic Data Quality Assurance framework by MOSTLY AI to compare the generated data against both the training and holdout sets. Our evaluation will examine how accurately the synthetic data reproduces univariate, bivariate, and trivariate distributions found in the original dataset, helping us gauge each model’s ability to maintain key statistical relationships. We will also conduct a similarity analysis to determine whether the synthetic data aligns more closely with the training data or whether it generalizes well enough to reflect characteristics of the holdout set as well.
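This evaluation can be scripted with the open-source mostlyai-qa package. A minimal sketch, with argument names following the package’s public API at the time of writing:

```python
from mostlyai import qa

# Generate an HTML QA report plus aggregate metrics for one synthetic
# dataset, benchmarked against both the training and holdout splits.
report_path, metrics = qa.report(
    syn_tgt_data=mostly_synthetic,   # swap in sdv_synthetic for the SDV run
    trn_tgt_data=train_df,
    hol_tgt_data=holdout_df,
)
print(metrics)  # accuracy, similarity, and distance (privacy) metrics
```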

We’ll assess dataset privacy using Distance to Closest Record (DCR) metrics. This involves computing how far each synthetic record is from its nearest counterpart in both the training and testing datasets. Higher distances suggest stronger privacy but weaker utility, as they indicate that synthetic records are not close replicas of real individuals. A balanced DCR Share around 0.5 is generally interpreted as optimal, suggesting the synthetic data achieves utility while maintaining privacy. All of these elements will be aggregated into a final scorecard that allows for a direct comparison of the two frameworks across fidelity, generalization, and privacy dimensions.
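To make the DCR Share metric concrete, here is an illustrative computation on numerically encoded data. The actual QA framework handles mixed-type encoding and scaling internally, so this is a sketch of the idea rather than the production implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dcr_share(synthetic, train, holdout):
    """Fraction of synthetic records closer to training than to holdout.

    A value near 0.5 means synthetic records are no closer to real training
    individuals than to unseen holdout individuals, i.e. no memorization.
    """
    d_train, _ = NearestNeighbors(n_neighbors=1).fit(train).kneighbors(synthetic)
    d_holdout, _ = NearestNeighbors(n_neighbors=1).fit(holdout).kneighbors(synthetic)
    closer = d_train.ravel() < d_holdout.ravel()
    ties = d_train.ravel() == d_holdout.ravel()   # ties split evenly
    return (closer.sum() + 0.5 * ties.sum()) / len(synthetic)
```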

You can review the SDV and MOSTLY AI QA reports generated by the Synthetic Data Quality Assurance framework for a complete analysis of this experiment’s results.

The Model Report provides detailed information about model performance, MOSTLY AI (left) and SDV (right) are shown

The accuracy of the data generated with SDV is 52.7%, while the accuracy achieved with MOSTLY AI for the same operation is 97.8%.

Exploring the accuracy of the data generated with the Synthetic Data SDK by MOSTLY AI in more detail, we can see that even at the bivariate and trivariate levels, MOSTLY AI generates significantly more accurate data, maintaining correlations between features that less robust solutions fail to capture. In the trivariate analysis, MOSTLY AI outperforms SDV by more than 60 percentage points. While SDV performs adequately on univariate analyses (71.7%), its accuracy degrades as the analyses become more complex, falling to a trivariate score of only 35.4%.

Both tools performed well when it comes to privacy, with MOSTLY AI achieving a DCR Share of 0.503 and SDV earning a score of 0.530. A score close to 0.500 implies strong privacy protection without sacrificing data utility.

Another stark difference between the two solutions is the Discriminator Area Under Curve (AUC) score. This metric assesses whether a discriminative model can distinguish between real samples and generated ones after mapping them into a meaningful embedding space. A Discriminator AUC value close to 50% implies that the synthetic data is indistinguishable from real samples. We see a Discriminator AUC score of 100% for SDV-generated records, meaning the discriminator separates them from real data perfectly, but a strong 59.6% for MOSTLY AI.
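As an illustration of how such a score can be produced, the sketch below trains a simple classifier to tell real rows from synthetic ones and measures its AUC. The QA framework computes this in an embedding space; here the embeddings (real_emb, syn_emb) are assumed to be precomputed NumPy arrays:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Label real rows 1 and synthetic rows 0, then see whether a
# discriminator can separate them. real_emb / syn_emb are assumed
# precomputed embeddings; the QA framework derives these internally.
X = np.vstack([real_emb, syn_emb])
y = np.concatenate([np.ones(len(real_emb)), np.zeros(len(syn_emb))])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # ~0.5 is ideal
```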

Conclusions

This single-table scenario experiment shows significant differences in the quality of generated synthetic data: MOSTLY AI generated synthetic data with an overall accuracy of 97.8%, while SDV generated data with an accuracy of 52.7%. In short, accuracy measures how close generated synthetic data is to the real-world data used to train the underlying models. A 45-percentage-point difference in overall accuracy is significant and shows the strength of the Synthetic Data SDK by MOSTLY AI on sufficiently large datasets. When exploring accuracy more deeply, the difference between the two solutions was even starker. While this difference might not matter when synthetic data is used as mock or dummy data, it is highly relevant when synthetic data is used for data science purposes such as training a downstream ML model.

The Synthetic Data SDK by MOSTLY AI is shown to generate highly accurate synthetic data while still preserving data subject privacy

Install the Synthetic Data SDK by MOSTLY AI today to experience best-in-class synthetic data generation with unparalleled levels of accuracy and stay tuned for our next experiment: Multi-Table Scenarios.