MOSTLY AI’s synthetic data is highly representative, granular-level data. It can be freely processed, used and shared, while the privacy of every data subject remains fully protected. The broad range of use cases that our platform serves and the ROI that it delivers to our customers are a direct result of our unparalleled accuracy.
Value of Synthetic Data = f(Accuracy)
But how good is it? What does it mean when we state “as good as real” or even “better than real”? How can the accuracy of synthetic data be measured objectively? And how can it be compared across offerings? These questions are among the first asked by any organization getting started on its synthetic data journey, and we are happy to shed more light on them in this blog post.
Along the way, we will also present results from a recent benchmarking study. MOSTLY GENERATE is pitted against two generative models (TVAE & CTGAN) across four distinct datasets. These models are considered to be leading contenders within the AI community. Once more, we are proud to demonstrate the superiority of our synthetic data platform, which provides by far the best accuracy and thus value for our customers. This is synthetic data that is shown over and over again to be accurate enough to replace actual, privacy-sensitive data, whether it’s for testing & development, for advanced analytics, for AI training, or any of your other data initiatives.
Most machine learning applications are trained using supervised learning. This method uses labeled data (commonly referred to as “ground truth”) with the goal to predict a single target variable. By measuring the discrepancy between actuals and predicted values on new, yet unseen labeled data (so-called “holdout data”), it is possible to provide accuracy metrics that can be compared across algorithms. See e.g. here for a range of available metrics.
In contrast, state-of-the-art synthetic data algorithms are trained using self-supervised learning. This method uses available data, without the explicit notion of a target variable. These algorithms attempt to capture the overall underlying structure of a dataset with all its attributes and their distributions, correlations, consistency, coherence (for behavioral data), and so forth. At the same time, they must ensure that any extracted pattern generalizes beyond the given training data, and thus is independent of individual data subjects. A perfect synthetic data generator is capable of generating an arbitrary amount of new data that is indistinguishable from actual holdout data, both in terms of its statistics and its proximity to the provided training data. And thanks to the quickly rising popularity of synthetic data, more and more benchmarks are being established that allow us to compare these algorithms.
In this blog post, we will discuss three distinct approaches to assess the accuracy of synthetic data:
- Descriptive Statistics: Compute summary statistics, as well as visually inspect distributions and relationships for a subset of variables, and compare these to the original data.
- Statistical Distances: Calculate distance measures for the empirical distributions of synthetic data vs. original data, and do so systematically for all univariate and bivariate distributions.
- Machine Learning Performance: Benchmark predictive accuracy of various ML models trained on the synthetic data vs. training on the original data.
The benchmarking study is conducted for 3 generative models (MOSTLY GENERATE, TVAE and CTGAN) and across 4 datasets, with all datasets being publicly available as part of the SDGym library:
– adult: ~23’000 training records, 10’000 holdout records, with 14 mixed-type attributes and one binary target variable (24% class imbalance)
– census: ~200’000 training records, ~100’000 holdout records, with 40 mixed-type attributes and one binary target variable (6% class imbalance)
– credit: ~265’000 training records, ~20’000 holdout records, with 29 numeric attributes and one binary target variable (0.17% class imbalance)
– news: ~33’000 training records, 8’000 holdout records, with 58 mixed-type attributes and one numeric (log-transformed) target variable
Note that these datasets serve us well as a baseline, but it needs to be emphasized that these do not come close to the scale, the diversity, and the complexity of real-world cases that we at MOSTLY AI successfully serve in practice. In particular, these four datasets only contain subject-level attributes, whereas the vast majority of organizations have an urgent need to synthesize behavioral data, a unique yet critical capability of MOSTLY GENERATE. It’s where the exploding volume of data is being collected, but also where the individual’s privacy is at its greatest risk.
A sensible first stab at evaluating synthetic data’s accuracy is to simply plot selected distributions and statistics, and then visually compare these with the original data. While this is a straightforward approach, it needs to be emphasized that the numbers are not expected to match exactly. Only a 1:1 copy of the original data would be able to satisfy the requirement for perfect matches. But then, a 1:1 copy would also expose the privacy of all included data subjects. So, the statistics of a synthetic dataset are expected to deviate from the original, but ideally not significantly more than what we would expect from the sampling variance of an equally sized holdout dataset.
This first chart exhibits a side-by-side comparison of the histograms for the categorical attribute `Education` from the adult dataset. While all three synthesizers seem to fare similarly well, more extensive relative discrepancies can be observed for CTGAN and TVAE for the less common categories.
A look at another categorical attribute of the adult dataset, `Occupation`, exhibits more significant discrepancies for the benchmark models, while MOSTLY GENERATE reliably retains the original distribution: the Craft-repair category is overestimated by CTGAN by nearly four percentage points, and the Tech-support category is nearly non-existent for TVAE. Only MOSTLY AI’s synthetic data remains close to the original distribution, and thus can serve as the foundation for running advanced analytics or evaluating policies on top of it.
Next up, the distribution of the numeric attribute `Age` for the census dataset is compared. Thanks to the visual representation, it is easy to spot the following artifacts in CTGAN’s and TVAE’s output:
1) They both generate values outside the original age range of 0 to 90 years.
2) They each miss the distinct spike of subjects who are exactly 90 years old, as can be seen by the missing small vertical bars at the right end of the plots.
3) They introduce modes and skews that do not exist in the original distribution.
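The first two artifacts can also be checked numerically rather than by eye. The following is a minimal sketch in plain Python, not the code used in the study; the function name `range_and_spike_report` and the toy values are illustrative assumptions.

```python
def range_and_spike_report(original, synthetic):
    """Compare a numeric column of synthetic data against the original.

    Returns the share of synthetic values outside the original range and
    the share of records sitting exactly at the original maximum
    (e.g. the spike at age 90) for both datasets.
    """
    lo, hi = min(original), max(original)
    outside = sum(1 for v in synthetic if v < lo or v > hi)
    return {
        "share_out_of_range": outside / len(synthetic),
        "share_at_max_original": sum(1 for v in original if v == hi) / len(original),
        "share_at_max_synthetic": sum(1 for v in synthetic if v == hi) / len(synthetic),
    }


# Toy values, made up for illustration only.
original_age = [0, 30, 45, 90, 90]
synthetic_age = [-2, 30, 50, 89.5, 95]
report = range_and_spike_report(original_age, synthetic_age)
```

A non-zero `share_out_of_range`, or a synthetic spike share far below the original one, would flag the same issues that the plots reveal.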
These discrepancies tend to get larger once an analysis is run not for one but for multiple dimensions at the same time. The next table reports the average value and the share of positive values within `Capital Gain`, split by Gender. While MOSTLY GENERATE matches all these statistics well, CTGAN misses the gender gap in average capital gain, and both CTGAN and TVAE significantly overrepresent the share of positive values (~55% vs 8%).
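A split analysis of this kind is simple to reproduce on any dataset. Here is a minimal sketch in plain Python; the function name `summarize_by_group`, the dictionary keys and the toy records are illustrative assumptions and are not taken from the benchmark data.

```python
from collections import defaultdict


def summarize_by_group(rows, group_key, value_key):
    """Return the mean and the share of positive values per group."""
    values_by_group = defaultdict(list)
    for row in rows:
        values_by_group[row[group_key]].append(row[value_key])
    return {
        group: {
            "mean": sum(values) / len(values),
            "share_positive": sum(1 for v in values if v > 0) / len(values),
        }
        for group, values in values_by_group.items()
    }


# Toy records, made up for illustration only.
rows = [
    {"gender": "Female", "capital_gain": 0},
    {"gender": "Female", "capital_gain": 100},
    {"gender": "Male", "capital_gain": 0},
    {"gender": "Male", "capital_gain": 0},
    {"gender": "Male", "capital_gain": 300},
    {"gender": "Male", "capital_gain": 500},
]
summary = summarize_by_group(rows, "gender", "capital_gain")
```

Running the same function on the original and on each synthetic dataset yields tables that can be compared side by side.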
As a final example, an analysis of the conditional distributions for `Gender` vs. `Relationship` is displayed below. Whereas nearly (sic!) all wives are female in the original data, CTGAN assigns a third of wives to the male category. MOSTLY GENERATE, on the other hand, is capable of learning and thus retaining the constraint that husbands are exclusively male.
While the above visualizations and charts are useful accuracy indicators, and allow for spotting qualitative issues, they also show the need for a more systematic approach, particularly because the number of bivariate combinations grows quickly with the number of attributes. While the adult dataset, with its 10 attributes, offers 10*9/2 = 45 bivariate distributions to investigate, a dataset with 60 attributes, like the news dataset, already yields 60*59/2 = 1’770 combinations that would need to be checked. Clearly, investigating each one of them doesn’t scale, and one might miss discrepancies that only occur for certain combinations.
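Such conditional distributions boil down to a normalized cross-tabulation. Below is a minimal sketch in plain Python; the helper name `conditional_shares` and the toy records are illustrative assumptions.

```python
from collections import Counter, defaultdict


def conditional_shares(rows, given, target):
    """Return the distribution of `target` within each value of `given`."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[given]][row[target]] += 1
    return {
        g: {t: n / sum(c.values()) for t, n in c.items()}
        for g, c in counts.items()
    }


# Toy records, made up to mimic a synthesizer that breaks the constraint.
rows = (
    [{"relationship": "Husband", "gender": "Male"}] * 3
    + [{"relationship": "Wife", "gender": "Female"}] * 2
    + [{"relationship": "Wife", "gender": "Male"}]
)
shares = conditional_shares(rows, given="relationship", target="gender")
```

In this toy example, a third of the `Wife` records end up in the male category, which is the kind of violation a quick check like this surfaces.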
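The pair counts quoted above follow from the binomial coefficient n*(n-1)/2. As a quick check, here is a short sketch using Python's standard library (`math.comb` requires Python 3.8 or later); the column names are placeholders.

```python
from itertools import combinations
from math import comb


def n_pairs(n_attributes):
    """Number of distinct attribute pairs, i.e. n * (n - 1) / 2."""
    return comb(n_attributes, 2)


pairs_for_10 = n_pairs(10)  # attribute count as quoted in the text
pairs_for_60 = n_pairs(60)

# Enumerating the pairs themselves, with placeholder column names.
columns = ["col_%d" % i for i in range(10)]
all_pairs = list(combinations(columns, 2))
```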
For that reason, MOSTLY GENERATE reports these discrepancies in a fully automated manner, and includes them at granular and aggregate levels in its Quality Assurance report. In terms of measuring the discrepancies themselves, multiple choices are available. We advise looking at the:
1) L1-Distance (L1D), as the sum over all absolute deviations across categorical values.
2) Total Variation Distance (TVD), defined here as the maximum over these deviations, and thus providing an estimated upper bound for discrepancies.
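The two measures listed above can be sketched in a few lines of plain Python. The function names and toy values are illustrative assumptions. Note that textbooks usually define the total variation distance between discrete distributions as half the L1 distance; this sketch follows the definition given above (the maximum absolute deviation) instead.

```python
from collections import Counter


def relative_frequencies(values):
    """Empirical distribution of a categorical column."""
    counts = Counter(values)
    return {category: n / len(values) for category, n in counts.items()}


def deviations(p, q):
    """Absolute deviation per category; missing categories count as 0."""
    return {k: abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q)}


def l1_distance(p, q):
    """L1D: sum over all absolute deviations."""
    return sum(deviations(p, q).values())


def max_deviation(p, q):
    """TVD as described above: maximum over the absolute deviations."""
    return max(deviations(p, q).values())


# Toy columns, made up for illustration only.
original = ["a"] * 5 + ["b"] * 3 + ["c"] * 2   # 50% / 30% / 20%
synthetic = ["a"] * 4 + ["b"] * 3 + ["c"] * 3  # 40% / 30% / 30%
p, q = relative_frequencies(original), relative_frequencies(synthetic)
l1d, tvd = l1_distance(p, q), max_deviation(p, q)
```

In the toy example, two categories are off by ten percentage points each, so the L1D comes out at roughly 0.2 and the maximum deviation at roughly 0.1.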
Further distance measures, like the Hellinger distance and the L2 distance (L2D), can also be computed, but effectively yield similar findings. The following table provides an illustrative example for L1D and TVD:
The same distance measures can also be applied to numerical attributes by binning them into a fixed number of buckets, which converts them into categorical values. One then calculates these errors for all univariate and all bivariate distributions separately, and reports the average across all of these. This calculation is performed for all synthetic datasets, as well as for the equally sized holdout dataset, which serves as a lower bound. Again, given enough training data, a perfect synthesizer should ideally be able to match the accuracy metrics of a holdout dataset.
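The procedure described above (bin numeric columns, then average a distance over all univariate or bivariate distributions) can be sketched as follows. This is an illustration in plain Python under stated assumptions: equal-width buckets derived from the original data, L1 as the distance, and hypothetical helper names. It is not the implementation behind the Quality Assurance report.

```python
import bisect
from collections import Counter
from itertools import combinations


def equal_width_edges(values, n_bins):
    """Inner cut points for n_bins equal-width buckets (an assumption)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def to_bins(values, edges):
    """Map numbers to bucket indices; out-of-range values land in edge buckets."""
    return [bisect.bisect_right(edges, v) for v in values]


def frequencies(keys):
    keys = list(keys)
    return {k: n / len(keys) for k, n in Counter(keys).items()}


def l1(p, q):
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def average_l1(original, synthetic, order):
    """Average L1 over all column combinations of the given order.

    `original` and `synthetic` map column names to lists of categorical
    (or already binned) values; order=1 is univariate, order=2 bivariate.
    """
    scores = []
    for cols in combinations(sorted(original), order):
        p = frequencies(zip(*(original[c] for c in cols)))
        q = frequencies(zip(*(synthetic[c] for c in cols)))
        scores.append(l1(p, q))
    return sum(scores) / len(scores)
```

Calling `average_l1` once with the synthetic data and once with the holdout data in place of `synthetic` gives the two numbers that are compared in the results.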
As can be seen from the following results table, MOSTLY GENERATE not only outperforms existing alternatives by a huge margin, but is also nearly on par with the holdout data.
This lead becomes even more apparent when the error margin is reported not in absolute terms, but as a difference to the theoretical minimum error that is attained by the actual holdout data. While MOSTLY GENERATE is typically less than 1.5% off for L1D and 1% off for TVD from the holdout performance, CTGAN and TVAE are consistently worse, with scores being anywhere between 10 and 45 percentage points off.
Drilling further down provides a column-by-column breakdown, which reveals the strengths and potential weaknesses of each approach at a more granular level. But again, MOSTLY GENERATE consistently performs best across all attributes and across all measures.
It is also possible to extend this approach beyond bivariate distributions towards higher-order multivariate distributions, but the computational effort grows exponentially with the number of attributes. Therefore, one needs to resort to a sufficiently large sample of combinations of three, four or more variables at a time. Given the resulting low frequencies within three-way and four-way contingency tables, the Hellinger distance is more insightful regarding the accuracy than the TVD, as it shifts emphasis onto the many less frequent cells. The following table thus reports average L1D and Hellinger distances for a sample of 100 triplets and 100 quartets for each of the four datasets. These provide further strong evidence of MOSTLY GENERATE’s superior capability to retain statistical properties at the deepest level. The numbers once more show that MOSTLY GENERATE’s synthetic data is as good as actual holdout data.
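To illustrate the sampling idea, here is a minimal sketch in plain Python that draws a random sample of column triplets (or quartets) and averages the Hellinger distance over them. The helper names, the fixed seed and the choice to enumerate all combinations before sampling are assumptions made for brevity; enumeration is fine for a few dozen columns but would need replacing for much wider tables.

```python
import math
import random
from collections import Counter
from itertools import combinations


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 to 1)."""
    total = sum(
        (math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2
        for k in set(p) | set(q)
    )
    return math.sqrt(total / 2)


def joint_frequencies(table, cols):
    """Empirical joint distribution of the given columns."""
    keys = list(zip(*(table[c] for c in cols)))
    return {k: n / len(keys) for k, n in Counter(keys).items()}


def sample_column_sets(columns, order, n_samples, seed=0):
    """Random sample of column combinations of the given order."""
    all_sets = list(combinations(sorted(columns), order))
    rng = random.Random(seed)
    return rng.sample(all_sets, min(n_samples, len(all_sets)))


def average_hellinger(original, synthetic, order, n_samples=100, seed=0):
    """Average Hellinger distance over sampled column combinations."""
    column_sets = sample_column_sets(list(original), order, n_samples, seed)
    scores = [
        hellinger(joint_frequencies(original, cols),
                  joint_frequencies(synthetic, cols))
        for cols in column_sets
    ]
    return sum(scores) / len(scores)
```

Both tables are expected as dictionaries that map column names to equally long lists of categorical (or binned) values.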
Machine Learning Performance
Last, but not least, the accuracy is assessed by training several downstream machine learning models on synthetic data in place of the original data. For that purpose, a particular column is selected for each dataset, which is then modeled using all remaining attributes. A range of standard ML models is fitted and evaluated on an actual holdout set. This setup is identical to our previously published demonstrations on synthetic upsampling, and it follows SDGym’s benchmarking methodology for the given datasets. For the adult, census and credit datasets, a binary target variable is selected, resulting in a classification task. For the news dataset, a numeric variable is chosen as the target, resulting in a regression task. In comparison to SDGym, we add two further state-of-the-art ML models to the evaluation (LightGBM and XGBoost), as well as additional error metrics, to improve the validity and robustness of the findings. Find below the results averaged across all downstream ML models for all four datasets.
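The evaluation logic (train on each candidate training set, always score on the same real holdout) can be sketched as below. To keep the example free of dependencies, a toy nearest-centroid classifier stands in for the actual models; it is a stand-in only and not one of the models used in the study. All names and numbers are illustrative assumptions.

```python
def fit_nearest_centroid(X, y):
    """Toy stand-in model: predict the label of the closest class centroid."""
    sums, counts = {}, {}
    for features, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {
        label: [s / counts[label] for s in acc] for label, acc in sums.items()
    }

    def predict(features):
        return min(
            centroids,
            key=lambda label: sum(
                (a - b) ** 2 for a, b in zip(features, centroids[label])
            ),
        )

    return predict


def accuracy(predict, X, y):
    return sum(predict(f) == t for f, t in zip(X, y)) / len(y)


def compare_training_sources(train_sets, X_holdout, y_holdout,
                             fit=fit_nearest_centroid):
    """Fit one model per training set and score each on the real holdout."""
    return {
        name: accuracy(fit(X, y), X_holdout, y_holdout)
        for name, (X, y) in train_sets.items()
    }


# Toy data, made up for illustration only.
train_sets = {
    "original": ([[0.0], [1.0], [9.0], [10.0]], [0, 0, 1, 1]),
    "good_synthetic": ([[0.5], [1.5], [8.5], [9.5]], [0, 0, 1, 1]),
    "poor_synthetic": ([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]),
}
X_holdout, y_holdout = [[0.2], [2.0], [8.0], [9.8]], [0, 0, 1, 1]
scores = compare_training_sources(train_sets, X_holdout, y_holdout)
```

The `fit` argument can be swapped for any function that takes training features and labels and returns a `predict` callable, so real models can be plugged in the same way.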
And these are the results for a single dataset, reported separately for each individual ML model. As you can see, MOSTLY GENERATE comes out on top across all considered datasets, all considered ML models, and all considered accuracy measures.
We take pride in offering the most accurate, most secure and most user-friendly synthetic data platform on the market. In this blog post we’ve discussed three distinct approaches to assess the accuracy of synthetic data, and presented these in the context of a recently conducted benchmarking study. As was demonstrated, our solution not only outperforms the top contenders from the academic community by a significant margin, but also comes nearly on par with the utility of actual data. This is yet another strong empirical validation that MOSTLY AI’s synthetic data serves well for testing and development, for advanced analytics, for machine learning, and for pretty much any data-related initiative, while ensuring that customers’ privacy remains perfectly protected.
So, no matter which query or which algorithm is being run, no matter which metric is being used, MOSTLY GENERATE continues to emerge on top. And it’s our pledge to our customers that we will keep innovating and improving our algorithms to further improve the accuracy, and thus the value, of our solution for the years to come.
As always, if this triggered your curiosity, please reach out to us and learn more about how you can go synthetic, and become the superhero of your organization who can finally make both data sharing and data protection possible at the same time.
Credits: This work is supported by the “ICT of the Future” funding programme of the Austrian Federal Ministry for Climate Action, Environment, Energy, Mobility, Innovation and Technology.