November 14, 2023
13m 39s

Quality assurance for synthetic data tutorial

Welcome to our comprehensive tutorial on mastering Quality Assurance for synthetic data. Understanding the quality of your synthetic data is crucial for everyone working with it. In this detailed walkthrough, we'll delve deep into the key concepts behind MOSTLY AI's QA reports, focusing on both the privacy and accuracy aspects of synthetic data generation.

📈 What You'll Learn:

- Understand the fundamentals of MOSTLY AI's QA reports.
- Navigate and interpret different sections of a QA report, including the model and data QA reports.
- Grasp the significance of univariate and bivariate distributions, correlations, accuracy, and different privacy metrics in QA.
- Explore a practical coding session to approximate MOSTLY AI's calculations for these metrics.
- Discover how to generate, analyze, and evaluate synthetic data using MOSTLY AI.

Dataset and code: https://bit.ly/47e5lKq

Synthetic data generation platform: https://bit.ly/3M8Lhkb

Key moments:

00:00 - Overview of MOSTLY AI's QA Reports

00:04 - Introduction to the key concepts behind MOSTLY AI's QA reports

00:16 - Explaining how MOSTLY AI quantifies both the privacy and accuracy parts of its QA reports.

00:21 - Guide on how to navigate to the QA report section in MOSTLY AI after running synthetic data generation jobs.

00:35 - Deep Dive into QA Report Details

00:43 - Exploration of QA reports, including correlations, accuracy, distributions, and privacy.

00:51 - Starting the walkthrough of Python code that approximates how MOSTLY AI calculates its QA metrics.

01:31 - Demonstrating data synthesis using MOSTLY AI with the UCI Adult Income data set.

02:14 - Analyzing the QA report generated from the synthetic data job.

02:52 - Step-by-step guide to calculate both accuracy and privacy metrics manually.

02:59 - Checking Python library versions and preparing the target data set for analysis.

07:53 - Creating plots for univariate and bivariate accuracy metrics using Python.

09:17 - Explanation of how MOSTLY AI calculates privacy metrics, including distance measurements and nearest neighbor analysis.

Transcript

[00:00:00] Welcome to this tutorial on quality assurance. In this tutorial you will learn the key concepts behind MOSTLY AI's QA reports. This will help you understand how MOSTLY AI quantifies both the privacy and the accuracy parts of its QA reports. If you've run any synthetic data generation jobs in MOSTLY AI, you will likely have encountered a QA report like this already.

[00:00:21] So if you're in your synthetic data set section, you can click on any job that's complete and navigate to the QA report section here. This will give you some basic information about the data set and some average metrics for the accuracy as well as for the privacy. You can then dig deeper into the model QA report or the data QA report, looking at correlations, accuracy, univariate distributions, bivariate distributions and privacy.

[00:00:51] In this tutorial, you will walk through some code that approximates the ways in which MOSTLY AI calculates these numbers. We've made some minor changes to the code just to make it more legible and to make it fit within the length of a hopefully not too long video tutorial but basically this will give you a good sense of how MOSTLY AI works under the hood, to give you these numbers that you see here.

[00:01:12] If you'd like to move beyond the key concepts, I recommend checking out our documentation, our peer-reviewed journal paper, as well as the benchmarking article on our blog, which walks through all of this step by step and really digs into each of the concepts.

[00:01:34] We'll start by taking an original data set, in this case the UCI Adult Income data set, and synthesizing it in MOSTLY AI. We'll do that with default settings and then we'll just take a quick look at the QA report that MOSTLY AI outputs for you.

[00:01:45] We'll then proceed to run through some code, step by step, to calculate both accuracy and privacy metrics ourselves and get a better understanding of how this works. So let's jump in. We'll start by just checking our numpy and pandas versions. If you're running something other than pandas 2.0, please run this pip install line here to make sure you have the right version. We can then access the target data set. This is the original US Census data set, and we can use it to create a synthetic version.
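The version check described here can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual cell: the helper `major_version` is a hypothetical name, and the exact pandas pin used in the pip install line is not stated in the transcript.

```python
import numpy as np
import pandas as pd


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '2.0.3'."""
    return int(version.split(".")[0])


print("numpy :", np.__version__)
print("pandas:", pd.__version__)

# Assumption: the tutorial only says "pandas 2.0"; the exact pin is not given.
if major_version(pd.__version__) != 2:
    print("Consider installing a pandas 2.x release before continuing.")
```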

[00:02:20] Navigate to your MOSTLY AI account and start a new synthetic data generation job. We will access the original data set here and upload it. And we can just launch the job straight away with the default settings. I've already launched and completed this job here, so we can use that to take a look at the QA report. So this is our synthetic data ready for us and if we take a look at the QA report we see here the basic data set information.

[00:02:50] We see an overall accuracy of 98.8, split into a univariate and a bivariate part, which we'll talk about more soon, and we see some metrics for the privacy. And of course you can dig deeper to look at univariate and bivariate distributions, which we'll do in a little bit. But this is just to give us an overall sense of the kind of information that MOSTLY AI provides us with once we have the synthetic data set generated.

[00:03:16] Let's now load this synthetic data set into Colab. I should download it first, of course. Save that to disk and we can then upload the file here, and just to confirm that this works, let's sample five records of each. And it looks like the data has loaded in correctly. So now that we have this, let's see if we can replicate the accuracy metrics that we saw in our QA report. We can go back here, look at the QA report, and we see an accuracy of 98.8, split into univariate and bivariate.

[00:03:55] Let's remember those. So just to give you an overall intuition of how this works. So both for the accuracy and the privacy, as we'll see later, what we're doing essentially is measuring the distance between the synthetic data set records and the original records and we want those distances between the synthetic and the original records to be no larger than the difference between the original records themselves. To explain this a little better, let's take a look at the blog article here and we see this diagram which shows the way that this works.

[00:04:22] So we'll start with the actual data set. We have our UCI Adult Income data set and we split that 50/50 into a training data set and a holdout or evaluation data set. We train our generator on the 50% of training data and use that to create a synthetic data set. We will then compare the difference between the synthetic data set here and the training data set to the difference between the training data set and the holdout data set, which both come from the actual data set itself.

[00:04:57] And what we want to see here is that the differences between the synthetic and the training data are no bigger than the differences between the holdout and the training data. That would give us a good indication that both for privacy and for accuracy we have done a good job.
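The 50/50 split into training and holdout data can be sketched like this. It is a minimal illustration under assumptions: the function name `split_train_holdout`, the seed, and the toy columns are hypothetical stand-ins, not the code or data used in the video.

```python
import numpy as np
import pandas as pd


def split_train_holdout(df: pd.DataFrame, seed: int = 0):
    """Randomly split a DataFrame into two halves: training and holdout."""
    shuffled = df.sample(frac=1.0, random_state=seed)
    half = len(shuffled) // 2
    return shuffled.iloc[:half], shuffled.iloc[half:]


# Toy stand-in for the original data set (hypothetical columns and values).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 90, size=1000),
    "relationship": rng.choice(["Husband", "Wife", "Own-child"], size=1000),
})

train, holdout = split_train_holdout(toy)
print(len(train), len(holdout))
```

Only the training half would be used to train the generator; the holdout half is kept aside purely as a yardstick for the later comparisons.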

[00:05:10] So for the accuracy metric we are calculating something called the total variational distance. We won't go into the details of what this means exactly, but essentially, again, we're looking at the sum of deviations between the distributions.

[00:05:26] And what we're aiming for is for the synthetic data records to be as different from the training records as the holdout ones are from the training records, because if the synthetic data records are closer to the training records then that means they are more similar to the training records than the holdout ones which means that we've probably just learned specific characteristics of the training data set, which means that our privacy is compromised.

[00:05:51] However, if the synthetic data records are significantly more different from the training data set than the holdout ones are, then that means we've lost accuracy and our data utility is compromised.

[00:06:04] So we'll take the total variational distance and we subtract it from one, or from 100 if we want a percentage, and that is how we get the percentage accuracy value that we see in the accuracy report. If you want to learn more about this implementation and the mathematical details behind how this works, I would recommend checking out the peer-reviewed journal paper, available here, which digs in a lot deeper.

[00:06:26] But for now let's take a look at this code. First, we have to pre-process the data to put it into bins in order to treat all the variables as categoricals, and then we can define a helper script to calculate the accuracies. We will restrict ourselves to just 100,000 records and then we'll go ahead and bin the data and calculate the univariate accuracies.
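The binning step and the "one minus total variational distance" accuracy can be sketched as follows. This is an approximation under stated assumptions, not the notebook's code: the function names, the choice of ten quantile bins, and the toy age data are all hypothetical, and the distance is computed in its standard form as half the sum of absolute differences between category frequencies.

```python
import numpy as np
import pandas as pd


def bin_numeric(target: pd.Series, synthetic: pd.Series, bins: int = 10):
    """Bin a numeric column using quantile edges taken from the target data."""
    edges = np.unique(np.quantile(target, np.linspace(0, 1, bins + 1)))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the target range
    # Cast the bins to strings so they behave like ordinary categories.
    return pd.cut(target, edges).astype(str), pd.cut(synthetic, edges).astype(str)


def univariate_accuracy(target: pd.Series, synthetic: pd.Series) -> float:
    """Return 1 - TVD between the category frequencies of two columns."""
    p = target.value_counts(normalize=True)
    q = synthetic.value_counts(normalize=True)
    p, q = p.align(q, fill_value=0.0)  # categories missing on one side count as 0
    tvd = 0.5 * (p - q).abs().sum()
    return 1.0 - tvd


# Small categorical example: frequencies 0.5/0.5 versus 0.75/0.25.
acc_small = univariate_accuracy(
    pd.Series(["a", "a", "b", "b"]), pd.Series(["a", "a", "a", "b"])
)
print(acc_small)

# Toy numeric example (hypothetical data standing in for an "age" column).
rng = np.random.default_rng(0)
tgt_age = pd.Series(rng.normal(40, 12, size=5000))
syn_age = pd.Series(rng.normal(41, 12, size=5000))
tgt_bin, syn_bin = bin_numeric(tgt_age, syn_age)
acc_age = univariate_accuracy(tgt_bin, syn_bin)
print(f"age accuracy: {acc_age:.1%}")
```

Multiplying the result by 100 gives the percentage form shown in the QA report.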

[00:06:49] So these are the accuracies of, in this case, just five of the columns with respect to themselves. We can also calculate the bivariate accuracies which measure the accuracy of the relationship between two columns. And once we have those we can calculate the average bivariate accuracy which is the average accuracy of a column to all of the other columns in the data set.

[00:07:08] And then we get, per column, the univariate and bivariate accuracies. We can then take the average of this column and of the bivariate column and the average of those two to get our overall accuracy score. In this case, 98.8 which is actually exactly what we have here in the QA report.
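The aggregation just described (per-column univariate accuracy, per-column average bivariate accuracy, then the mean of the two averages) can be sketched as below. The function names and the toy data are hypothetical, and the columns are assumed to be already binned into categories; the actual implementation differs in its details.

```python
import itertools

import numpy as np
import pandas as pd


def accuracy(target: pd.DataFrame, synthetic: pd.DataFrame, cols) -> float:
    """1 - TVD over the joint frequencies of one or more categorical columns."""
    cols = list(cols)
    p = target.groupby(cols).size() / len(target)
    q = synthetic.groupby(cols).size() / len(synthetic)
    p, q = p.align(q, fill_value=0.0)
    return 1.0 - 0.5 * (p - q).abs().sum()


def overall_accuracy(target: pd.DataFrame, synthetic: pd.DataFrame):
    """Return a per-column accuracy table and the overall accuracy score."""
    cols = list(target.columns)
    uni = {c: accuracy(target, synthetic, [c]) for c in cols}
    pair_acc = {
        pair: accuracy(target, synthetic, pair)
        for pair in itertools.combinations(cols, 2)
    }
    # Per column: average accuracy over all pairs that include that column.
    bi = {c: np.mean([a for pair, a in pair_acc.items() if c in pair]) for c in cols}
    per_column = pd.DataFrame({"univariate": uni, "bivariate": bi})
    score = (per_column["univariate"].mean() + per_column["bivariate"].mean()) / 2
    return per_column, score


# Tiny hypothetical example with two categorical columns.
tgt = pd.DataFrame({"x": ["a", "a", "b", "b"], "y": ["u", "v", "u", "v"]})
syn = pd.DataFrame({"x": ["a", "a", "a", "b"], "y": ["u", "v", "u", "v"]})
per_column, score = overall_accuracy(tgt, syn)
print(per_column)
print(score)
```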

[00:07:32] So we've replicated that well and the univariate and bivariate come quite close. It's exactly the same for the bivariate and the univariate is just 0.1% off, so this is because the implementation is slightly different but we're actually essentially getting the same results here. Now let's take a look at how we implement the visualization part of the accuracy score.

[00:07:53] So here we have the Python code to create the accuracy plots. This, again, is the modified, trimmed-down version of the code, so it might look long but this is actually the concise version, because it takes a lot of tweaking to accommodate specific edge cases and to get the plots right.

[00:08:06] But we can run this code and then use it to create our univariate plots. In this case, we're just plotting five randomly selected columns and we see here a relationship plot with the synthetic and the target distributions.

[00:08:30] And if we compare that to the univariate distributions in our QA report for the relationship column then we should see something very similar. Of course the plot is formatted a little bit differently but those look like they match up nicely. And you can of course take some more time to look at each column here and explore them for yourself. We can also plot the bivariate plots, getting here, for example, the occupation and the relationship.
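The numbers behind such a univariate plot can be sketched as a side-by-side frequency table. This is a simplified stand-in for the plotting code in the video: the function name `frequency_table` and the sample values are hypothetical, and the plotting itself is left out.

```python
import pandas as pd


def frequency_table(target: pd.Series, synthetic: pd.Series) -> pd.DataFrame:
    """Side-by-side relative frequencies of each category, in percent."""
    table = pd.DataFrame({
        "target": target.value_counts(normalize=True),
        "synthetic": synthetic.value_counts(normalize=True),
    }).fillna(0.0)  # a category missing on one side has frequency 0
    return (table * 100).round(1).sort_values("target", ascending=False)


# Hypothetical values standing in for the "relationship" column.
tgt_rel = pd.Series(["Husband", "Husband", "Wife", "Own-child"])
syn_rel = pd.Series(["Husband", "Wife", "Wife", "Own-child"])
table = frequency_table(tgt_rel, syn_rel)
print(table)
```

With matplotlib installed, calling `table.plot.bar()` would draw the two distributions next to each other, which is roughly what the univariate charts in the QA report show.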

[00:09:00] And if we go back to our QA report and go to the bivariate, we look for the occupation and relationship columns. All right, now let's proceed to look at how MOSTLY AI calculates the privacy metrics. And again, here, for privacy we're looking at a measure of distance. Specifically we're looking at the synthetic data records and their nearest neighbor in the original data set and that could be either in the training or in the holdout data set.

[00:09:36] So just like with accuracy, we start with an actual data set and we split that into two and create a synthetic data set based on the training data set. We then take samples from this synthetic data set and for each synthetic data record, we look for its nearest neighbor, or the most similar record, in the original data set.

[00:10:00] We mark whether that most similar record, the nearest neighbor, is in the holdout or in the training data set. And when we've done that for all of the synthetic data records, we hope to get a 50-50 split. This would indicate that the synthetic data set is no closer to the training data set than to the holdout data set.

[00:10:14] This would indicate that we haven't just copied exact records from the training data set, which would mean that we're running the risk of revealing private information, but rather that we've extrapolated to the general statistical trends.

[00:10:26] So to look at another diagram here, we'll take the original records, turn those into synthetic records and then for each synthetic record, check whether it's closer to a training or to the holdout record and sum those all up and hopefully we end up with a 50% split.
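The "closer to training or to holdout" check can be sketched as follows. This is a minimal illustration under assumptions: it uses a brute-force Euclidean search in plain NumPy rather than a nearest-neighbor library, it assumes the records are already encoded as numeric vectors, and counting ties as half is a choice made here, not something stated in the tutorial.

```python
import numpy as np


def nearest_distance(queries: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Euclidean distance from each query row to its closest reference row."""
    diffs = queries[:, None, :] - reference[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)


def share_closer_to_training(synthetic, training, holdout) -> float:
    """Fraction of synthetic records whose nearest neighbor is a training record."""
    d_train = nearest_distance(synthetic, training)
    d_hold = nearest_distance(synthetic, holdout)
    closer_train = (d_train < d_hold).sum()
    ties = (d_train == d_hold).sum()  # assumption: a tie counts half for each side
    return float((closer_train + 0.5 * ties) / len(synthetic))


# Toy example: one synthetic record near each of the two original records.
training = np.array([[0.0, 0.0]])
holdout = np.array([[10.0, 10.0]])
synthetic = np.array([[1.0, 1.0], [9.0, 9.0]])
share = share_closer_to_training(synthetic, training, holdout)
print(share)
```

A share close to 0.5 is the hoped-for outcome; a share well above 0.5 would suggest the synthetic records sit unusually close to the training records.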

[00:10:45] So here we are using scikit-learn and by running this code we are calculating the distances to the nearest neighbors. We're then running a k-nearest neighbor search for both the training and the synthetic data sets.

[00:11:00] And once this is done, we can calculate two privacy metrics: the normalized distance to the closest record, that's the DCR over here, and the nearest neighbor distance ratio. For both of them we're using the fifth percentile.
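A rough sketch of these two metrics is shown below. It rests on several assumptions: the records are already numeric vectors, the search is brute force in plain NumPy instead of a nearest-neighbor library, the normalization step of the DCR is left out, and the nearest neighbor distance ratio is taken as the distance to the closest record divided by the distance to the second-closest one. The function names and the random toy data are hypothetical.

```python
import numpy as np


def two_nearest_distances(queries: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Sorted distances from each query to its two closest reference rows."""
    diffs = queries[:, None, :] - reference[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return np.sort(dists, axis=1)[:, :2]


def privacy_metrics(queries: np.ndarray, reference: np.ndarray, percentile: float = 5):
    """Fifth-percentile DCR and NNDR of the queries relative to the reference."""
    nearest = two_nearest_distances(queries, reference)
    dcr = nearest[:, 0]
    # Guard against division by zero when the second neighbor is an exact match.
    nndr = dcr / np.maximum(nearest[:, 1], 1e-8)
    return {
        "dcr": float(np.percentile(dcr, percentile)),
        "nndr": float(np.percentile(nndr, percentile)),
    }


# Hand-checkable example: distances 3 and 4 give DCR 3 and NNDR 0.75.
m = privacy_metrics(np.array([[3.0, 0.0]]), np.array([[0.0, 0.0], [3.0, 4.0]]))
print(m)

# Toy comparison (random data): synthetic versus holdout, both against training.
rng = np.random.default_rng(0)
training = rng.normal(size=(200, 3))
holdout = rng.normal(size=(200, 3))
synthetic = rng.normal(size=(200, 3))
print("synthetic:", privacy_metrics(synthetic, training))
print("holdout  :", privacy_metrics(holdout, training))
```

Comparing the synthetic values against the holdout values is one way to read the "should not be lower than the original" rule; treating the holdout set as the reference for the original is an assumption of this sketch.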

[00:11:16] So we calculate these metrics here and, without going into the details of exactly how this is calculated, what's important to know is that the value that we get for the synthetic metric in both cases should not be lower than the value we get for the original. This is a measurement of distance, and the two values should be similar, but the synthetic one should not be lower.

[00:11:35] All right, and with this we come to the end of this tutorial on quality assurance. There's of course a lot more you could do here. You could try the same exercise out with different data sets to really get the hang of how this works.

[00:11:50] And if you're interested in diving deeper into the mathematical operations underlying this quality assurance framework, then I would recommend checking out our peer-reviewed journal paper, which goes a lot deeper into how the framework works exactly. And if you're interested to see how MOSTLY AI compares in performance to other synthetic data generators out there, then I would recommend reading the rest of the blog post that I've been referring to in this tutorial, in which we evaluate MOSTLY AI against seven other synthetic data generators on four unique data sets.

[00:12:23] And if you do work through this code and run into any questions or things that you would like to share, please don't hesitate to reach out to us. We always love hearing from our users. Thank you so much for your time and see you in the next one!
