[00:00:00] Hi, welcome to this tutorial about Train-Synthetic-Test-Real.
[00:00:05] In this tutorial, you will learn how to assess the quality of synthetic data specifically for use with downstream machine-learning models.
[00:00:14] As the name suggests, we will be training using synthetic data and testing the model performance using real data.
[00:00:23] In this way, we'll be simulating what happens in a real-world machine learning project in which a model is trained on historical data but has to perform on data it has never seen before.
[00:00:34] All of the material for this tutorial is publicly available in this notebook, so you can follow along and you can even change the code to try a different data set or a different model or even a different synthesizer.
[00:00:46] Let's jump in.
[00:00:48] Let's first take a high-level look at what we'll be doing in this tutorial. As this diagram shows, we'll start with our actual data set, which we split into a section used for training and a part used for evaluation.
[00:01:02] The part used for evaluation is called the holdout set. It is very important that the holdout set is never used to train the machine learning model that we want to evaluate.
[00:01:12] In this way, we simulate the real-world situation in which machine learning models are trained on historical data but perform on unseen data.
[00:01:23] With our training dataset, we do two things. We first generate synthetic data, which we will then use to train machine learning model one.
[00:01:34] We then train another machine learning model directly on the training data.
[00:01:40] We then evaluate both of these models on the holdout dataset, which is real unseen data, and we then compare the performance.
[00:01:48] In the ideal situation, we will see that the model trained on synthetic data comes close to the performance of the model trained on real data.
[00:01:56] This means that we can confidently train our model on synthetic data
[00:02:00] knowing that we have none of the privacy risks, but also lose nothing in terms of performance.
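The workflow just described can be sketched end to end. Everything below is a stand-in: randomly generated data in place of the Adult dataset, a jittered copy of the training data in place of a real synthesizer such as MOSTLY AI, and logistic regression in place of the tutorial's model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in "real" data: two features with a noisy linear relationship to the target.
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Step 1: split off a holdout set that no model ever trains on.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: stand in for the synthesizer by jittering the training data.
# (A real synthesizer learns the distribution and samples a brand-new dataset.)
X_syn = X_train + rng.normal(scale=0.1, size=X_train.shape)
y_syn = y_train

# Step 3: train one model on synthetic data and one on real training data.
m_syn = LogisticRegression().fit(X_syn, y_syn)
m_real = LogisticRegression().fit(X_train, y_train)

# Step 4: evaluate both on the same real, unseen holdout set.
auc_syn = roc_auc_score(y_hold, m_syn.predict_proba(X_hold)[:, 1])
auc_real = roc_auc_score(y_hold, m_real.predict_proba(X_hold)[:, 1])
```

If the synthesizer has captured the real patterns, the two AUC scores come out close together, which is exactly the comparison the rest of the tutorial performs.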
[00:02:06] Let's jump into some code and see what this looks like in practice. All right, so before we can do anything, we will need some data.
[00:02:14] In this tutorial, we'll be using the UCI Adult Income dataset. The dataset consists of about 50,000 records and 15 columns; 14 of these columns we will use as our features to predict one target variable: whether or not a respondent has reported a high annual income.
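That feature/target structure can be sketched as follows; the values and the three columns shown here are illustrative assumptions (the real dataset has all 15):

```python
import pandas as pd

# Hypothetical mini version of the Adult dataset; `income` is the target column.
df = pd.DataFrame({
    "age": [39, 50, 38],
    "education": ["Bachelors", "Bachelors", "HS-grad"],
    "income": ["<=50K", ">50K", "<=50K"],
})

# In the real dataset, 14 feature columns and one target; here, 2 and 1.
X = df.drop(columns="income")
y = (df["income"] == ">50K").astype(int)  # 1 = high income
```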
[00:02:34] To make it just a little bit easier for you, we have already split the dataset for you, so you can go ahead and download the training part of the dataset by clicking here,
[00:02:46] and then using Command or Ctrl+S to save the dataset to your disk. I've already done so, but you would just click Save here. This is an 80% sample of the full dataset; the other 20% will be used as the holdout set for evaluation.
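The 80/20 split itself can be done with scikit-learn. The tutorial provides pre-split files, so the toy frame below is only a sketch of how such a split is produced:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the full dataset (hypothetical columns).
df = pd.DataFrame({
    "age": range(100),
    "high_income": [i % 2 for i in range(100)],
})

# 80% for training (and synthesis), 20% held out strictly for evaluation.
train, holdout = train_test_split(df, test_size=0.2, random_state=42)
```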
[00:03:06] The next step is to synthesize this original training dataset using MOSTLY AI. Navigate to your MOSTLY AI account and to the Jobs tab, and then create a new job by uploading the training data.
[00:03:23] You can just use the default settings here, and this will take a couple of minutes, so I'll see you in a little bit.
[00:03:30] All right, our synthesis job has completed. We see the job here, it's finished, and with this colorful button you can download your synthetic data as either CSV or Parquet. Let's do CSV.
[00:03:47] Again, I've done this already, but here you can save it to disk. We then go back to our notebook and upload the generated synthetic data into the notebook using this cell.
[00:04:01] We'll run the cell and then choose the synthetic dataset and upload it. All right, now that it has been uploaded into our notebook, we can explore this synthetic dataset a little bit. For example, by sampling 10 random records.
[00:04:18] or by taking a slightly more specific look, if you know some Python. For example, looking at female professors aged 30 or younger, and sampling five of those records. If we know our real dataset well, we can start to compare here to see whether the synthetic data shows similar patterns.
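A query like that might look as follows in pandas. The column names (`sex`, `occupation`, `age`) and the tiny frame are assumptions based on the standard Adult dataset, not taken from the actual notebook:

```python
import pandas as pd

# Hypothetical mini synthetic sample with Adult-style columns.
syn = pd.DataFrame({
    "sex": ["Female", "Female", "Male", "Female"],
    "occupation": ["Prof-specialty"] * 3 + ["Sales"],
    "age": [28, 45, 30, 29],
})

# Female professors aged 30 or younger.
subset = syn[
    (syn.sex == "Female")
    & (syn.occupation == "Prof-specialty")
    & (syn.age <= 30)
]
sample = subset.sample(n=1, random_state=0)  # sample a few such records
```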
[00:04:49] Mostly we're interested in our target feature, which is the high-income respondents. Let's see what the distribution looks like in the synthetic dataset. We see about 30,000 low-income respondents and about 10,000 high-income. We could get more specific here, looking at non-US citizens that have been divorced, for example. This just gives us some sense of what our dataset looks like.
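Counting the target classes is a one-liner with pandas; again, the column name `income` and the tiny frame are illustrative assumptions:

```python
import pandas as pd

# Hypothetical miniature of the synthetic Adult data.
syn = pd.DataFrame({"income": ["<=50K"] * 6 + [">50K"] * 2})

# Class counts for the target; in the tutorial this shows ~30,000 vs ~10,000.
counts = syn["income"].value_counts()
```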
[00:05:16] What we're really here for is to compare the performance of this synthetic data on a machine learning task. Specifically, we'll train a LightGBM classifier on the synthetic data, train the same classifier on the real data, and then compare the performance of both. Ideally, we'd see very similar performance, which would give us confidence that our synthetic data accurately represents the patterns seen in the real data, with the benefit of significantly reducing the risk of leaking private information.
[00:05:49] Let's define the code to be able to run our models. If you don't know much Python, don't worry about it. We've done this for you. You can just run the cell. Now we're all set to run our experiment. We will be training a model on the synthetic data as well as training a separate model on the real data, and then evaluating both of these using the holdout set and then comparing their performance.
[00:06:11] In order to measure the performance, we'll be using two metrics: accuracy and AUC, the area under the ROC curve. Together, these two metrics will give us a complete picture
[00:06:21] of our model's performance. In this cell, we will prepare the synthetic data, train the model, and then evaluate.
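A sketch of this train-and-evaluate step: the tutorial uses a LightGBM classifier, but to keep this snippet dependency-free it substitutes scikit-learn's GradientBoostingClassifier on generated stand-in data, so the scores will not match the ones shown in the video:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in data; in the notebook this would be the (synthetic) Adult training set.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Gradient-boosted trees as a stand-in for the tutorial's LightGBM classifier.
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out portion, using the tutorial's two metrics.
acc = accuracy_score(y_hold, model.predict(X_hold))
auc = roc_auc_score(y_hold, model.predict_proba(X_hold)[:, 1])
```

The same two lines of evaluation are then repeated for the model trained on real data, which is what makes the comparison fair: both models face the identical holdout set.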
[00:06:30] We see that we get an accuracy score of 87.1 and an AUC of 92.3. The scores are also broken down by the actual outcome class. We now have the performance of the model trained on synthetic data. Let's run the same model on our real dataset and compare.
[00:06:50] We do the same thing for the real dataset. We see that we get an accuracy of 86.9 and an AUC of 92.8.
[00:06:58] These look similar to the synthetic-data scores, but just to be sure, let's create a little table to compare them side by side.
[00:07:07] We see that the accuracy on our synthetic dataset was 87.1% and on the real dataset 86.9%, while the AUC was 92.3 for the synthetic dataset and 92.8 for the real dataset.
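Building that comparison table takes only a few lines of pandas, using the scores read out for each model above:

```python
import pandas as pd

# Metric values as read out in the tutorial (percent).
results = pd.DataFrame(
    {"accuracy": [87.1, 86.9], "auc": [92.3, 92.8]},
    index=["synthetic", "real"],
)

# Absolute gap between the two models on each metric.
gap = (results.loc["synthetic"] - results.loc["real"]).abs()
```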
[00:07:21] In both cases, the difference is half a percentage point or less. The two models perform nearly identically, which gives us confidence that the synthetic dataset accurately models the underlying patterns that were present in the real data.
[00:07:38] In this way, we can preserve the privacy of the respondents in our dataset by using synthetic data to train our production-grade machine-learning model.
[00:07:46] All right, and with this we come to the end of the Train-Synthetic-Test-Real tutorial. Of course, don't stop here, there's a lot more you can do. You can plug in different data sets, including your own. You can try out different machine learning models as well as different synthesizers to compare performance.
[00:08:01] If you're interested in comparing performance across synthesizers, take a look at our benchmarking blog in which we compare eight different synthesizers on four different data sets.
[00:08:11] Thank you so much for taking the time to watch this tutorial.
[00:08:15] We hope it was helpful, and please feel free to reach out to us with any questions or suggestions.
[00:08:19] We'd love to hear from you. Thanks for watching.