October 12, 2023
9m 52s

Fake or real? How to use a data discriminator for evaluating synthetic data quality

Dive deep into this tutorial on how to construct a machine learning model that effectively differentiates between fake and real data records. Whether you're trying to assess the quality of synthetic data or separate the real from the synthetic records in a hybrid dataset, this guide will prove invaluable.

Here is what you'll learn:

00:00-00:02 - Fake vs Real Data Discriminator
00:05-00:20 - Purpose, Applications, and Evaluating Synthetic Data Quality
00:26-00:38 - Criteria for Realism in Synthetic Data
00:45-01:00 - Merging Real & Synthetic Data
01:11-01:18 - Impact of Limited Training Samples & MOSTLY AI's High-Quality Default Dataset
01:30-01:45 - Data Generation with MOSTLY AI
01:58-02:14 - Generating Low-Quality and High-Quality Synthetic Data in Parallel
02:37-03:00 - Job Completion, Downloads, and Uploading Data to Google Colab
03:06-03:53 - From the Low-Quality Dataset to Evaluation on the Holdout Dataset
04:02-05:10 - Dataset Overview, Discriminator's Performance, AUC Interpretation

Replicate the experiment with your data of choice using MOSTLY AI's state-of-the-art synthetic data generation platform: https://bit.ly/43IGYSv

Transcript

[00:00:00] Hi, and welcome to this tutorial on building a fake versus real discriminator.

[00:00:02] In this tutorial, you will learn how to build a machine-learning model that is trained to distinguish between fake and real data records.

[00:00:08] This can be helpful in and of itself if you're given a hybrid dataset and want to pull out just the synthetic or just the real records.

[00:00:16] It's also another interesting measure to add to our toolbox in order to evaluate the quality of our synthetic data.

[00:00:23] The more realistic the synthetic data records are, the harder it will be for a machine learning model to distinguish between those and the real records.

[00:00:33] In this tutorial, we'll be doing the following. We'll start with an actual dataset.

[00:00:38] Again, the UCI adult income dataset, and we'll use that to create a synthetic version.

[00:00:45] We'll then merge these two to create one dataset containing half real records and half fake records.

[00:00:52] We'll then train a machine learning classifier on this hybrid dataset and then evaluate how it does on a holdout dataset.

[00:01:00] In order to make the analysis a bit more interesting, we'll start by intentionally creating a lower-quality synthetic dataset by limiting the number of training samples to just 1,000 records.

[00:01:11] This will make it easier for our machine-learning model to pick up a signal and distinguish the synthetic from the real.

[00:01:14] We'll then compare this against a synthetic dataset generated using MOSTLY AI's high-quality default settings to see if the machine learning model is still able to pick up a signal and tell the real from the fake.

[00:01:28] Let's jump in. We'll start by generating our data.

[00:01:32] You can download the original dataset by clicking here and then pressing control or command S depending on your operating system to save this to disk.

[00:01:45] Next, you can head to your MOSTLY AI account and start a new job to create a synthetic version of this dataset.

[00:01:58] Remember that we'll start by intentionally creating a low-quality synthetic dataset. We'll set the training size to just 1,000 samples here instead of the full 49,000 records that are available.

[00:02:14] This is now running and we can at the same time also launch our other job using MOSTLY AI's default high-quality settings. We'll append HQ to the name to be able to tell the two jobs apart, but other than that, we can just use the default settings and launch the job.

[00:02:37] These will take a couple of minutes to complete, so I'll be back when they're done. We're back. Once these two jobs have completed, you can download the data as CSV and then we can return to our notebook.

[00:03:00] Since we're working in Google Colab, we will have to upload the data here. Let's start by working with the low-quality synthetic dataset. All right, and once that's been uploaded, we can continue.
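
If you want to follow along in a notebook of your own, a minimal sketch of the Colab upload step might look like the following; the variable and file names are placeholders, not the notebook's exact code.

```python
# Sketch of the Colab upload step; the file you pick depends on which export you downloaded.
import pandas as pd
from google.colab import files

uploaded = files.upload()          # opens a file picker in the browser
fname = next(iter(uploaded))       # name of the uploaded CSV
fake = pd.read_csv(fname)          # the (low-quality) synthetic dataset
print(fake.shape)
```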

[00:03:17] The following cell defines two functions to prepare or pre-process our data and to train our LightGBM model, which will classify between the fake and the real data records. We can just run this and in the next cell, we concatenate the fake and the real datasets together to have a single hybrid dataset.

[00:03:40] We then take out a 20% holdout dataset for evaluation and we train our machine learning discriminator on the remaining 80% of the training dataset.

[00:03:53] Next, we will evaluate on the holdout dataset assigning a probability to each record, whether it's fake or real. You can see that here we have the holdout dataset and we have the split containing whether the record is fake or real and the probability that the model has assigned.
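
For readers following the transcript without the notebook open, here is a minimal sketch of what these steps might look like. The file names, the `is_fake` label column, and the LightGBM settings are assumptions for illustration, not the notebook's exact code.

```python
# Minimal sketch of the fake-vs-real discriminator; names and settings are
# assumptions, not the notebook's exact code.
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split

real = pd.read_csv("adult-original.csv")       # placeholder file names
fake = pd.read_csv("adult-synthetic-lq.csv")

# Label every record and build one hybrid dataset, half real and half fake.
real["is_fake"] = 0
fake["is_fake"] = 1
hybrid = pd.concat([real, fake], ignore_index=True)

# LightGBM handles categoricals natively once they are typed as pandas categories.
for col in hybrid.select_dtypes(include="object").columns:
    hybrid[col] = hybrid[col].astype("category")

# 20% holdout for evaluation, 80% for training the discriminator.
train, holdout = train_test_split(hybrid, test_size=0.2, random_state=42)

model = lgb.LGBMClassifier(n_estimators=200)
model.fit(train.drop(columns="is_fake"), train["is_fake"])

# Score the holdout: predicted probability that each record is fake.
holdout = holdout.copy()
holdout["prob_fake"] = model.predict_proba(holdout.drop(columns="is_fake"))[:, 1]
print(holdout[["is_fake", "prob_fake"]].head())
```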

[00:04:13] We can then visualize this over the whole dataset to get a good sense of how our discriminator performed on this dataset. Remember, this is the low-quality synthetic dataset trained on just 1,000 training samples.

[00:04:25] We should expect the model to be able to pick up on some signal here. If we look at the performance measures graphed here, we see indeed that the area under the curve is practically 80%. The area under the curve here can be interpreted as the percentage of cases in which the model correctly identifies the fake record, given a pair of one fake and one real data point.
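
The AUC figure can be reproduced directly from the holdout labels and predicted probabilities; a sketch, continuing the assumed variable names from above:

```python
# AUC on the holdout set. An AUC of ~0.8 means that, for a randomly drawn pair
# of one fake and one real record, the discriminator assigns the higher "fake"
# probability to the fake record about 80% of the time.
from sklearn.metrics import roc_auc_score

auc = roc_auc_score(holdout["is_fake"], holdout["prob_fake"])
print(f"Discriminator AUC: {auc:.3f}")
```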

[00:04:53] In 80% of the cases here it's getting it correct, which means that the discriminator has learned to pick up some signals that allow it to tell a fake record from a real record. Let's take a bit of a deeper look here to see if we can understand why the model is able to predict correctly. Let's sample the records that seem very fake to the model. These are records for which the predicted is-fake probability is close to one.
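
One way to pull out those records, continuing the same sketch, is simply to sort the holdout by the predicted fake probability; the same trick in the other direction gives the records discussed later that look very real to the model.

```python
# Records the discriminator is most confident are fake (prob_fake close to 1),
# and the ones it is most confident are real (prob_fake close to 0).
most_fake_looking = holdout.sort_values("prob_fake", ascending=False).head(10)
most_real_looking = holdout.sort_values("prob_fake", ascending=True).head(10)
print(most_fake_looking)
```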

[00:05:18] On first sight, these look pretty normal and it's maybe not intuitive to pick out a pattern, but if we look a little closer, we will notice that there is a mismatch between the education and the education_num columns. In the original dataset, there's a one-to-one relationship between education and education_num. Basically, they're both categorical variables and for each textual label in the education column, there's a matching education number.

[00:05:49] We see that all 10th-grade education category records have the number six assigned to them and no other numerical value. That's the same for all the other categories. If we look at the relationship between the education and education_num columns in the low-quality synthetic dataset, we see that this relationship hasn't been captured properly and that there's a big variance in the different values of the education_num column.
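
A quick way to verify this yourself, assuming the UCI Adult column names education and education_num used in the transcript:

```python
# In the real data each education label maps to exactly one education_num, so
# nunique() should be 1 for every label; in the low-quality synthetic data
# several labels map to many different numbers.
print(real.groupby("education")["education_num"].nunique())
print(fake.groupby("education")["education_num"].nunique())
```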

[00:06:16] We have not captured the one-to-one relationship here between these two columns. This is likely what's allowing our machine learning discriminator to recognize records as fake.

[00:06:27] Any record with the education value 10th, that doesn't have the corresponding education_num value of six, is going to be classified as likely a fake record.
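
A sketch of how such mismatches could be counted explicitly, again assuming the same column names as above:

```python
# Flag synthetic records whose (education, education_num) pair never occurs in
# the real data -- exactly the kind of mismatch the discriminator can latch onto.
real_pairs = set(zip(real["education"], real["education_num"]))
mask = [(e, n) not in real_pairs
        for e, n in zip(fake["education"], fake["education_num"])]
print(sum(mask), "synthetic records with an education / education_num pair unseen in the real data")
```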

[00:06:41] We can also take a look at some sample records that seem very real. Again, on first inspection, maybe not so obvious why the model is able to detect these as real with such confidence. If we look closer, we see that these are types of records that the synthesizer has failed to create. They occupy a part of the data space in which there are simply no synthetic records present. In this way, the model is able to say, "Okay, records that are in this part of the data space are most probably real because I've had no synthetic data records in that data space."

[00:07:20] All right, we've seen how the model performs on the low-quality synthetic data. There's definitely a signal here that the model's picking up and that's allowing it to correctly predict fake versus real in a considerable number of cases. Let's now see what happens if we use our synthetic dataset that is trained using MOSTLY AI's default high-quality settings and see what happens to the AUC score.

[00:07:47] We'll go back up here and we'll rerun all the cells we just did, except that we will upload the high-quality version of the dataset. It's important to note here again that we haven't done anything special to make this dataset high-quality. This is just the baseline of high-quality synthetic data that MOSTLY AI delivers. All right. Once this is uploaded, we can simply rerun the cells. We'll define our pre-processing and training functions. We'll concatenate the fake and the real data together. Again, take out the 20% holdout set, train the model and score it. For easy comparison, let's take a look at this graph again.

[00:08:27] We see that the AUC has dropped significantly. We're now getting an accuracy and an AUC much closer to 50%,

[00:08:34] which would mean that the model is effectively just flipping a coin. The AUC is not quite 50%, we're at around 60%, but it's getting close to being unable to distinguish whether records are fake or real.

[00:08:47] As I mentioned at the beginning of the tutorial, this is an interesting exercise in and of itself, but it's also a powerful tool in evaluating the quality of the synthetic data.

[00:08:56] We see that the quality of the synthetic data that we generated with MOSTLY AI's default settings is high enough to make it quite difficult for the model to tell it apart from real records.

[00:09:08] With that, we come to the end of this tutorial. There's of course a lot more you could do here. You could try out a different dataset.

[00:09:15] You could also try out a different kind of machine-learning model or even compare MOSTLY AI's synthesized data with that of other synthesizers.

[00:09:24] If you're interested in that kind of comparison between synthesizers, then I also recommend checking out our blog post on how to benchmark synthetic data generators where you can learn a lot about how this is done in state-of-the-art research papers.

[00:09:38] All right, thank you so much for your time. If you do work through this code and run into any questions or things that you'd love to share, please don't hesitate to reach out to us. We'd love to hear from you.

[00:09:48] Thank you for watching. See you in the next one.

Ready to start?

Sign up for free or contact our sales team to schedule a demo.