October 12, 2023
12m 58s

Synthetic Data Quality Evaluation Tutorial: Train-Synthetic-Test-Real Using AutoML on Databricks

This tutorial demonstrates the Train-Synthetic-Test-Real (TSTR) evaluation method to assess the quality of synthetic data. This approach ensures that Machine Learning models trained on synthetic data perform effectively, providing a reliable measure of the data’s utility for downstream ML tasks.

Here is what you'll learn:

[00:00 – 00:06] What is the TSTR Evaluation Method
[00:06 – 00:27] Importance of Evaluating Synthetic Data Quality
[00:27 – 00:36] Using MOSTLY AI & Databricks
[00:36 – 01:04] Understanding Data Organization on Databricks: Catalogs, Databases, Tables
[01:04 – 01:30] Preparing the Census Data: Creating Training and Holdout Sets
[01:30 – 01:51] Establishing Connectors for the Catalog Job in MOSTLY AI
[01:51 – 02:07] Setting up the Census Training Data SD Job
[02:07 – 02:43] Running the Job and Reviewing the Output Data
[02:43 – 03:22] Overview of Machine Learning Tools in Databricks
[03:22 – 03:52] Using Databricks for AutoML Experimentation
[03:52 – 04:25] Creating and Running AutoML Experiments
[04:25 – 04:59] Analyzing Feature Importance with SHAP Values
[04:59 – 05:42] Comparing Results Between Original and Synthetic Data
[05:42 – 06:14] Utilizing Databricks’ Model Registry
[06:38 – 07:14] Evaluating the Predictions on the Holdout Data
[07:14 – 07:53] Why Use Synthetic Data for ML Training

Run the experiment yourself on MOSTLY AI's synthetic data platform:
https://bit.ly/43IGYSv

Subscribe to our channel: https://bit.ly/3ZTtV0A
Follow us on LinkedIn: https://www.linkedin.com/company/mostlyai/
Visit our website: https://mostly.ai/

Transcript

[00:00:00] In this video, I'm going to demonstrate the process of evaluating the quality of synthetic data based on its utility for a downstream machine learning task. Now this method is commonly referred to as the Train-Synthetic-Test-Real evaluation, and it serves as a robust measure of synthetic data quality because ML models rely on the accurate representation of deeper underlying patterns to perform effectively on previously unseen data.

[00:00:27] As a result, this approach offers a much more reliable assessment than simply evaluating higher-level statistics. Now for this demo, we're going to be using the MOSTLY AI platform for the synthetic data piece, as well as Databricks for the machine learning and the data source and output piece.
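
To make the TSTR idea concrete for readers following along, here is a minimal, self-contained sketch in Python using scikit-learn. It is not the code used in this demo; the helper assumes the label column is called income and that the features are already numeric or encoded.

```python
# Illustrative TSTR sketch (not the demo code). Assumes numeric/encoded features
# and a label column named "income".
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

def tstr(train_real: pd.DataFrame, train_synth: pd.DataFrame,
         holdout: pd.DataFrame, target: str = "income") -> dict:
    """Train one model on real data and one on synthetic data,
    then score both on the same real holdout set."""
    scores = {}
    for name, train_df in [("real", train_real), ("synthetic", train_synth)]:
        X, y = train_df.drop(columns=[target]), train_df[target]
        model = GradientBoostingClassifier().fit(X, y)
        preds = model.predict(holdout.drop(columns=[target]))
        scores[name] = accuracy_score(holdout[target], preds)
    # A small gap between the two scores indicates high-utility synthetic data.
    return scores
```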

[00:00:46] A few things to note that we did before the demo: I created some datasets in Databricks. All I did was create a catalog in Databricks with various schemas and various tables within those. Now Databricks operates with a three-level namespace where you have catalogs, within catalogs you have databases, and within databases you have tables.

[00:01:11] The data that we're focusing on is in this default database, and it is this census data where we have a training and a holdout set. Now the whole set itself was about 50,000-ish rows, and what we did was cut it 80/20: 80% for the training set and 20% for the holdout, for later testing in this demo.
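
For readers who want to reproduce this preparation step, a rough sketch in a Databricks notebook could look like the following. The catalog and table names follow the ones used later in the demo, but the exact source table name is an assumption.

```python
# Read the census table via Databricks' three-level namespace: catalog.database.table
census_df = spark.table("jbcatalog.default.census_data")   # source table name assumed

# Split roughly 80/20 into a training set and a holdout set
train_df, holdout_df = census_df.randomSplit([0.8, 0.2], seed=42)

# Persist both back into the default database for the rest of the demo
train_df.write.saveAsTable("jbcatalog.default.census_data_training")
holdout_df.write.saveAsTable("jbcatalog.default.census_data_holdout")
```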

[00:01:33] The next thing, in MOSTLY AI we established two separate connectors for the catalog job. We have a databricksDefault and a databricksSynthOutput. Now these connectors tie right into this default database as well as this synthoutput database, which is where the data will actually be pushed to when the job completes.

[00:01:55] Now in the catalogs, I created this Census Training Data SD Job, and we just used

[00:02:02] some of the default settings here for the demo.

[00:02:04] We used Turbo just to make the speed a little bit quicker. Then for the output settings,

[00:02:08] we set the destination to that synth output table,

[00:02:15] or synth output database I should say, within Databricks. Once we actually ran the job

[00:02:20] and went into the most recent job run to look at the data, we can see that the job ran successfully:

[00:02:30] about 39,000 rows, which makes sense because it's about 80% of the total as we mentioned, plus all the various columns. If we scroll down we can see the accuracy,

[00:02:40] all of the QA reports, the privacy metrics, and the correlations, both univariate and bivariate, and how the data looks. From the logs and everything, this job ran successfully.

[00:02:53] When we go into Databricks, since we provided the destination database to be this synthoutput, we can see that this table called census_data_training_synth was pushed out.

[00:03:05] If we open this, we'll see the same data from a sample that we were just looking at in the MOSTLY AI platform. Now we have an original training set as well as a synthetic training set.
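
A quick, hedged way to verify this from a Databricks notebook is shown below; the catalog name is assumed to be the same jbcatalog used elsewhere in the demo.

```python
# Check the synthetic table that MOSTLY AI pushed into the synthoutput database
synth_df = spark.table("jbcatalog.synthoutput.census_data_training_synth")
print(synth_df.count())   # expect roughly 39,000 rows, about 80% of the full set
synth_df.show(5)          # compare a few rows against the sample shown in MOSTLY AI
```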

[00:03:18] Databricks offers a suite of different ML tools that are super helpful for running out-of-the-box machine learning.

[00:03:29] If we go to our experiments tab, I'm just going to open this in a new tab here. There's an option in Databricks called Create AutoML Experiment.

[00:03:39] Now, I did this already, so I'll be showing the results set here shortly because it does take a little bit of time to run both of these. Just to show what this looks like, all you do is you create a cluster.

[00:03:48] I have this one established before the demo started and running. This is going to be a classification machine learning problem.

[00:03:57] What we did was select our data, so jbcatalog, census_data_training.

[00:04:04] I select the prediction target as income, because what we're predicting is whether income is greater than or equal to 50,000 or less than that.

[00:04:15] Then in some advanced configurations, we just have different evaluation metrics we can choose from, different training frameworks. For this, I believe we did area under the curve.

[00:04:25] I did 15 minutes for this demo just because it's a little bit smaller of a dataset and I wanted to be able to iterate quickly and produce results.

[00:04:36] I ran these for both the original data as well as the synthetic data.
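
The video drives this through the Databricks UI; the roughly equivalent call with the Databricks AutoML Python API is sketched below. The parameters mirror the settings mentioned above, but treat the exact code as an assumption rather than the demo's own.

```python
from databricks import automl

# One AutoML classification run per training set, same target and budget for both
for table in ["jbcatalog.default.census_data_training",
              "jbcatalog.synthoutput.census_data_training_synth"]:
    summary = automl.classify(
        dataset=spark.table(table),
        target_col="income",        # predicting income >=50K vs <50K
        primary_metric="roc_auc",   # area under the ROC curve, as in the demo
        timeout_minutes=15,         # short budget for a small dataset
    )
    print(table, summary.best_trial.metrics)
```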

[00:04:43] Now going into the experiments here, if we look over, we'll see that this is the output of the framework or the output of the AutoML for the original data.

[00:04:58] You can see that the AutoML experiment ran successfully. There are a lot of different metrics here that we logged, and you can see that the highest ROC-AUC score was this LightGBM at about 0.924.

[00:05:15] It ran some other ones like logistic regression, random forest, decision trees, et cetera.

[00:05:21] If we look at the same thing for the synthetic data, we'll see a very similar setup. The top one was actually also a LightGBM classifier, at about 0.915.

[00:05:35] A few things I want to show before going into too much detail here: if we click into these models, one thing that Databricks does out of the box is log a lot of these artifacts, whether it's the environment you need for it to run the same model, the Python environment, the model itself. It's also able to log various HTML and PNG files.

[00:06:05] We'll actually see this in the notebook shortly, but I think it's just a cool way to be able to compare pretty quickly a lot of different metrics that you want to bring out when you actually run your models.
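
Under the hood this artifact logging is standard MLflow; a minimal sketch is shown below. The metric value, file name, and model are placeholders for illustration, not outputs from this demo.

```python
import mlflow
from sklearn.linear_model import LogisticRegression

# Stand-in model purely for illustration
model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])

with mlflow.start_run(run_name="census-income-demo"):
    mlflow.log_metric("val_roc_auc", 0.90)        # placeholder metric value
    mlflow.sklearn.log_model(model, "model")      # logs the model plus its Python environment
    # Any HTML/PNG file on disk can be attached to the run as an artifact, e.g.:
    # mlflow.log_artifact("shap_summary.png")
```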

[00:06:16] Going back to experiments, the next thing I want to show is that it produces a notebook for the best model. Why I like that is it gives you this glass box approach, as we call it. What I'm showing right here are two notebooks, one for the original and one for the synthetic.

[00:06:41] This was all auto-generated by AutoML. All of those files that we were just showing, all of the pictures and everything, are an output of this code. I just really like its ability to create a foundation for the experiment itself.

[00:07:00] You can actually go through and see how it handled the various types of columns for the classification problem, how it did the train-test split, and all of these different metrics that it's actually logging in the MLflow tracking server, which is what I was just showing before this.

[00:07:19] What I also think is important here for purposes of this demo is some of the feature importances, because it actually has SHAP values embedded in the notebooks.

[00:07:31] In the original data, you could see the ones that bubble up as most important for the label. If we look at the synthetic data and do the same thing, same notebook here, but just going to the SHAP values specifically,

[00:07:47] you could see relationship, education number, capital gain, marital status as the top four. Here you could see marital status, capital gain, education number, age, relationship. Pretty similar, which would make sense because the synthetic data is pretty representative of the original data, and they were both the inputs to a LightGBM classifier, which you can see up here based on the title of the notebook.
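
The SHAP summaries in these notebooks come from the shap library; a minimal standalone sketch follows. The dataset and model here are placeholders, not the census data or the AutoML-selected LightGBM.

```python
import shap
from sklearn.datasets import load_breast_cancer           # placeholder data, not the census set
from sklearn.ensemble import GradientBoostingClassifier   # stands in for the LightGBM model

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = GradientBoostingClassifier().fit(X, y)

# TreeExplainer supports gradient-boosted trees such as LightGBM
X_sample = X.sample(200, random_state=0)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_sample)

# The summary plot ranks features by mean |SHAP value|, i.e. overall importance
shap.summary_plot(shap_values, X_sample)
```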

[00:08:15] Just pretty cool. That's the first step here.

[00:08:18] Now the next step, to actually see the validity and reliability of the data, is to test against data that neither of the models has seen yet. That's where this holdout data comes into play.

[00:08:33] To do this, if we go into Databricks and we just went to any of these models, so I'll just click on this one, we have this option here to register the model. When you register the model, it goes to a central repository for a model registry where it's available to everyone that has access.

[00:08:53] So if I go into models, you'll see that I did this already for the Census Original vF and then the Census Synthetic vF. When we click into these models, we have this option here to use this for inference. When we click this, and we click batch inference, what we can do is we can use a version.

[00:09:16] Just as a quick sidebar, if models are enhanced or a new version comes out and one goes to production, one goes to staging, the model registry is just a really good repository for that in Databricks. That's all we're looking at here. Since I only have one version of the model, I would obviously click that.
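
The registration here is done through the UI; the roughly equivalent MLflow calls are sketched below. The run ID is a placeholder, and the model names match the ones registered in the demo.

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register the chosen AutoML run's model under a name in the central Model Registry.
# "<run_id>" is a placeholder for the MLflow run ID of the best model.
mlflow.register_model("runs:/<run_id>/model", "Census Original vF")

# Later versions of the same registered name can be promoted between stages
client = MlflowClient()
client.transition_model_version_stage(
    name="Census Original vF", version=1, stage="Staging"
)
```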

[00:09:36] Then what I'm doing for the input table now is going into the default here and using this census_data_holdout, so that 20% that neither of them have seen yet. I just provided an output table location. This is just a default.

[00:09:55] When you use this for batch inference, it produces a notebook that looks like this. There are going to be two here: one for the original and one for the synthetic. If we scroll down, you can see again, these are just notebooks that were produced by Databricks. I didn't create this from scratch. I made some tweaks to it, but it's about 95% just produced

[00:10:22] from Databricks. It's just creating some path inputs and outputs here. What we see is it's just defining some of these variables to make it more modularized for the rest of the notebook.

[00:10:35] What it does is load that specific model and specific version in, and then predict against the holdout set. This is really the best way now, as it takes out any bias from anything that was put into the model, to ensure that the predictions are still pretty close or on par between the original and the synthetic data.
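
Stripped of boilerplate, the core of that auto-generated batch-inference notebook boils down to something like the sketch below; the table and model names follow the demo, but the exact code is an assumption.

```python
import mlflow.pyfunc
from pyspark.sql.functions import struct

# Wrap a specific registered model version as a Spark UDF for batch scoring
predict_udf = mlflow.pyfunc.spark_udf(spark, "models:/Census Original vF/1")

# Score the 20% holdout that neither model saw during training
holdout_df = spark.table("jbcatalog.default.census_data_holdout")
scored_df = holdout_df.withColumn(
    "prediction",
    predict_udf(struct(*[c for c in holdout_df.columns if c != "income"])),
)
```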

[00:11:00] If you look at this output display, what we see here, if we scroll over, is the original data, and now we have this additional column for the actual prediction that was created. Then what I did was add a few extra cells in the notebook to basically take all the predictions, count the ones that were correct, and divide them by the total.

[00:11:24] What we see in this original data is about 86.7% correct predictions. If we go into the synthetic data, where we're doing the same thing, and scroll towards the bottom of the same notebook, we see that the prediction accuracy here is about 85.6%,

[00:11:44] so just about a 1% difference. You can see that the synthetic data is pretty on par and pretty accurate compared to the original data, and now that we've removed all bias, we really do have a reliable test of the original versus the synthetic data.
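
The accuracy comparison amounts to a simple aggregation over the scored holdout table; a hedged sketch is shown below, continuing from the scored_df of the previous snippet and assuming the label column is named income.

```python
from pyspark.sql import functions as F

# Fraction of holdout rows where the prediction matches the true income label
accuracy = (
    scored_df
    .withColumn("correct", (F.col("prediction") == F.col("income")).cast("int"))
    .agg(F.avg("correct").alias("accuracy"))
    .first()["accuracy"]
)
print(f"Holdout accuracy: {accuracy:.1%}")   # ~86.7% (real) vs ~85.6% (synthetic) in the demo
```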

[00:12:02] A lot to unpack here, but this is a way to show that the synthetic data produced with our generative AI is very high quality and on par with the original data, which is shown in various metrics here, the predictions obviously being the most blatant. You can also factor in the feature importances being pretty similar, and the models that were actually produced

[00:12:30] and bubbled up to the top being pretty similar. All to say that our synthetic data is very accurate,

[00:12:36] very representative for all types of downstream tasks, whether it's analytics, machine learning models,

[00:12:43] or various other use cases, while in the end still maintaining that privacy-preserving nature, which is the true value proposition, making

[00:12:51] synthetic data better than the real data.

[00:12:54] Thank you.
