This tutorial demonstrates the Train-Synthetic-Test-Real (TSTR) method for evaluating the quality of synthetic data. The idea is to train Machine Learning models on synthetic data and test them on real holdout data: if they perform on par with models trained on the real data, that is a reliable measure of the synthetic data’s utility for downstream ML tasks.
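For orientation, here is a minimal TSTR sketch in Python using scikit-learn. It is an illustration only, not the Databricks AutoML workflow shown in the video: the file names, the `income` target column, and the `>50K` positive label are assumptions modeled on the UCI Adult census data.

```python
# Minimal TSTR sketch: fit the same model once on real training data
# and once on its synthetic counterpart, then score both on the same
# real holdout set. File names and column names are placeholders.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

train = pd.read_csv("census_train.csv")          # real training split
synthetic = pd.read_csv("census_synthetic.csv")  # synthetic version of that split
holdout = pd.read_csv("census_holdout.csv")      # real holdout, never used for training

TARGET, POSITIVE = "income", ">50K"  # assumed target column and positive class

def fit_and_score(train_df, test_df):
    X_train = pd.get_dummies(train_df.drop(columns=[TARGET]))
    X_test = pd.get_dummies(test_df.drop(columns=[TARGET]))
    # Align one-hot encoded columns so both frames share one feature space
    X_train, X_test = X_train.align(X_test, join="left", axis=1, fill_value=0)
    y_train = (train_df[TARGET] == POSITIVE).astype(int)
    y_test = (test_df[TARGET] == POSITIVE).astype(int)
    model = GradientBoostingClassifier().fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return model, X_test, auc

model_real, X_hold, auc_real = fit_and_score(train, holdout)   # Train-Real-Test-Real baseline
model_synth, _, auc_synth = fit_and_score(synthetic, holdout)  # Train-Synthetic-Test-Real
print(f"TRTR AUC: {auc_real:.3f} | TSTR AUC: {auc_synth:.3f}")
```

If the two AUC scores are close, the synthetic data has preserved the signal the model needs, which is exactly the claim TSTR tests.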

Here is what you'll learn:

[00:00 – 00:06] What is the TSTR Evaluation Method
[00:06 – 00:27] Importance of Evaluating Synthetic Data Quality
[00:27 – 00:36] Using MOSTLY AI & Databricks
[00:36 – 01:04] Understanding Data Organization on Databricks: Catalogs, Databases, Tables
[01:04 – 01:30] Preparing the Census Data: Creating Training and Holdout Sets
[01:30 – 01:51] Establishing Connectors for the Catalog Job in MOSTLY AI
[01:51 – 02:07] Setting up the Census Training Data SD Job
[02:07 – 02:43] Running the Job and Reviewing the Output Data
[02:43 – 03:22] Overview of Machine Learning Tools in Databricks
[03:22 – 03:52] Using Databricks for AutoML Experimentation
[03:52 – 04:25] Creating and Running AutoML Experiments
[04:25 – 04:59] Analyzing Feature Importance with SHAP Values (see the sketch below this list)
[04:59 – 05:42] Comparing Results Between Original and Synthetic Data
[05:42 – 06:14] Utilizing Databricks’ Model Registry
[06:38 – 07:14] Evaluating the Predictions on the Holdout Data
[07:14 – 07:53] Why Use Synthetic Data for ML Training
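The feature-importance comparison at [04:25] can also be reproduced outside Databricks. A rough sketch, assuming the open-source shap package and reusing the hypothetical `model_real`, `model_synth`, and `X_hold` objects from the TSTR sketch above:

```python
# Compare mean absolute SHAP values per feature for the model trained
# on real data vs. the one trained on synthetic data. Similar rankings
# suggest the synthetic data preserved the original feature relationships.
import numpy as np
import shap

def mean_abs_shap(model, X):
    explainer = shap.TreeExplainer(model)  # tree models get exact SHAP values
    return np.abs(explainer.shap_values(X)).mean(axis=0)

imp_real = mean_abs_shap(model_real, X_hold)
imp_synth = mean_abs_shap(model_synth, X_hold)

# Print the ten most important features of the real-data model side by side
for name, r, s in sorted(zip(X_hold.columns, imp_real, imp_synth),
                         key=lambda t: -t[1])[:10]:
    print(f"{name:30s} real={r:.4f}  synthetic={s:.4f}")
```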

Run the experiment yourself on MOSTLY AI's synthetic data platform:
https://bit.ly/43IGYSv

Subscribe to our channel: https://bit.ly/3ZTtV0A
Follow us on LinkedIn: https://www.linkedin.com/company/mostlyai/
Visit our website: https://mostly.ai/