December 11, 2023
3m 21s

MOSTLY AI: Experience unparalleled synthetic data accuracy

In this video, you'll learn about the most important aspect of production-ready synthetic data - accuracy.

Accuracy is a statistical measure of the distance between synthetic and original data, giving you an overall sense of how closely the synthetic data represents the source.

Nick will teach you how to use the MOSTLY AI platform to judge accuracy, ensuring the synthetic data is privacy-safe and ready for wider use.

Here's what you will learn:
➡️ 00:00 - Introduction to Accuracy
➡️ 00:45 - Accuracy Metrics Overview
➡️ 01:20 - QA Report and Major Differences
➡️ 01:50 - Model Configuration Signals
➡️ 02:10 - Importance of Data Representation
➡️ 02:30 - MOSTLY AI's Advanced Algorithms
➡️ 02:45 - Realistic Synthetic Data

🔗 Get started by registering for a free account on MOSTLY AI's synthetic data platform: https://tinyurl.com/ymen9zz7

🖇️ Learn more about optimizing your training sample for synthetic data accuracy from this blogpost: https://tinyurl.com/tmt6wwyt

Transcript

[00:00:01] In this short video, we're going to talk about the most important aspect of production-ready synthetic data, and that's accuracy.

[00:00:09] MOSTLY AI offers the greatest accuracy of any synthetic data platform on the market today. As part of the training and generation process, MOSTLY AI provides users with a QA report that details just how accurately it can model and reproduce distributions, relationships, and dataset properties that closely follow the original data, while ensuring that the synthetic data itself is privacy-safe and ready for wider use.

[00:00:39] Our headline accuracy number gives an overall sense of how closely the synthetic data represents the source, and this is fed from accuracy calculations across each of the attributes in the dataset. Let's take a look at these now.

[00:00:55] For each attribute, we check accuracies between the original distribution and its synthetic counterpart. Data is collected into discrete bins based on the most common categories or distribution ranges, depending on the attribute's data type.

[00:01:12] Accuracy is a statistical term that measures the distance between the synthetic and original data. Over the full dataset, a total distance can be calculated that shows the overall variation of the synthetic records from the source, and this is reported as a percentage.
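
To make the idea of a binned distance reported as a percentage concrete, here is a minimal sketch for one categorical attribute. The video does not spell out the exact formula, so the use of total variation distance, the `_other_` bin for rare categories, and all function names are assumptions made for illustration, not MOSTLY AI's implementation.

```python
from collections import Counter


def top_categories(original, k=10):
    """Return the k most common categories in the original column."""
    return {cat for cat, _ in Counter(original).most_common(k)}


def binned_shares(values, keep):
    """Share of records per bin; categories outside `keep` share one '_other_' bin."""
    counts = Counter(v if v in keep else "_other_" for v in values)
    total = len(values)
    return {b: c / total for b, c in counts.items()}


def univariate_accuracy(original, synthetic, k=10):
    """Assumed metric: 100 * (1 - total variation distance) over the bins."""
    keep = top_categories(original, k)
    p = binned_shares(original, keep)
    q = binned_shares(synthetic, keep)
    tvd = 0.5 * sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in set(p) | set(q))
    return 100.0 * (1.0 - tvd)


# Toy columns, invented for this example.
original = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
synthetic = ["a"] * 45 + ["b"] * 35 + ["c"] * 20
print(round(univariate_accuracy(original, synthetic), 1))  # 95.0
```

In this toy case, 5% of the records sit in a different bin than in the original, so the sketch reports 95%. A numeric attribute would be handled the same way after mapping values to range bins.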

[00:01:29] The closer to 100%, the more accurate the representation. Accuracy is calculated for individual or univariate distributions, pairs of variables in bivariate distributions, and sequential or time series data using an additional measure called coherence.
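
A short sketch can show why bivariate accuracy is checked in addition to univariate accuracy. It compares the joint distribution of two columns, again assuming total variation distance as the distance measure. The function names and toy data are hypothetical, and the coherence measure for sequential data is not covered here because the video does not define it.

```python
from collections import Counter


def joint_shares(col_a, col_b):
    """Share of records in each (a, b) cell of the contingency table."""
    counts = Counter(zip(col_a, col_b))
    total = len(col_a)
    return {cell: c / total for cell, c in counts.items()}


def bivariate_accuracy(orig_a, orig_b, syn_a, syn_b):
    """Assumed metric: 100 * (1 - total variation distance) over joint cells."""
    p = joint_shares(orig_a, orig_b)
    q = joint_shares(syn_a, syn_b)
    tvd = 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in set(p) | set(q))
    return 100.0 * (1.0 - tvd)


# Toy data: each column taken alone has identical shares in both datasets,
# but the original columns move together and the synthetic ones do not.
orig_a, orig_b = [0, 0, 1, 1], [0, 0, 1, 1]
syn_a, syn_b = [0, 0, 1, 1], [0, 1, 0, 1]
print(bivariate_accuracy(orig_a, orig_b, syn_a, syn_b))  # 50.0
```

Each column on its own would score 100% here, yet the pair scores only 50% because the relationship between the two columns was lost.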

[00:01:49] In the QA report, major differences between the original and synthetic datasets are shown with strong colors, and can often be a signal that further model configuration is needed, perhaps to check for rare categories in the data, or to check that a column has been correctly encoded.

[00:02:07] MOSTLY AI provides a QA report for the model itself, with information about the accuracy of the AI training process, and a data QA report that describes the accuracy of the synthetic data that has been generated using the trained model.

[00:02:22] With the MOSTLY AI QA report, we get our first glimpse at the overall accuracy of the process, but of course, the real evaluation happens when comparing the generated data in downstream tasks.
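
One common way to run such a downstream comparison is to train the same model once on original data and once on synthetic data, then score both on a holdout of real records. The sketch below assumes this setup with a tiny 1-nearest-neighbor classifier. The data, names, and choice of model are invented for illustration and are not taken from the video.

```python
def nearest_neighbor_predict(train_X, train_y, x):
    """Predict the label of x as the label of the closest training point (1-NN)."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_X]
    return train_y[dists.index(min(dists))]


def holdout_score(train_X, train_y, test_X, test_y):
    """Fraction of real holdout records classified correctly."""
    hits = sum(
        nearest_neighbor_predict(train_X, train_y, x) == y
        for x, y in zip(test_X, test_y)
    )
    return hits / len(test_y)


# Toy data, invented for this example.
real_X, real_y = [[1.0], [2.0], [8.0], [9.0]], [0, 0, 1, 1]
synthetic_X, synthetic_y = [[1.5], [2.5], [7.5], [8.5]], [0, 0, 1, 1]
holdout_X, holdout_y = [[0.5], [3.0], [7.0], [9.5]], [0, 0, 1, 1]

score_real = holdout_score(real_X, real_y, holdout_X, holdout_y)
score_synthetic = holdout_score(synthetic_X, synthetic_y, holdout_X, holdout_y)
print(score_real, score_synthetic)
```

If the two scores are close, the synthetic data preserved the signal this particular task depends on. A large gap suggests revisiting the model configuration.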

[00:02:36] For synthetic data to be valuable, it needs to be highly representative of the original data. MOSTLY AI uses state-of-the-art, proprietary algorithms to create synthetic data of the highest accuracy in the industry.

[00:02:51] Our synthetic data is so realistic that it acts as a drop-in replacement for original data in any situation. Visit mostly.ai to sign up for a free account to get started with creating your own synthetic data, and checking the accuracy of the generated data yourself. This is the first step in your synthetic data journey.

Ready to try synthetic data generation?

The best way to learn about synthetic data is to experiment with synthetic data generation. Try it for free or get in touch with our sales team for a demo.