[00:00:01] In this short video, we're going to talk about the most important aspect of production-ready synthetic data, and that's accuracy.
[00:00:09] MOSTLY AI offers the greatest accuracy of any synthetic data platform on the market today. As part of the training and generation process, MOSTLY AI provides users with a QA report that details how accurately the model reproduces the distributions, relationships, and dataset properties of the original data, while ensuring that the synthetic data itself is privacy-safe and ready for wider use.
[00:00:39] Our headline accuracy number gives an overall sense of how closely the synthetic data represents the source, and it is derived from accuracy calculations for each of the attributes in the dataset. Let's take a look at these now.
[00:00:55] For each attribute, we compare the original distribution with its synthetic counterpart. Data is collected into discrete bins based on the most common categories or on value ranges, depending on the attribute's data type.
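The binning step described here can be sketched as follows. This is a minimal illustration, not MOSTLY AI's implementation: the video does not give the exact rules, so the function names, the `top_n` and `n_bins` parameters, the `_other_` label, and the choice of equal-width ranges are all assumptions.

```python
from bisect import bisect_right
from collections import Counter


def categorical_bins(values, top_n=2):
    """Keep the top_n most common categories; group the rest as '_other_' (assumed label)."""
    keep = {cat for cat, _ in Counter(values).most_common(top_n)}
    return [v if v in keep else "_other_" for v in values]


def numeric_bin_edges(values, n_bins=4):
    """Interior edges of n_bins equal-width ranges (an assumed binning rule)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def numeric_bins(values, edges):
    """Assign each value the index of the range it falls into."""
    return [bisect_right(edges, v) for v in values]


countries = ["DE", "DE", "AT", "US", "AT", "DE", "CH"]
ages = [18, 22, 35, 41, 58, 60, 77, 98]

# Edges are always derived from the original data, then reused for the synthetic data,
# so both datasets are counted in the same bins.
edges = numeric_bin_edges(ages)

print(categorical_bins(countries))  # rare categories 'US' and 'CH' fall into '_other_'
print(edges)                        # [38.0, 58.0, 78.0]
print(numeric_bins(ages, edges))    # [0, 0, 0, 1, 2, 2, 2, 3]
```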
[00:01:12] Accuracy is a statistical measure based on the distance between the synthetic and original data: the smaller the distance, the higher the accuracy. Over the full dataset, a total distance can be calculated that shows the overall deviation of the synthetic records from the source, and the resulting accuracy is reported as a percentage.
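One common way to turn a distance between two binned distributions into a percentage is to take one minus the total variation distance (half the sum of absolute differences between bin frequencies). The sketch below assumes that formula for illustration; the video does not state which distance is used, and `univariate_accuracy` is a hypothetical helper name.

```python
from collections import Counter


def univariate_accuracy(original_bins, synthetic_bins):
    """Return 1 - total variation distance between two lists of bin labels."""
    p = Counter(original_bins)
    q = Counter(synthetic_bins)
    n_p, n_q = len(original_bins), len(synthetic_bins)
    # Counter returns 0 for bins that appear in only one of the two datasets.
    tvd = 0.5 * sum(abs(p[b] / n_p - q[b] / n_q) for b in set(p) | set(q))
    return 1.0 - tvd


original = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
synthetic = ["a"] * 45 + ["b"] * 35 + ["c"] * 20

print(f"{univariate_accuracy(original, synthetic):.1%}")  # prints 95.0%
```

With this definition, identical bin frequencies give 100%, and distributions with no overlapping bins give 0%.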
[00:01:29] The closer to 100%, the more accurate the representation. Accuracy is calculated for individual columns (univariate distributions) and for pairs of columns (bivariate distributions). For sequential or time-series data, an additional measure called coherence is used.
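To see why bivariate accuracy is checked in addition to univariate accuracy, consider the sketch below. It again assumes accuracy is one minus the total variation distance, applied to pairs of bin labels; the helper names are hypothetical, and coherence for sequential data is not covered because the video does not define it.

```python
from collections import Counter


def accuracy(original_labels, synthetic_labels):
    """1 - total variation distance between two lists of hashable bin labels."""
    p = Counter(original_labels)
    q = Counter(synthetic_labels)
    n_p, n_q = len(original_labels), len(synthetic_labels)
    tvd = 0.5 * sum(abs(p[b] / n_p - q[b] / n_q) for b in set(p) | set(q))
    return 1.0 - tvd


def bivariate_accuracy(orig_a, orig_b, syn_a, syn_b):
    """Compare the joint distribution of two columns by binning on value pairs."""
    return accuracy(list(zip(orig_a, orig_b)), list(zip(syn_a, syn_b)))


# Original: the two columns are independent (all four combinations occur equally).
orig_a, orig_b = ["x", "x", "y", "y"], ["1", "2", "1", "2"]
# Synthetic: same column-wise frequencies, but the columns are perfectly correlated.
syn_a, syn_b = ["x", "x", "y", "y"], ["1", "1", "2", "2"]

print(accuracy(orig_a, syn_a))                             # 1.0
print(accuracy(orig_b, syn_b))                             # 1.0
print(bivariate_accuracy(orig_a, orig_b, syn_a, syn_b))    # 0.5
```

Both columns match perfectly on their own, yet the relationship between them is only half reproduced, which is what the bivariate check is meant to surface.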
[00:01:49] In the QA report, major differences between the original and synthetic datasets are shown with strong colors. These can often be a signal that further model configuration is needed, for example to check for rare categories in the data, or to check that a column has been correctly encoded.
[00:02:07] MOSTLY AI provides a QA report for the model itself, with information about the accuracy of the AI training process, and a data QA report that describes the accuracy of the synthetic data that has been generated using the trained model.
[00:02:22] With the MOSTLY AI QA report, we get our first glimpse at the overall accuracy of the process, but of course, the real evaluation happens when the generated data is compared with the original in downstream tasks.
[00:02:36] For synthetic data to be valuable, it needs to be highly representative of the original data. MOSTLY AI uses state-of-the-art, proprietary algorithms to create synthetic data of the highest accuracy in the industry.
[00:02:51] Our synthetic data is so realistic that it acts as a drop-in replacement for original data in any situation. Visit mostly.ai to sign up for a free account, start creating your own synthetic data, and check the accuracy of the generated data yourself. This is the first step in your synthetic data journey.