November 23, 2023
9m 15s

Synthetic Data Generation with MOSTLY AI

Join Nick from MOSTLY AI as he demos MOSTLY AI's synthetic data platform and shows how to create statistically representative synthetic data. The tutorial covers synthetic data generation and the features that matter for machine learning tasks: privacy-secure, highly accurate synthetic datasets, plus advanced options such as data upsampling.

In this video, you'll discover:

➡️ How to utilize the MOSTLY AI platform to generate synthetic data from sample datasets.
➡️ The process of configuring data settings and generating specific volumes of synthetic data.
➡️ A walkthrough of MOSTLY AI's QA report for evaluating the accuracy and privacy metrics of synthetic data.
➡️ Strategies for rebalancing datasets to enhance machine learning models and avoid bias.
➡️ The benefits of using synthetic data for data science teams, including privacy-safe augmentation, upsampling, data imputation, and data experimentation.

Whether you're a data science professional or enthusiast, this video provides a deep understanding of how to leverage synthetic data in your projects. Get ready to explore the intricacies of synthetic data modeling and how it can revolutionize your approach to data analysis and machine learning.

🔗 Start your journey with MOSTLY AI today: https://bit.ly/3M8Lhkb

Key moments:

0:00: Introduction to the MOSTLY AI Synthetic Data Platform
0:07: Generating Synthetic Data
1:03: Training Settings for Synthetic Data Generation
2:19: Data Configuration and Generation
3:06: Evaluating Synthetic Data Accuracy and Privacy
4:08: Rebalancing Datasets
5:54: Practical Applications for Data Science Teams

Transcript

[00:00:00] Hi, synthetic data users. I'm Nick from MOSTLY AI, and welcome to the MOSTLY AI platform demo. We're going to explore how to create statistically representative synthetic data that can be used for data collaboration, traditional analysis, and many different machine learning tasks. With MOSTLY AI, we can generate synthetic data that's both privacy secure, as well as highly accurate when compared back to original sources. Let's get started.

[00:00:29] In this demo, we have a classic machine learning dataset that contains adult census records. We're working with 12 columns of data, and around 50,000 rows of data, and we include this as one of our sample datasets when you sign up for free on the platform. Once the data is selected, we can tell MOSTLY to learn the underlying patterns, relationships, and statistical properties of this source data using artificial intelligence. We can train our model for accuracy, optimize it for training speed, or even use a turbo mode for rapid iteration as we learn.

[00:01:08] We can also configure how much of the original dataset should be used for training, and how many cycles of training we need to complete our work. Now, we can leave these settings as default for now. If we take a closer look at our data settings, we'll see that we're working with columns like age, education, occupation, and income as attributes for this subset of the US census.

[00:01:33] Each column can be configured individually, but the platform has made some smart choices for us to get started. We can also configure how many records we'd like to generate from our synthetic data model. Now, if we leave this field blank, MOSTLY AI will generate the same volume of data that it finds in our source, but we can specify exactly what we need. For example, let's generate 1,000 records of synthetic census data to get us started.

[00:02:04] Directly from this screen, we can literally kick off our synthetic data modeling and generation process with a single click of a button. We get regular feedback from the platform as the AI model completes its process. When the job is finished, we can take a look at what's been synthesized.

[00:02:24] Okay. So, our job has completed. Let's click into the synthetic dataset and see how MOSTLY AI presents us with its results. We can see that the platform has generated exactly 1,000 synthetic data records that follow the patterns in our underlying data really closely. Let's explore this in more detail by taking a look at a preview of some of those records in the table here.

[00:02:51] A quick check, and this synthetic data certainly passes a simple visual inspection with census data that closely resembles the original source. How can we trust and quantify the quality of this synthetic dataset?

[00:03:01] With MOSTLY AI's QA report, we can explore detailed accuracy metrics that capture how well the synthetic data follows the distributions of the original dataset. Here, we're seeing an accuracy score of close to 100% against the source data, indicating that our model has picked up on almost all the patterns and statistical properties from the real census data, and reproduced them here.
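To make the idea of an accuracy score concrete, here is a minimal sketch of one common way to compare a single categorical column between two datasets: one minus the total variation distance between the category shares. This is an assumption chosen for illustration, not necessarily how the MOSTLY AI QA report computes its score; the function names and the toy columns are hypothetical.

```python
from collections import Counter


def category_shares(values):
    """Return each category's share of the column as a dict."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def univariate_accuracy(original, synthetic):
    """One minus the total variation distance between two categorical columns.

    Illustrative metric only (an assumption, not the platform's formula).
    """
    p = category_shares(original)
    q = category_shares(synthetic)
    categories = set(p) | set(q)
    tvd = 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)
    return 1.0 - tvd


# Toy columns invented for illustration (not taken from the demo dataset).
original_sex = ["Male"] * 67 + ["Female"] * 33
synthetic_sex = ["Male"] * 65 + ["Female"] * 35

print(round(univariate_accuracy(original_sex, synthetic_sex), 2))  # prints 0.98
```

A score of 1.0 would mean the two columns have identical category shares; the further the shares drift apart, the lower the score.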

[00:03:34] We also care greatly about privacy, ensuring that our model isn't producing synthetic data that's a little too close to the original, perhaps inadvertently leaking some of the original data into this synthetic dataset. To check this, we look for privacy metrics in our synthetic data that closely match those found in the original dataset, to build confidence that we can use this data while also protecting the privacy of the source.

[00:03:57] Our QA reports share interactive correlation matrices for the model and the generated datasets that explore correlations between attributes in both, with a third matrix that highlights any significant differences between the two.

[00:04:20] In this case, minor differences only, as the synthetic model has done a great job of closely following the original.
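The three-matrix view can be sketched roughly as follows: compute a correlation matrix for the original data, another for the synthetic data, and then the absolute difference between them. This sketch uses Pearson correlation on numeric columns only, which is an assumption; the QA report also covers categorical attributes and may use a different measure. The `hours_per_week` column and all values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(table):
    """table maps column name -> list of numbers; returns a nested dict."""
    names = list(table)
    return {a: {b: pearson(table[a], table[b]) for b in names} for a in names}


def difference_matrix(original, synthetic):
    """Absolute difference between the two correlation matrices."""
    corr_o = correlation_matrix(original)
    corr_s = correlation_matrix(synthetic)
    return {a: {b: abs(corr_o[a][b] - corr_s[a][b]) for b in corr_o[a]}
            for a in corr_o}


# Toy numeric columns invented for illustration.
original = {"age": [25, 38, 47, 52, 61], "hours_per_week": [40, 45, 50, 38, 30]}
synthetic = {"age": [27, 36, 45, 55, 60], "hours_per_week": [42, 44, 48, 40, 28]}

diff = difference_matrix(original, synthetic)
for row, cols in diff.items():
    print(row, {col: round(value, 3) for col, value in cols.items()})
```

Cells near zero in the difference matrix mean the synthetic data preserves that pairwise relationship; larger cells point to where it drifts.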

[00:04:28] We can explore a breakdown of accuracy for each data column, as well as individual plots showing how tightly the distributions are mirrored in the synthetic data.

[00:04:40] All of these charts are interactive, and it's really easy to search for particular columns of interest, such as male-female split, and then expand this view for a closer look.

[00:04:52] We can keep our focus on this column, and explore how multiple columns relate to each other, such as with marital status, which again, we can expand and explore further.

[00:05:04] To the left are the original patterns, and then over to the right, in green, are the patterns found in the synthetic dataset.

[00:05:12] Very similar distributions between the two, indicating that our synthetic data is a great proxy for this original census data.

[00:05:22] From here, we can quickly download our synthetic data in CSV or Parquet files. We can, of course, generate more synthetic data using our trained model, or we can share our results with colleagues in order to collaborate further using our privacy-enhanced synthetic data within MOSTLY AI.

[00:05:43] Now, for data science teams, this is just the start of the journey. Let's talk about how synthetic data can be used to augment our machine learning workflows.

[00:05:54] For example, many datasets contain imbalance in the distribution of data for certain attributes or categories. Dealing with this imbalance traditionally requires sampling down a dataset, removing records in order to ensure that machine learning models train without bias, and often losing considerable detail in the process.
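The traditional downsampling approach described here can be sketched as follows. It is a minimal illustration with hypothetical names and toy records: every category is cut down to the size of the smallest one, and the remaining records are thrown away.

```python
import random
from collections import Counter


def downsample(records, key, seed=0):
    """Randomly drop records so every category of `key` has the same count."""
    rng = random.Random(seed)
    groups = {}
    for record in records:
        groups.setdefault(record[key], []).append(record)
    smallest = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        # Keep only `smallest` randomly chosen records from each category.
        balanced.extend(rng.sample(group, smallest))
    return balanced


# Toy data invented for illustration: 67% Male, 33% Female.
records = ([{"sex": "Male"} for _ in range(670)]
           + [{"sex": "Female"} for _ in range(330)])

balanced = downsample(records, "sex")
print(Counter(r["sex"] for r in balanced))
print(len(records) - len(balanced), "records discarded")  # prints 340 records discarded
```

In this toy case a third of the data is discarded just to balance one column, which is the loss of detail the transcript refers to.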

[00:06:14] With MOSTLY AI, we can use synthetic data to rebalance the dataset to increase the representation of minority categories without having to throw data away.

[00:06:25] Let's take our census data one more time, and let's explore how we can use synthetic data to rebalance a dataset, to increase the representation of minority categories. Rebalancing gives our machine learning models a fair chance to learn all the classes present in the data, and enhances the model's ability to make accurate predictions.

[00:06:48] Let's head to our data settings, and let's rebalance the male-female split in this census dataset. We choose our field, we click on the settings icon to the right, and flip the switch to use this column to rebalance the table. Let's add a row, and we'll make sure that females are now represented as 50% of the generated synthetic data, by intentionally upsampling the data as part of our process.

[00:07:18] Let's click save, and we can create our synthetic dataset as before. Now, once this is finished, let's take a look at the distribution of our data using the MOSTLY AI QA reports.

[00:07:30] Now, remember, the model QA shows us how the model learns from the original data, and how well it recreates the original distributions. In contrast, the data QA report shows us that we have a new distribution that respects our desired split for male and female. Over here to the left, we can see that females now make up approximately 50% of this sample, up from just 33% in our original source data.
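As a rough worked example of the arithmetic behind moving a category from 33% to 50%, the sketch below computes how many extra minority records classical upsampling would need while keeping every existing record. It is an illustration under assumed numbers (a table of 1,000 rows, 330 of them female) and a hypothetical helper name; in the demo, the platform instead generates a fresh synthetic dataset with the requested share.

```python
def extra_minority_records(total, minority, target_share):
    """Number of minority records to add so they make up `target_share`.

    Solves (minority + x) / (total + x) = target_share for x.
    """
    if not 0 < target_share < 1:
        raise ValueError("target_share must be between 0 and 1")
    if minority / total >= target_share:
        return 0  # already at or above the target share
    extra = (target_share * total - minority) / (1 - target_share)
    return round(extra)


# Assumed numbers for illustration: 1,000 rows, 330 of them female (33%).
extra = extra_minority_records(total=1000, minority=330, target_share=0.5)
print(extra)                         # prints 340
print((330 + extra) / (1000 + extra))  # prints 0.5
```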

[00:07:59] That's not all. By changing the distribution of females in our generated data, we get notable changes in other distributions, too. Some of them are straightforward, such as fewer husbands in the data, but others are much more interesting, such as significant changes in marital status or occupation, as well as many others.

[00:08:20] The generative model understands the correlations between all these columns in the original dataset, and changes the synthetic output data accordingly. For data science teams, the rebalancing of data categories is just one approach to using synthetic data as a valuable asset for augmenting, imputing, and experimenting with data in a privacy-safe way, with no direct links back to the original source.

[00:08:49] We can share it more easily between our teams, and we can start collaborating to unlock our data in innovative new ways.

[00:08:57] We can't wait to see what you can achieve with MOSTLY AI and synthetic data.

[00:09:02] Visit mostly.ai to get started today.

[00:09:06] [music]
