August 29, 2023
3m 21s

Rebalance data with MOSTLY AI's synthetic data platform


📺 In this tutorial, we delve into the powerful rebalancing feature of MOSTLY AI's Synthetic Data Platform. We'll walk you through the process of reshaping variable distributions within a synthetic dataset. With MOSTLY AI's Synthetic Data Platform, rebalancing data distributions becomes a breeze, fostering more accurate AI models and insightful data analysis. If you're ready to experience the power of rebalancing, give it a try and witness the transformation of your synthetic dataset into a more representative mirror of reality.

⏱️ Timestamps:

00:00 Introduction to AI-powered Synthetic Data Generation
00:14 Uploading and Setting Up the US Census Income Dataset
00:37 Exploring Rebalancing with MOSTLY AI's Platform
00:54 Rebalancing the Gender Column
01:24 Understanding the Synthetic Data Preview and QA Report
02:03 Unveiling the Rebalanced Data Distribution
02:18 Demonstrating Impact on Other Variables
02:53 Realistic Population Distributions

🔗 Useful Links and Resources:

MOSTLY AI's Synthetic Data Platform ➡️ https://bit.ly/43IGYSv
US Census Income Dataset ➡️ https://mostly.ai/docs/datasets
Read about data augmentation ➡️ https://bit.ly/3OYcw1H

📈 Why rebalance? The impact and benefits:
🔹 Simulate minority class scenarios with precision.
🔹 Achieve realistic distributions that mirror actual populations.
🔹 Enhance AI model training by introducing diverse datasets.
🔹 Boost insights and accuracy in data-driven decision-making.

🚀 Ready to Get Started?
Visit MOSTLY AI's Synthetic Data Platform today and discover how you can harness the potential of rebalancing to elevate your AI projects to the next level.
➡️ https://bit.ly/43IGYSv

📢 Stay Connected:
For more exciting tutorials, tips, and insights, subscribe to our channel!

👍 Did you find this video helpful?
Give us a thumbs up and share your thoughts in the comments below. We love hearing from you!

Transcript

[00:00:00] Hi, there. Today, I'm going to show you the rebalancing feature of the MOSTLY AI Synthetic Data Platform, a powerful feature that allows you to change the target distribution of a variable in a synthetic dataset.

[00:00:14] For that, I'm going to use the US Census Income dataset, a dataset that contains about 50,000 records, 13 columns of data, with variables such as age, marital status, occupation, and so forth. We're going to upload that dataset here to our platform. I'm going to click Proceed here.

[00:00:37] We go into Data settings where we see all the variables. In this dataset, what we want to do is we want to rebalance the gender, the sex column. We go here into that. It's a categorical column that has either female or male in this dataset.

[00:00:54] What I'm going to do is I'm going to select here "Use this column to rebalance data" and I will add a row. I will say, female, I want to have 50%. We don't need to define the share of male or other categories here. That's going to be done automatically by the platform.
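
The way the unspecified categories could be filled in can be sketched in a few lines of Python. This is only an illustration, not the platform's implementation: the function name `fill_remaining_shares` is hypothetical, and the rule that the remaining categories keep their original proportions is an assumption, since the video does not say how the platform assigns them.

```python
def fill_remaining_shares(original, targets):
    """Return a share for every category.

    original: dict mapping category -> share in the original data (sums to 1).
    targets:  dict mapping category -> requested share after rebalancing.
    Categories without a requested share split the remaining probability
    mass in proportion to their original shares (assumption of this sketch).
    """
    fixed = sum(targets.values())
    if not 0.0 <= fixed <= 1.0:
        raise ValueError("requested shares must sum to a value in [0, 1]")
    free = {c: s for c, s in original.items() if c not in targets}
    free_total = sum(free.values())
    result = dict(targets)
    for category, share in free.items():
        result[category] = (1.0 - fixed) * share / free_total if free_total else 0.0
    return result


# Approximate shares quoted in the video for the sex column.
original_shares = {"Male": 0.67, "Female": 0.33}
rebalanced = fill_remaining_shares(original_shares, {"Female": 0.5})
print(rebalanced)
```

With only two categories, setting female to 50% leaves the other 50% for male, which is why a single row in the rebalancing settings is enough here.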

[00:01:13] I'm going to click Save and then Launch job, selecting your target destination, and off we go.

[00:01:24] I already completed this previously, so we don't have to wait. It's going to take about two minutes or so, but let's have a look now at the synthetic dataset. First, you're in the synthetic data preview. We see the dataset.

[00:01:35] Then, when we go into QA report, let's have a look here at what we can see. I'm going to go here into the Univariate distribution of the Model QA report, and here we see that the distribution of the synthetic data between your male and female is exactly like in the original, in the target data.

[00:01:55] It's about 66% males and about 33% females in the dataset, but if you now look here into the Data QA report,

[00:02:03] then go into the Univariate distribution here, we now see here the rebalanced data.

[00:02:08] Here we can see that the synthetic data shows about 49.8% males and 50% females. Exactly as we had defined.
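
The univariate comparison shown in the QA report can be reproduced by hand on any exported column. The sketch below uses only the Python standard library; the two toy columns are made up to match the approximate percentages mentioned in the video, and the helper names are hypothetical.

```python
from collections import Counter


def category_shares(values):
    """Return the share of each category in a list of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def share_differences(before, after):
    """Return after-minus-before share for every category seen in either input."""
    categories = set(before) | set(after)
    return {c: after.get(c, 0.0) - before.get(c, 0.0) for c in categories}


# Toy columns (not the real dataset): 67/33 originally, 50/50 after rebalancing.
original_sex = ["Male"] * 67 + ["Female"] * 33
rebalanced_sex = ["Male"] * 50 + ["Female"] * 50

before = category_shares(original_sex)
after = category_shares(rebalanced_sex)
print(before)
print(after)
print(share_differences(before, after))
```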

[00:02:18] The cool thing about this feature is that not only this column is going to be rebalanced, but actually, all the other columns, all the other variables in that synthetic dataset will also be modified to reflect the change that we would expect to see in such a dataset where there's 50% females.

[00:02:35] For example, if we look at the relationship column, what we can see here is that the share of husbands went down from about 40% to 30%, whereas the shares of some of the other categories went up. For example, unmarried went up from 10% to 13%, wife went up from 4% to 7%, and so forth.

[00:02:53] This is actually how we would expect a population to look if it were split evenly, 50/50, between females and males.
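
The shift in the relationship column can be sanity-checked with simple arithmetic. The sketch below assumes that "husband" occurs only in male records and "wife" only in female records, and that the proportion of each within its group stays the same after rebalancing. Both are assumptions made for illustration; the video does not describe how the platform computes the new shares. The input percentages are the rounded figures quoted in the video.

```python
def rescaled_share(share_before, group_share_before, group_share_after):
    """Share of a category that occurs only inside one group, after the
    group's overall share changes and the within-group proportion is kept."""
    within_group = share_before / group_share_before
    return within_group * group_share_after


# Husbands: about 40% overall while males are about 67% of the records.
husband_after = rescaled_share(0.40, 0.67, 0.50)
# Wives: about 4% overall while females are about 33% of the records.
wife_after = rescaled_share(0.04, 0.33, 0.50)

print(f"husband: {husband_after:.1%}, wife: {wife_after:.1%}")
```

Under these assumptions the husband share drops to roughly 30% and the wife share rises to roughly 6%, which is in line with the direction and approximate size of the changes shown in the report.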

[00:03:04] This is a powerful feature that allows you to modify the distribution of a target variable in a synthetic dataset. It lets you upsample minority classes and simulate what your data would look like under certain distributions.
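
For contrast, a naive way to reach a target distribution without a generative model is weighted resampling with replacement. The sketch below is a baseline for comparison only and is not the method used by the platform; the function name and the toy records are hypothetical. It duplicates existing minority records rather than generating new ones.

```python
import random
from collections import Counter


def resample_to_target(records, key, target_shares, n, seed=0):
    """Draw n records with replacement so that the categories of `key`
    follow target_shares in expectation."""
    counts = Counter(r[key] for r in records)
    total = len(records)
    # Weight each record by target share divided by its current share.
    weights = [target_shares[r[key]] / (counts[r[key]] / total) for r in records]
    rng = random.Random(seed)
    return rng.choices(records, weights=weights, k=n)


# Toy records (not the real dataset): 67 male and 33 female rows.
toy = [{"id": i, "sex": "Male"} for i in range(67)]
toy += [{"id": i, "sex": "Female"} for i in range(67, 100)]

sample = resample_to_target(toy, "sex", {"Male": 0.5, "Female": 0.5}, n=10_000)
female_share = sum(r["sex"] == "Female" for r in sample) / len(sample)
print(f"female share after resampling: {female_share:.3f}")
```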

[00:03:17] I hope you enjoyed the video. Thank you. Bye.
