December 11, 2023
3m 15s

MOSTLY AI: Smart imputation for improved data quality


Missing values are an issue for both humans and machines. Learn how to address the problem of missing data points using a synthetic data generator. MOSTLY AI, the world's leading synthetic data generator, works like autocorrect for your data, filling in the gaps with realistic, statistically representative synthetic data.

Using a real-world example, the UCI Adult Income Survey dataset, this tutorial guides you through configuring Smart Imputation on MOSTLY AI's free synthetic data platform. The resulting synthetic dataset not only eliminates missing values but also closely models the original distribution, showcasing the power of generative AI for realistic value replacements. MOSTLY AI's Smart Imputation makes data easier for humans to read and improves the training data used for machine learning models.

🔍 Key highlights include:
00:00 - 00:03: Introduction to Dealing with Datasets and Missing Values
00:14 - 00:29: MOSTLY AI's Solution: Smart Imputation
01:25 - 01:39: Creating Synthetic Dataset with Smart Imputation
01:43 - 01:55: Analyzing Synthetic Data and Gap Closure
02:07 - 02:19: Impact of Smart Imputation on Data Distribution
02:32 - 02:53: Recovery of Original Distribution through Imputation

🔗 Sign up for a free account on MOSTLY AI's synthetic data platform: https://tinyurl.com/ymen9zz7
🔗 Learn more about Smart data imputation from this blog: https://tinyurl.com/yjssz6bw

Transcript

[00:00:00] Let's talk about dealing with datasets that contain missing values.

[00:00:04] This can be a real challenge, especially if the remaining values we have end up creating a distorted or biased view of the overall distribution.

[00:00:14] With MOSTLY AI, we can help close the gaps in your data through a process we call smart imputation. Let's take a look with a quick example.

[00:00:23] Suppose I take a known dataset, such as the famous UCI Adult Income Survey from the US Census Bureau containing over 48,000 records. We can deliberately remove a significant portion of the age column from this data, blanking out around a third of the total values, with a deliberate bias towards removing older ages from the data.
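
The kind of biased removal described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ages are uniform random stand-ins rather than the real survey column, and the helper name `blank_out_older_ages` and its linear drop probability are hypothetical choices, not the procedure used in the video.

```python
import random
import statistics

def blank_out_older_ages(ages, max_drop_prob=2 / 3, seed=0):
    """Return a copy of ages where older values are more likely to be None."""
    rng = random.Random(seed)
    lo, hi = min(ages), max(ages)
    span = (hi - lo) or 1  # avoid division by zero if all ages are equal
    result = []
    for age in ages:
        # Drop probability rises linearly from 0 (youngest) to max_drop_prob (oldest).
        drop_prob = max_drop_prob * (age - lo) / span
        result.append(None if rng.random() < drop_prob else age)
    return result

# Hypothetical stand-in for the age column: uniform ages, not the real survey data.
rng = random.Random(42)
true_ages = [rng.randint(17, 90) for _ in range(10_000)]
ages_with_gaps = blank_out_older_ages(true_ages)

observed = [a for a in ages_with_gaps if a is not None]
missing_share = 1 - len(observed) / len(true_ages)
print(f"missing share: {missing_share:.2f}")
print(f"true mean: {statistics.mean(true_ages):.1f}, "
      f"observed mean: {statistics.mean(observed):.1f}")
```

With these settings roughly a third of the values end up blank, and the mean of the remaining ages sits below the true mean, which is the distortion the example sets out to create.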

[00:00:48] In most analyses, this would be a significant problem, especially for basic imputation methods that fill these blanks either with static values or basic approximations, like a mean or a median value from the rest of the column.
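
To see why a single static fill value falls short, here is a minimal sketch of mean and median imputation using only Python's standard library. The function name `impute_constant` and the five sample ages are made up for illustration.

```python
import statistics

def impute_constant(values, strategy="mean"):
    """Fill None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Hypothetical ages where the two oldest respondents were blanked out.
ages = [20, 30, None, 40, None]
filled = impute_constant(ages, "mean")
observed = [v for v in ages if v is not None]

print(filled)
print(statistics.pstdev(observed), statistics.pstdev(filled))
```

Every gap receives the same value, so the filled column clusters around the observed mean and its spread shrinks. If the missing values were mostly older ages, none of that is recovered.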

[00:01:03] We can explore samples of this original data, seeing the gaps in our age column, confirming the extent of the issue, even plotting a distribution of the original data with these missing ages.

[00:01:16] Let's use MOSTLY AI to load this dataset with the missing values and head into our data settings to set up the imputation.

[00:01:25] Scroll down to the age column and let's configure our settings. We'll say yes to Smart Imputation, and click Save. It's as simple as that.

[00:01:35] We create our synthetic dataset, wait for the model to train and the generated synthetic data to arrive.

[00:01:43] Once the process completes, let's download our results and run a few analyses to see how our synthetic data stacks up against the original.

[00:01:53] We can load our synthetic dataset into our notebook, sample some records in the data, even confirming that there are no longer any gaps in our age column.
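
A gap check like this one might look as follows in plain Python. The CSV snippets are tiny hypothetical stand-ins for the original and downloaded files, and `count_missing` and `read_rows` are illustrative helpers rather than part of any MOSTLY AI tooling.

```python
import csv
import io

def read_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def count_missing(rows, column):
    """Count rows whose value in `column` is absent or blank."""
    return sum(1 for row in rows if not (row.get(column) or "").strip())

# Hypothetical miniature files; a real run would read the downloaded CSVs.
original_csv = "age,education\n39,Bachelors\n,HS-grad\n52,Masters\n,Doctorate\n"
synthetic_csv = "age,education\n41,Bachelors\n28,HS-grad\n55,Masters\n63,Doctorate\n"

original_gaps = count_missing(read_rows(original_csv), "age")
synthetic_gaps = count_missing(read_rows(synthetic_csv), "age")
print(f"original gaps: {original_gaps}, synthetic gaps: {synthetic_gaps}")
```

A count of zero for the synthetic file is the confirmation that the age column no longer has gaps.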

[00:02:04] Let's also plot our original distribution with missing values

[00:02:07] against our synthetic data, and it's immediately clear that our new synthetic data distribution is quite different from where we started as a result of smart imputation.

[00:02:20] We can do one more important check at this point, referring back to our original or ground truth dataset that actually contains the full range of age values before they were removed.

[00:02:32] If we load this dataset into our notebook and redraw our plots, we can see that smart imputation from MOSTLY AI is able to model the missing values very closely in order to recover the original distribution in our starting dataset.
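
The video makes this comparison visually with plots. As a numeric stand-in, one could bin the ages and compute the total variation distance between two histograms. That metric is an assumption made for this sketch, not something the video uses, and the three sample lists are toy values, not results from the demo.

```python
from collections import Counter

def age_histogram(ages, bin_width=10):
    """Share of non-missing ages per bin, keyed by the bin's lower edge."""
    observed = [a for a in ages if a is not None]
    counts = Counter((a // bin_width) * bin_width for a in observed)
    return {edge: n / len(observed) for edge, n in counts.items()}

def total_variation_distance(ages_a, ages_b, bin_width=10):
    """0.0 means identical binned distributions, 1.0 means no overlap."""
    hist_a = age_histogram(ages_a, bin_width)
    hist_b = age_histogram(ages_b, bin_width)
    edges = set(hist_a) | set(hist_b)
    return 0.5 * sum(abs(hist_a.get(e, 0.0) - hist_b.get(e, 0.0)) for e in edges)

# Hypothetical toy samples, not results from the video.
ground_truth = [22, 35, 41, 58, 63, 70]
with_gaps = [22, 35, 41, None, None, None]
synthetic = [24, 33, 47, 55, 61, 72]

print(total_variation_distance(ground_truth, with_gaps))
print(total_variation_distance(ground_truth, synthetic))
```

A smaller distance to the ground truth for the synthetic column than for the column with gaps would correspond to the recovered distribution seen in the plots.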

[00:02:49] Smart imputation uses the power of our generative AI approach to produce value replacements in a dataset that are both relevant and highly realistic.

[00:02:59] Visit mostly.ai to sign up for a free account to get started on your synthetic data journey.

[00:03:05] [music]
