[00:00:00] Let's talk about dealing with datasets that contain missing values.
[00:00:04] This can be a real challenge, especially if the values that remain give a distorted or biased view of the overall distribution.
[00:00:14] With MOSTLY AI, we can help close the gaps in your data through a process we call smart imputation. Let's take a look with a quick example.
[00:00:23] Suppose I take a known dataset, such as the famous UCI Adult income dataset, drawn from US Census Bureau data and containing over 48,000 records. We can deliberately remove a significant portion of the age column from this data, blanking out around a third of the total values, with a bias towards removing older ages.
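The biased removal described here can be sketched in a few lines. This is a minimal illustration, not the procedure used in the video: the function name `blank_ages_with_bias`, the age-proportional removal probability, and the uniformly random toy ages are all assumptions made for the example.

```python
import random


def blank_ages_with_bias(ages, target_fraction=1 / 3, seed=0):
    """Return a copy of `ages` with some entries set to None.

    Each age is blanked with probability proportional to the age itself,
    scaled so that about `target_fraction` of all values are removed.
    This weighting is an assumption made for illustration only.
    """
    rng = random.Random(seed)
    mean_age = sum(ages) / len(ages)
    blanked = []
    for age in ages:
        # Older ages get a higher removal probability (capped at 1.0).
        p_remove = min(1.0, target_fraction * age / mean_age)
        blanked.append(None if rng.random() < p_remove else age)
    return blanked


# Toy stand-in for the age column: uniform random ages, not real census data.
toy_rng = random.Random(42)
toy_ages = [toy_rng.randint(17, 90) for _ in range(10_000)]
ages_with_gaps = blank_ages_with_bias(toy_ages)

observed = [a for a in ages_with_gaps if a is not None]
missing_fraction = 1 - len(observed) / len(toy_ages)
print(f"missing fraction: {missing_fraction:.2f}")
print(f"mean age before: {sum(toy_ages) / len(toy_ages):.1f}")
print(f"mean age of remaining values: {sum(observed) / len(observed):.1f}")
```

Because older ages are removed more often, the mean of the remaining values sits below the mean of the full column, which is the kind of distortion the narration refers to.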
[00:00:48] In most analyses, this would be a significant problem, especially for basic imputation methods that fill these blanks either with static values or basic approximations, like a mean or a median value from the rest of the column.
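To see why static fills fall short, here is a small sketch of mean or median imputation on a made-up age list. The helper `impute_constant` and every number below are hypothetical; the point is only that a single fill value inherits the bias of the values that were left behind.

```python
import statistics


def impute_constant(values, strategy="median"):
    """Fill every None with one static value computed from the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mean":
        fill = statistics.mean(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Made-up example: the three oldest people are the ones with missing ages.
true_ages = [22, 25, 31, 38, 45, 61, 67, 72]
ages_with_gaps = [22, 25, 31, 38, 45, None, None, None]

imputed = impute_constant(ages_with_gaps, strategy="median")
print(imputed)
print("true mean:", statistics.mean(true_ages))
print("mean after median fill:", statistics.mean(imputed))
```

Every gap receives the same value, taken from the younger records that remain, so the filled column stays skewed young and gains an artificial spike at the fill value.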
[00:01:03] We can explore samples of this original data, seeing the gaps in our age column, confirming the extent of the issue, even plotting a distribution of the original data with these missing ages.
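A check of this kind might look like the following in a notebook. It is a sketch using plain Python records rather than any particular dataframe library, and the `missing_share` helper and the sample rows are invented for illustration.

```python
def missing_share(records, column):
    """Fraction of records whose `column` value is missing (None or absent)."""
    if not records:
        raise ValueError("records must not be empty")
    missing = sum(1 for row in records if row.get(column) is None)
    return missing / len(records)


# Invented sample rows; a real notebook would load these from the dataset file.
sample_rows = [
    {"age": 39, "education": "Bachelors"},
    {"age": None, "education": "HS-grad"},
    {"age": 28, "education": "Masters"},
    {"age": None, "education": "Doctorate"},
    {"age": 45, "education": "HS-grad"},
    {"age": 33, "education": "Bachelors"},
]

age_missing_share = missing_share(sample_rows, "age")
print(f"share of missing ages: {age_missing_share:.0%}")
```

The same helper can be run again on the generated data later to confirm that the share of missing ages has dropped to zero.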
[00:01:16] Let's use MOSTLY AI to load this dataset with the missing values and head into our data settings to set up the imputation.
[00:01:25] Scroll down to the age column and let's configure our settings. We'll say yes to Smart Imputation, and click Save. It's as simple as that.
[00:01:35] We create our synthetic dataset, then wait for the model to train and for the generated synthetic data to arrive.
[00:01:43] Once the process completes, let's download our results and run a few analyses to see how our synthetic data stacks up against the original.
[00:01:53] We can load our synthetic dataset into our notebook, sample some records in the data, and even confirm that there are no longer any gaps in our age column.
[00:02:04] Let's also plot our original distribution with missing values
[00:02:07] against our synthetic data, and it's immediately clear that our new synthetic data distribution is quite different from where we started as a result of smart imputation.
[00:02:20] We can do one more important check at this point, referring back to our original or ground truth dataset that actually contains the full range of age values before they were removed.
[00:02:32] If we load this dataset into our notebook and redraw our plots, we can see that smart imputation from MOSTLY AI is able to model the missing values very closely in order to recover the original distribution in our starting dataset.
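Alongside the plots, the comparison can be put into a single number. The sketch below bins ages into decades and measures the total variation distance between two binned distributions; this metric is a choice made for the example, not something named in the video, and the three short lists are made-up stand-ins rather than MOSTLY AI output.

```python
from collections import Counter


def age_histogram(ages, bin_width=10):
    """Share of non-missing ages per bin, keyed by the bin's lower edge."""
    observed = [a for a in ages if a is not None]
    counts = Counter((a // bin_width) * bin_width for a in observed)
    return {edge: n / len(observed) for edge, n in counts.items()}


def total_variation_distance(p, q):
    """Half the sum of absolute differences between two binned distributions."""
    bins = set(p) | set(q)
    return 0.5 * sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in bins)


# Made-up stand-ins for the three datasets compared in the narration.
ground_truth = [23, 27, 34, 38, 41, 46, 52, 58, 63, 67]
with_gaps = [23, 27, 34, 38, 41, 46, None, 58, None, None]
imputed_candidate = [24, 29, 33, 37, 44, 47, 51, 56, 62, 68]

truth_hist = age_histogram(ground_truth)
gap_distance = total_variation_distance(truth_hist, age_histogram(with_gaps))
imputed_distance = total_variation_distance(
    truth_hist, age_histogram(imputed_candidate)
)
print(f"distance, data with gaps vs ground truth: {gap_distance:.3f}")
print(f"distance, imputed data vs ground truth:   {imputed_distance:.3f}")
```

A distance near zero means the two age distributions nearly coincide, so a lower value for the imputed data than for the data with gaps would back up what the plots show.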
[00:02:49] Smart imputation uses the power of our generative AI approach to produce replacement values in a dataset that are both relevant and highly realistic.
[00:02:59] Visit mostly.ai to sign up for a free account to get started on your synthetic data journey.
[00:03:05] [music]