[00:00:00] Hi and welcome to this tutorial on Smart Imputation. In this tutorial you will learn how to deal with missing values in your data set using MOSTLY AI's Smart Imputation feature. As you probably know, dealing with missing values in your data set can be quite a challenge, especially if these missing values are not statistically representative of the population you're trying to study.
[00:00:23] Sometimes missing values are actually important information, right? So you might have a death date column in your data set and if it is empty for most of the people that are in your data set, it's safe to assume that these people are still alive and that therefore the missingness of this value is correct and actually important information.
[00:00:39] But many times missing values can actually be a problem. This may be a problem with data collection: some people may not have responded to a specific question in your survey or there may be technical or ethical or other issues that have prevented this specific value from being collected properly. This means that the data that you now have is not entirely representative of the population you're trying to study and as a data analyst this can be a problem.
[00:01:09] This can also be a problem for downstream machine learning models that cannot handle missing values. This means you need to either impute the missing values, ideally in as smart and statistically representative a way as possible, or leave these records out of your analysis, which leads to a significant loss of data value.
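To make the two options just mentioned concrete, here is a minimal sketch on a tiny made-up table (the column names and values are assumptions for illustration, not the tutorial data). It contrasts dropping incomplete records with a naive mean fill; neither is the Smart Imputation feature.

```python
import numpy as np
import pandas as pd

# Toy data: a hypothetical survey in which some ages were not recorded.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 47, np.nan, 58, 61, np.nan],
    "income": [30, 42, 55, 61, 70, 66, 72, 80],
})

# Option 1: leave incomplete records out. The income values on those rows are lost too.
dropped = df.dropna(subset=["age"])

# Option 2: naive imputation with the observed mean. All rows are kept, but every
# gap gets the same value, which flattens the spread of the age column.
mean_filled = df.assign(age=df["age"].fillna(df["age"].mean()))

print(f"rows before: {len(df)}, rows after dropping: {len(dropped)}")
print(f"age std observed: {df['age'].std():.1f}, after mean fill: {mean_filled['age'].std():.1f}")
```

Dropping rows shrinks the data set, while the mean fill keeps the rows but understates the variability of age. Both are shown only as baselines.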
[00:01:26] In this tutorial you will learn how to use MOSTLY AI's Smart Imputation feature to close these kinds of missing value gaps in your data. You will start with an original data set which has a significant portion of missing values for one of the columns. You will then create a synthetic version of this data set using MOSTLY AI and enable the Smart Imputation feature in order to get rid of these missing values in a statistically representative way.
[00:01:49] With this smartly imputed synthetic data set you can then proceed to analyze the population as if the missing values were not there in the first place and get insights from your analysis that are more reliable and less biased than what you would have gotten from the raw, incomplete data.
[00:01:59] As you'll see by the end of this tutorial, using the Smart Imputation feature enables you to accurately uncover the original distribution of the population.
[00:02:08] This means you're essentially solving the problem of missing values. For this tutorial, you'll be working with a modified version of the UCI Adult Income data set. We have tweaked this data set a little bit for this tutorial so that around 33% of the values in the age column are missing.
[00:02:26] These missing values were assigned randomly but with a specific bias so that we ended up missing more values for the older parts of the population. Let's start by taking a look at this original data set with missing values that we're starting with and then we'll jump into synthesizing and using the Smart Imputation feature.
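The kind of biased missingness described here can be sketched as follows. This is an illustrative simulation on made-up ages, assuming a missingness probability that grows linearly with age; it is not the procedure that was actually used to prepare the tutorial data set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical stand-in for a complete age column (not the tutorial data).
true_age = pd.Series(rng.integers(17, 91, size=10_000), name="age").astype(float)

# Assumed rule: the chance of a value going missing grows linearly with age,
# scaled so that roughly a third of all values are removed overall.
weights = true_age - true_age.min() + 1
p_missing = (weights / weights.mean() * 0.33).clip(upper=0.95)
mask = rng.random(len(true_age)) < p_missing.to_numpy()

# Series.mask replaces entries with NaN wherever the condition is True.
observed_age = true_age.mask(mask)

print(f"share missing: {observed_age.isna().mean():.1%}")
print(f"mean of complete ages: {true_age.mean():.1f}")
print(f"mean of observed ages: {observed_age.mean():.1f}")
```

Because older records are removed more often, the mean of the observed ages comes out lower than the mean of the complete column, which is the bias an analyst would inherit from the incomplete data.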
[00:02:43] If you're running this in Google Colab like me, you can just run the cell. If you're running locally, just make sure that the repo variable contains the right path to your code. Load the data set with the missing values in place and sample 10 random records. We see here that we have some missing values for the age column, as expected.
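A rough sketch of the load-and-sample step, assuming pandas. The inline CSV text and its column names are hypothetical stand-ins so the snippet runs on its own; in the notebook you would point `pd.read_csv` at the file location built from the repo variable instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for the tutorial CSV (made-up rows, not the real file).
csv_text = """age,education,income
39,Bachelors,<=50K
,HS-grad,>50K
38,HS-grad,<=50K
53,11th,<=50K
28,Bachelors,<=50K
,Masters,>50K
49,9th,<=50K
,HS-grad,>50K
31,Masters,>50K
42,Bachelors,>50K
,Some-college,>50K
23,Bachelors,<=50K
"""

# Empty fields in a numeric column are parsed as NaN.
df = pd.read_csv(io.StringIO(csv_text))

# Draw 10 random records; random_state makes the draw repeatable.
sample = df.sample(n=10, random_state=42)
print(sample)
```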
[00:03:07] Let's report the share of missing values for the entire column - we see that we have around 33% of missing values for the age column. We can then plot the distribution and we see here the age distribution of the original data set, including the missing values.
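The two steps in this segment might look like the sketch below, assuming pandas and matplotlib. The ages are made up, and the plot style is an assumption rather than the notebook's actual chart.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical age column with gaps (not the tutorial data).
age = pd.Series(
    [39, np.nan, 38, 53, 28, np.nan, 49, np.nan, 31, 42, np.nan, 23], name="age"
)

# isna() yields booleans; their mean is the fraction of missing entries.
missing_share = age.isna().mean()
print(f"{missing_share:.1%} of age values are missing")

# Histogram of the observed ages only; NaN values are dropped before plotting.
fig, ax = plt.subplots()
ax.hist(age.dropna(), bins=range(15, 95, 10), color="dimgray")
ax.set_xlabel("age")
ax.set_ylabel("count")
ax.set_title("Observed age distribution (missing values excluded)")
```

In a notebook the figure renders inline; in a plain script you would add `plt.show()` or `fig.savefig(...)`.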
[00:03:27] This is not the original UCI Adult Income data set but the modified version of the data set that we've created with around 33% of missing values for the age column.
[00:03:37] Let's now go to our MOSTLY AI account, synthesize this data set, and enable the Smart Imputation feature to see how this affects the missing values. First, download the data set by clicking on the link and pressing Ctrl+S or Cmd+S, depending on your operating system, to save it to disk. Then navigate to your MOSTLY AI account, launch a new synthetic data job, and upload the file that we've just downloaded.
[00:04:10] Click proceed. Once this has been uploaded we can navigate to data settings and go to the
[00:04:16] age column and enable the Smart Imputation feature. Here, as you can read, Smart Imputation allows you to generate data without any missing values even if these existed within the original data. Enable it, save, and then go ahead and launch the job. Now I've already launched this job before and it's completed, so I can go ahead and download the synthetic data here as CSV.
[00:04:47] Now before we return to our notebook we can also take a look at the QA reports here. This gives you a nice overview of the quality of the synthetic data, and it's split into two parts: a model QA report, which reports on the underlying model, and a data QA report, which reports on the quality of the data that's been output.
[00:05:00] So the model QA report reports on the accuracy and statistical qualities of the underlying model that has been trained on the training data that we have provided. So in this case we should see a nice match between the distributions of the data that we put in which included about 33% missing values for the age column and the distributions that the model has learned from this.
[00:05:28] If we go to the univariate distributions and go down to the age column, we see in dark gray the original distribution and in bright green the synthetic one. We see that these match up nicely and both include a share of missing values around 33-34%.
[00:05:50] Now if we go to the data QA report, this reports on the actual data that was output, including any programmable features such as Smart Imputation. So while the original data set had 33% missing values for the age column, in the output data we specifically requested that these missing values be removed in a statistically representative manner using the Smart Imputation feature.
[00:06:13] If we go to the univariate distributions here and go to age, we see (let's start here with missingness) that the original data set has around 33% missing values for the age column and the synthetic version of the data set has zero. This is correct and this is exactly what we asked for.
[00:06:32] If we now take a look at the left hand plot we see that the
[00:06:37] effect of removing this missingness is that our distribution has shifted to the right. So in dark gray we see the original data set distribution with the missing values and in light green we see the synthetic distribution in which the missing values have been imputed and we see that this shifts the overall distribution to the right by accounting for more of the older segments of the population.
[00:07:01] Now we can return to our code and upload the synthetic data file here if you're running in Colab. We can then proceed to sample 10 random records and see here that we have, at least in this small sample, no missing values for the age column. We can confirm this by running the next cell to see that we have 0% of values missing for the age column.
[00:07:32] Let's now plot side by side the original data distribution with the 33% missing values and the imputed synthetic data. This basically replicates what we saw in the QA report, right, so we have this dark gray graph here which is the original data that we trained on which included 33% missing values for the age column and then the imputed synthetic data which shifts to the right.
[00:07:58] Now what would be really interesting is to compare this with the real original UCI Adult Income data set which was not tampered with, so that has no missing values for the age column to see how that distribution compares to the synthetically imputed distribution that we've created using MOSTLY AI. Of course in the ideal scenario, this would match up quite nicely.
[00:08:22] So let's run the next cell and we see here that we have in black again the original distribution with the 33% missing values and then in red the ground truth, so the original UCI Adult Income data set and in light green MOSTLY AI's smartly imputed synthetic data set which matches very nicely the original ground truth data set.
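Beyond the visual comparison, one way to put a number on how closely two age distributions match is the total variation distance between their binned histograms. The sketch below is an assumption about how one might do this with NumPy; it is not part of the notebook or the MOSTLY AI QA report. The `truth` and `observed` arrays are simulated stand-ins, not the UCI data.

```python
import numpy as np

def total_variation_distance(a, b, bins):
    """Half the L1 distance between two normalized histograms (0 means identical)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Drop NaN so that missing values do not land in any bin.
    pa, _ = np.histogram(a[~np.isnan(a)], bins=bins)
    pb, _ = np.histogram(b[~np.isnan(b)], bins=bins)
    return 0.5 * np.abs(pa / pa.sum() - pb / pb.sum()).sum()

rng = np.random.default_rng(7)
bins = np.arange(15, 100, 5)

# Hypothetical ground truth, plus a copy that loses older records more often.
truth = rng.normal(40, 12, size=5_000).clip(17, 90)
drop = rng.random(truth.size) < (truth - 17) / 73 * 0.7
observed = np.where(drop, np.nan, truth)

print(f"truth vs truth:    {total_variation_distance(truth, truth, bins):.3f}")
print(f"truth vs observed: {total_variation_distance(truth, observed, bins):.3f}")
```

To apply this to the tutorial data, you would pass the ground truth age column and the imputed synthetic age column; a smaller distance than the one for the incomplete original would indicate a closer match.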
[00:08:44] Now this is quite impressive because MOSTLY AI had no access to this ground truth data set but just by looking at the data and the correlations between all the different columns, the Smart Imputation feature was able to uncover the
[00:08:56] underlying distribution of the original population with impressive accuracy. From here, as a data analyst, you can now proceed to reason about and analyze the underlying original population as if there were no missing values in the first place. This is really powerful and valuable information and what's more because this is high quality synthetic data you can now share this data exploration and analysis with more and more people because you do not have to be worried about revealing private information.
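To build some intuition for why correlations with other columns can help recover a distribution that biased missingness has distorted, here is a toy sketch using stochastic regression imputation. This is a much simpler classical technique swapped in for illustration; it is not how Smart Imputation works internally. The data, the `experience` column, and the missingness rule are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical population: age plus a correlated column that is always observed.
age = rng.normal(40, 12, size=n)
experience = 0.8 * (age - 18) + rng.normal(0, 4, size=n)

# Biased missingness: the chance of a missing age rises with experience,
# so older people are under-represented among the observed ages.
p_missing = 0.66 / (1 + np.exp(-(experience - experience.mean()) / 8))
missing = rng.random(n) < p_missing

# Fit age ~ experience using only the rows where age is observed.
slope, intercept = np.polyfit(experience[~missing], age[~missing], deg=1)
residual_sd = np.std(age[~missing] - (slope * experience[~missing] + intercept))

# Impute: predicted age plus random noise, so the spread of age is preserved.
completed = age.copy()
completed[missing] = (
    slope * experience[missing]
    + intercept
    + rng.normal(0, residual_sd, size=missing.sum())
)

print(f"true mean:      {age.mean():.1f}")
print(f"observed mean:  {age[~missing].mean():.1f}")
print(f"completed mean: {completed.mean():.1f}")
```

In this toy setup the completed column's mean lands closer to the true mean than the observed-only mean does, because the correlated column carries information about the ages that went missing. The sketch relies on missingness depending on an observed column, which is an assumption of the example.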
[00:09:26] So not only have you dealt with the problem of missing values in a statistically representative and accurate way, but you've also increased the transparency of your data process. And with that we come to the end of this tutorial. In this tutorial you have learned how to deal with missing values in your data set using MOSTLY AI's Smart Imputation feature. You've seen that by synthesizing an original data set with missing values and enabling the Smart Imputation feature, you're able to uncover the original distribution of the population as if it had no missing values in the first place.
[00:10:00] This can solve a big part of the problem of working with missing values, and it also increases the transparency and shareability of your data analysis process, as the data that you're now working with is synthetic and has significantly reduced privacy concerns. If you'd like to learn more about how the Smart Imputation technique works and how it compares to other imputation techniques, I recommend checking out this blog post on our website in which we compare six different types of imputation techniques.
[00:10:29] If you work through the code in this notebook and run into any questions or things that you would like to share, please don't hesitate to reach out to us. We always love hearing from our users. Thank you so much for your time and see you in the next one!