If the original column contains missing data, you can use Smart imputation to impute them for the synthetic data.

mostly arrow Follow these steps to Smart impute your data

green 1

Click Create synthetic data to begin

step 1

green 2

Download the US Census dataset , upload it, and click Proceed

step 2 imputation

green 3

Click on the Data settings tab, find the age column, and click on the cog icon to open the column settings drawer.

data settings

green 4

Enable Smart Imputation and click Save when done.

Column settings

green 5

Click Launch job to synthesize

step 3 imputation

mostly arrow3 Results

The below univariate plot of the age column shows that Smart Imputation is able to recover the distribution as if the missing values weren’t missing. This plot shows the following four distributions:

  • Original US Census dataset with all values available.

  • Original US Census dataset with random missing values in the age column.

  • Synthetic US Census dataset of the version with missing values.

  • Synthetic US Census dataset with Smart Imputation applied on the version with missing values.

results plots

As you can see, the average age in the version with missing values is skewed in favor of younger people (36.8 years vs. 37.2 years). The synthetic version of this table has a similar result. On the other hand, the synthetic version with Smart Imputation applied was able to reconstruct the missing values in the age column and thus recover the original distribution, which has an average age of 39.

Methodology

The age column’s missing values in the example US Census dataset were artificially created. The following records were randomly set to missing to bias the non-missing values towards younger age segments:

  • 10% of all records

  • 60% of records, whose education level was either Doctorate, Prof-school or Masters

  • 60% of records, whose marital status was either Widowed or `Divorced

  • 60% of records, whose occupation level was set to Exec-managerial

It’s important to note that by doing this, the algorithm won’t be able to find any patterns or rules of where the missing values are located.