If the original column contains missing values, you can use Smart Imputation to impute them in the synthetic data.
Follow these steps to apply Smart Imputation to your data:
1. Click Create synthetic data to begin.
2. Download the US Census dataset, upload it, and click Proceed.
3. Click the Data settings tab, find the age column, and click the cog icon to open the column settings drawer.
4. Enable Smart Imputation and click Save when done.
5. Click Launch job to synthesize.
Results
The univariate plot of the age column below shows that Smart Imputation can recover the distribution as if no values were missing. The plot shows the following four distributions:
- Original US Census dataset with all values available.
- Original US Census dataset with random missing values in the age column.
- Synthetic US Census dataset of the version with missing values.
- Synthetic US Census dataset with Smart Imputation applied on the version with missing values.
As you can see, the average age in the version with missing values is skewed in favor of younger people (36.8 years vs. 37.2 years). The synthetic version of this table has a similar result. On the other hand, the synthetic version with Smart Imputation applied was able to reconstruct the missing values in the age column and thus recover the original distribution, which has an average age of 39.
Methodology
The age column’s missing values in the example US Census dataset were artificially created. The following records were randomly set to missing to bias the non-missing values towards younger age segments:
- 10% of all records
- 60% of records whose education level was either `Doctorate`, `Prof-school`, or `Masters`
- 60% of records whose marital status was either `Widowed` or `Divorced`
- 60% of records whose occupation level was set to `Exec-managerial`
It’s important to note that, because the records were set to missing at random, the algorithm won’t be able to find any exact patterns or rules for where the missing values are located.
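The procedure described in this section can be approximated with a short pandas sketch. This is an illustration only, not the script used to prepare the example dataset: the function name `inject_missing_age` and the column names `education`, `marital_status`, and `occupation` are assumptions, and each record is blanked out with an independent random draw, so the shares are approximate rather than exact.

```python
import numpy as np
import pandas as pd


def inject_missing_age(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Return a copy of df with `age` set to NaN following the four rules.

    Column names are assumed for illustration and may differ in your file.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    out["age"] = out["age"].astype("float64")  # NaN requires a float column

    # Each rule: (boolean mask selecting a group, share of that group to blank out)
    rules = [
        (pd.Series(True, index=out.index), 0.10),
        (out["education"].isin(["Doctorate", "Prof-school", "Masters"]), 0.60),
        (out["marital_status"].isin(["Widowed", "Divorced"]), 0.60),
        (out["occupation"] == "Exec-managerial", 0.60),
    ]
    for in_group, share in rules:
        # Independent uniform draw per record; only records in the group can be hit
        hit = in_group.to_numpy() & (rng.random(len(out)) < share)
        out.loc[hit, "age"] = np.nan
    return out
```

To see the resulting bias, compare `df["age"].mean()` with `inject_missing_age(df)["age"].mean()`; pandas skips missing values by default when computing the mean.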