Smart imputation

If a column in your original data has missing values, you can use Smart imputation to impute the column for the synthetic data.

In the steps below, we demonstrate Smart imputation with a US Census dataset from which we removed some values in the age column to create missing values. When you enable the feature in MOSTLY AI, Smart imputation fills in these missing values in the synthetic data.

Steps

  1. Click Create synthetic data.

  2. Download the US Census dataset, upload it, and click Proceed.

  3. Click the Data settings tab, find the age column, and click on the cog icon to open the column settings drawer.

  4. Enable Smart Imputation and click Save when done.

  5. Click Create a synthetic dataset.


Result

The univariate plot of the age column below shows that Smart imputation recovers the distribution as if no values were missing. The plot shows the following four distributions:

  • Original US Census dataset with all values available.
  • Original US Census dataset with random missing values in the age column.
  • Synthetic US Census dataset of the version with missing values.
  • Synthetic US Census dataset with Smart imputation applied to the version with missing values.

Figure: Univariate distributions of the age column for the four datasets listed above.

The average age in the version with missing values is skewed toward younger people (36.8 years vs. 37.2 years), and the synthetic version of this table shows a similar result. In contrast, the synthetic version with Smart imputation applied reconstructs the missing values in the age column and thus recovers the original distribution, which has an average age of 39.
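
The shift in the average can be illustrated with a small sketch. It uses made-up toy records rather than the US Census data, and the helper name mean_age is hypothetical. It only shows why a mean computed over the non-missing values drops when older records are more likely to lose their age. It does not show how MOSTLY AI imputes values.

```python
from statistics import mean


def mean_age(records):
    """Average age over the records whose age is present (None = missing)."""
    ages = [r["age"] for r in records if r["age"] is not None]
    return mean(ages)


# Made-up toy records, not taken from the US Census dataset.
complete = [{"age": a} for a in (25, 30, 35, 60, 65)]

# The same records after the two oldest ones lost their age value.
with_missing = [{"age": a} for a in (25, 30, 35, None, None)]

print("complete:    ", mean_age(complete))
print("with missing:", mean_age(with_missing))
```

Because the missing values are concentrated among the older records, the second average is lower than the first. Dropping or ignoring the missing values therefore understates the true average age.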

Methodology

The missing values in the age column of the example US Census dataset were created artificially. To bias the non-missing values toward younger age segments, the age was randomly set to missing for the following shares of records:

  • 10% of all records
  • 60% of records whose education level was either Doctorate, Prof-school or Masters
  • 60% of records whose marital status was either Widowed or Divorced
  • 60% of records whose occupation level was set to Exec-managerial

Note that because the values were removed at random within each group, the algorithm cannot find a deterministic pattern or rule for where the missing values are located.
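
The rules above could be reproduced roughly as follows. This is a minimal sketch, not the script that was used to prepare the example dataset. The field names (age, education, marital_status, occupation) and the function name inject_missing_age are assumptions. The source also does not say how overlapping rules combine, so the sketch assumes each applicable rule is an independent random draw and that a record loses its age if any draw succeeds.

```python
import random

# Category labels as listed above; field names are assumptions.
HIGH_EDUCATION = {"Doctorate", "Prof-school", "Masters"}
FORMERLY_MARRIED = {"Widowed", "Divorced"}


def inject_missing_age(records, seed=0):
    """Return copies of the records with age set to None according to the rules."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    result = []
    for record in records:
        row = dict(record)  # copy so the input is left untouched
        rules = [
            (True, 0.10),  # 10% of all records
            (row["education"] in HIGH_EDUCATION, 0.60),
            (row["marital_status"] in FORMERLY_MARRIED, 0.60),
            (row["occupation"] == "Exec-managerial", 0.60),
        ]
        # Assumption: every applicable rule gets its own independent draw.
        hits = [applies and rng.random() < rate for applies, rate in rules]
        if any(hits):
            row["age"] = None
        result.append(row)
    return result
```

Under this assumption, a record that matches several rules has a higher chance of losing its age than the individual percentages suggest. For example, a record with a Doctorate and no other matching attribute is set to missing with probability 1 - 0.9 × 0.4 = 0.64.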