Smart imputation
If a column in your original data has missing values, you can use Smart imputation to fill in those values in the synthetic data.
In the steps below, we demonstrate the Smart imputation feature with a US Census dataset, from which we removed some values in the age column to create missing values. When you enable the feature in MOSTLY AI, Smart imputation fills the missing values in the synthetic data.
Steps
- Click Create synthetic data.
- Download the US Census dataset, upload it, and click Proceed.
- Click the Data settings tab, find the age column, and click the cog icon to open the column settings drawer.
- Enable Smart Imputation and click Save when done.
- Click Create a synthetic dataset.
Result
The univariate plot of the age column below shows that Smart Imputation can recover the distribution as if no values were missing. The plot shows the following four distributions:
- Original US Census dataset with all values available.
- Original US Census dataset with random missing values in the age column.
- Synthetic US Census dataset generated from the version with missing values.
- Synthetic US Census dataset generated from the version with missing values, with Smart Imputation applied.

As you can see, the average age in the version with missing values is skewed in favor of younger people (36.8 years vs. 37.2 years). The synthetic version of this table has a similar result. On the other hand, the synthetic version with Smart Imputation applied was able to reconstruct the missing values in the age column and thus recover the original distribution, which has an average age of 39.
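To make the effect on the average concrete, here is a minimal sketch of how missing values among older records pull the mean down. The ages are toy numbers chosen for illustration, not values from the US Census dataset, and the helper name `mean_ignoring_missing` is hypothetical.

```python
from statistics import fmean
from typing import Optional, Sequence


def mean_ignoring_missing(values: Sequence[Optional[float]]) -> float:
    """Average the non-missing entries; None marks a missing value."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("no non-missing values to average")
    return fmean(present)


# Toy ages (hypothetical, not taken from the US Census dataset).
ages_full = [22, 30, 41, 55, 67]
# Same records, but the two oldest ages are missing.
ages_with_missing = [22, 30, 41, None, None]

print(mean_ignoring_missing(ages_full))          # mean over all five ages
print(mean_ignoring_missing(ages_with_missing))  # mean over the three remaining ages
```

Because the missing entries are concentrated among the older records, the mean of the remaining values is lower than the mean of the complete data.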
Methodology
The age column’s missing values in the example US Census dataset were artificially created. To bias the non-missing values towards younger age segments, the age was set to missing for a random selection of the following records:
- 10% of all records
- 60% of records whose education level was Doctorate, Prof-school, or Masters
- 60% of records whose marital status was Widowed or Divorced
- 60% of records whose occupation level was Exec-managerial
It’s important to note that because the affected records were selected at random, the algorithm won’t be able to find any fixed pattern or rule that determines where the missing values are located.
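The procedure above could be sketched roughly as follows. This is an illustration under stated assumptions, not the script used to prepare the example dataset: the field names (`education`, `marital_status`, `occupation`), the function name `inject_missing_ages`, and the choice to treat each rule as an independent random draw are all assumptions. Only the percentages and category values come from the description above.

```python
import random
from typing import Optional

# Category values come from the text; the field names are assumptions.
HIGH_EDUCATION = {"Doctorate", "Prof-school", "Masters"}
WIDOWED_OR_DIVORCED = {"Widowed", "Divorced"}


def inject_missing_ages(records: list[dict], seed: Optional[int] = None) -> list[dict]:
    """Return copies of `records` with `age` set to None by four random rules."""
    rng = random.Random(seed)
    result = []
    for record in records:
        # Each entry: (rule applies to this record, probability of dropping the age)
        rules = [
            (True, 0.10),
            (record["education"] in HIGH_EDUCATION, 0.60),
            (record["marital_status"] in WIDOWED_OR_DIVORCED, 0.60),
            (record["occupation"] == "Exec-managerial", 0.60),
        ]
        # Assumption: each rule is an independent draw, and any hit drops the age.
        drop = any(applies and rng.random() < p for applies, p in rules)
        copy = dict(record)  # leave the input record untouched
        if drop:
            copy["age"] = None
        result.append(copy)
    return result


# Toy record (hypothetical values), repeated to make the rate visible.
sample = [
    {"age": 52, "education": "Masters", "marital_status": "Never-married", "occupation": "Sales"}
] * 1000
masked = inject_missing_ages(sample, seed=42)
missing_share = sum(r["age"] is None for r in masked) / len(masked)
print(f"share of missing ages: {missing_share:.2f}")
```

Under these assumptions, a record that matches only the education rule loses its age with probability 1 - 0.9 × 0.4 = 0.64, so the printed share should land near that value. Since every drop is a random draw, no deterministic rule identifies which individual records end up with a missing age.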