
Verify whether your data was synthesized correctly.
Identify the root causes of potential issues you may find.


The QA report is available for synthetic data jobs that are already completed.
If you have not yet generated synthetic data, you can start with the help of our Get started guide.


This guide takes about 15 minutes to read.
You’ll get a tour of the web UI’s QA report and learn how to spot and remediate issues.

The card at the top provides general information about the completed synthetic data job.

Job type

shows whether the job is of type Ad hoc or Catalog.

Job started

shows when the job started.

Overall accuracy

shows the overall accuracy of the generated synthetic data.

Top QA report card

The card below shows a QA report for each table in the synthetic dataset.

In-app QA report


1. Select a table to load its QA report


2. Accuracy tests

The accuracy percentage shows how accurately the synthetic data represents the original data.

Univariate

The overall accuracy of the table’s univariate distributions.

Bivariate

The overall accuracy of the table’s bivariate distributions.

Coherence

(Only for linked tables) The temporal coherence of time-series data between the original and synthetic data, as well as the preservation of the average sequence length (the average number of linked-table records related to a subject-table record).


3. Privacy tests

MOSTLY AI offers empirical evidence that the privacy of the original data subjects has been preserved.

It performs three privacy tests to assert that the synthetic data is close, but not too close, to the original data:

Nearest neighbor distance ratio

This is the ratio of the first to the fifth nearest-neighbor distance of synthetic data points, measured against the target dataset. It allows you to compare inliers and outliers in the population on an equal basis.

Synthetic data points with an NNDR close to 0 are near target points in sparse data regions, i.e., outlier target data points.

The test passes if the synthetic data points are no more similar to target outliers than the target data points themselves are.

NNDR example
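
To make the metric concrete, here is a minimal Python sketch of the NNDR computation, assuming numeric, preprocessed data; MOSTLY AI's internal implementation and pass criteria may differ:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nndr(synthetic: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Ratio of the 1st to the 5th nearest-neighbor distance,
    measured from each synthetic point into the target data."""
    nn = NearestNeighbors(n_neighbors=5).fit(target)
    distances, _ = nn.kneighbors(synthetic)              # shape: (n_synthetic, 5)
    return distances[:, 0] / (distances[:, 4] + 1e-12)   # guard against /0

rng = np.random.default_rng(0)
target = rng.normal(size=(1_000, 4))
synthetic = rng.normal(size=(1_000, 4))
ratios = nndr(synthetic, target)
# Ratios close to 0 flag synthetic points that sit near isolated
# (outlier) target records.
print(f"share with NNDR < 0.1: {(ratios < 0.1).mean():.3f}")
```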


Identical match share

A measure of exact matches between synthetic and original data.

This metric counts the number of identical data points (copies) within the target data and compares it with the number of copies between the target dataset and the synthetic dataset.

The test passes if the number of copies in the synthetic dataset is less (or not significantly more) than within the target data itself.

IMS example
Figure 1. Example target data with multiple identical data points
A high number of NaN values increases the chance of false positives.
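
A minimal pandas sketch of the two counts being compared, assuming both datasets share the same columns; the significance check applied on top is omitted here:

```python
import pandas as pd

def copies_within(df: pd.DataFrame) -> float:
    """Share of rows that are exact duplicates of another row."""
    return df.duplicated(keep=False).mean()

def copies_between(synthetic: pd.DataFrame, target: pd.DataFrame) -> float:
    """Share of synthetic rows that exactly match some target row."""
    merged = synthetic.merge(target.drop_duplicates(), how="left", indicator=True)
    return (merged["_merge"] == "both").mean()

target = pd.DataFrame({"age": [25, 25, 40], "zip": ["1010", "1010", "1020"]})
synthetic = pd.DataFrame({"age": [25, 31, 40], "zip": ["1010", "1015", "1020"]})
# The test passes if the cross-dataset share is not significantly
# higher than the within-target share.
print(copies_within(target), copies_between(synthetic, target))
```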


Distance to closest record

A measure of the distances between synthetic records to their closest original records.

For each synthetic data point, this metric looks at the closest data point in the target dataset and compares that distribution of the closest distances to the observed distribution within the target data.

The test passes if, for the synthetic distribution of the closest records, low quantiles are not statistically below target data quantiles.

A threshold is defined for each quantile by the confidence interval generated via bootstrapping the difference between the target and synthetic distributions.
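
A minimal sketch of such a check, with the quantile, confidence level, and number of bootstrap resamples chosen for illustration only:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dcr(points: np.ndarray, reference: np.ndarray, exclude_self: bool = False) -> np.ndarray:
    """Distance from each point to its closest record in the reference set."""
    k = 2 if exclude_self else 1
    nn = NearestNeighbors(n_neighbors=k).fit(reference)
    distances, _ = nn.kneighbors(points)
    return distances[:, -1]

rng = np.random.default_rng(0)
target = rng.normal(size=(1_000, 4))
synthetic = rng.normal(size=(1_000, 4))

syn_dcr = dcr(synthetic, target)
tgt_dcr = dcr(target, target, exclude_self=True)  # leave-one-out baseline

q = 0.05  # assumed low quantile of interest
diffs = [
    np.quantile(rng.choice(syn_dcr, syn_dcr.size), q)
    - np.quantile(rng.choice(tgt_dcr, tgt_dcr.size), q)
    for _ in range(500)  # bootstrap the quantile difference
]
# Fail only if the whole confidence interval lies below zero, i.e. the
# synthetic low quantile is statistically below the target's.
print("pass" if np.quantile(diffs, 0.95) >= 0 else "fail")
```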


4. Dataset statistics

Learn how big the training and synthetic tables are.
Context columns refers to the number of columns in the referenced table.


5. Model and Data QA report tabs

The Model QA report and Data QA report tabs provide charts about the AI training model and the generated synthetic data, respectively.


Correlations

This tab shows three correlation matrices. They provide an easy way to assess whether the synthetic dataset retained the correlation structure of the original data set.

Both the X- and Y-axes refer to the columns in your subject table, and each cell in the matrix correlates a variable pair: the more strongly two variables are correlated, the darker the cell. The third matrix shows the difference between the target and the synthetic data.

Correlations


Technical reference

The correlations are calculated by binning all variables into a maximum of 10 groups. For categorical columns, the 10 most common categories are used; for numerical columns, the deciles are chosen as cut-offs. Then, a correlation coefficient Φκ is calculated for each variable pair. The Φκ coefficient provides a stable solution when combining numerical and categorical variables and also captures non-linear dependencies. The resulting correlation matrix is then color-coded as a heatmap, scaled between 0 and 1, to indicate the strength of variable interdependencies, once for the actual and once for the synthetic data.
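
The same Φκ coefficient is implemented in the open-source phik Python package; a minimal sketch with illustrative data (not MOSTLY AI's internal code):

```python
import pandas as pd
import phik  # pip install phik; registers .phik_matrix() on DataFrames

df = pd.DataFrame({
    "age": [23, 35, 46, 52, 31, 67, 29, 44],
    "income": [28, 52, 61, 75, 40, 80, 33, 58],
    "married": ["no", "yes", "yes", "yes", "no", "yes", "no", "yes"],
})
# interval_cols marks the continuous columns; the rest are treated as
# categorical. Values range from 0 (no association) to 1.
corr = df.phik_matrix(interval_cols=["age", "income"])
print(corr.round(2))
```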


Univariate distributions

Univariate distributions describe the probability of a variable having a particular value. You can find four types of plots in this section of the QA report: categorical, continuous, and datetime, plus a Sequence Length plot if you synthesized a linked table.

For each variable, there’s a distribution and binned plot. These show the distributions of the original and the synthetic dataset in green and black, respectively. The percentage next to the title shows how accurately the original column is represented by the synthetic column.

Univariate distributions

You may find categories that are not present in the original dataset (for example, _RARE_). These categories appear as a means of rare category protection, ensuring privacy protection of subjects with rare or unique features.

A bad fit on univariate distributions is often a sign of incorrect encoding settings.


Technical reference

All variables are binned into a maximum of 10 groups. For categorical columns, the 10 most common categories are used and for numerical and datetime columns, the deciles are chosen as cut-offs. One additional group is used to show empty values: (empty) for categorical and (n/a) for numerical and datetime columns.

In the downloadable HTML, we display up to 90 univariate charts, half of which are the most accurate and the other half the least accurate.
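
A minimal sketch of this binning logic in pandas; the exact tie-breaking and group labels here are assumptions:

```python
import pandas as pd

def bin_column(s: pd.Series, max_groups: int = 10) -> pd.Series:
    """Bin a column into at most max_groups groups plus one for empties."""
    if s.dtype == object:
        top = s.value_counts().head(max_groups).index
        binned = s.where(s.isin(top), "(other)")
        return binned.mask(s.isna(), "(empty)")
    # numerical/datetime: deciles as cut-offs; rank first to break ties
    deciles = pd.qcut(s.rank(method="first"), q=max_groups, labels=False)
    return deciles.mask(s.isna(), -1)  # -1 stands in for the (n/a) group

ages = pd.Series([23, 35, None, 52, 31, 67, 29, 44, 58, 61, 40, 70])
print(bin_column(ages).tolist())
```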


Bivariate distributions

Bivariate distributions help you understand the conditional relationship between the content of two columns in your original dataset and how it changed in the synthetic dataset.

The bivariate distribution below shows, for instance, that the age group of forty years and older is most likely to be married, and anyone below thirty is most likely to have never been married. You can see that this is the same in the synthetic dataset.

In a QA report for a linked table, you can find the plots for the context table’s columns by looking for context:[column-name]. Context here refers to either the subject table or another linked table with which this linked table was synthesized.

Bivariate distributions

You may find categories that are not present in the original dataset (for example, _RARE_). These categories appear as a means of rare category protection, ensuring privacy protection of subjects with rare or unique features.


Technical reference

All variables are binned into a maximum of 10 groups. For categorical columns, the 10 most common categories are used and for numerical and datetime columns, the deciles are chosen as cut-offs. One additional group is used to show empty values: (empty) for categorical and (n/a) for numerical and datetime columns.

In the downloadable HTML, only a selection of bivariate plots are shown, half of which are the most accurate and the other half the least accurate.


Accuracy

The accuracy of synthetic data can be assessed by measuring statistical distances between the synthetic and the original data. The metric of choice for the statistical distance is the total variation distance (TVD), calculated on the discretized empirical distributions. Subtracting the TVD from 100% then yields the reported accuracy measure. Accuracies are calculated for all univariate and all bivariate distributions. The latter is done for all pair-wise combinations within the target data, as well as between the context and the target. For sequential data, an additional coherence metric is calculated that assesses the bivariate accuracy between a column’s value and its succeeding value. All of these individual statistics are then averaged to provide a single informative quantitative measure. The full list of calculated accuracies is provided as a separate downloadable file.
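
A minimal sketch of the calculation for a single pair of already binned columns, using toy data:

```python
import pandas as pd

def accuracy(target_binned: pd.Series, synthetic_binned: pd.Series) -> float:
    """1 - TVD between two discretized empirical distributions."""
    p = target_binned.value_counts(normalize=True)
    q = synthetic_binned.value_counts(normalize=True)
    tvd = p.subtract(q, fill_value=0).abs().sum() / 2
    return 1.0 - tvd

target = pd.Series(["a", "a", "b", "b", "c"])
synthetic = pd.Series(["a", "b", "b", "c", "c"])
print(f"{accuracy(target, synthetic):.0%}")  # 80%
```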

Accuracy

Systematic differences between the target and synthetic data are visible as bright lines and blocks. This can indicate a high level of rare-category masking or an incorrect encoding configuration.


Remediating privacy and accuracy issues

Identifying the source of data quality or privacy issues can be very difficult.
Below is a list of common issues.


Accuracy

Bad univariate fit

  • High number of N/As

  • High amount of rare category labels

  • Wrong encoding type

Incorrect sequence length

  • Too high batch size on linked table

High number of business rules violations

  • Training goal is set to Speed instead of Accuracy

Privacy

  • A high amount of NaN values can make the Identical match share test fail.

  • Privacy tests can yield false positives because they are stochastic and based on sampling.

  • If accuracy is good, it makes sense to repeat the synthesization and run the privacy tests again.

Spotting potential issues


Numerical encoding of categorical values

With MOSTLY AI’s Auto-detect encoding type setting, values such as ZIP codes can be indistinguishable from continuous values. This results in the generation of invalid ZIP codes and business rules that are difficult to learn.

The solution is to change the encoding type to Categorical.

Incorrect numerical example
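
Alternatively, you can pre-process the source data so the column arrives as text and cannot be mistaken for a number. A minimal pandas sketch; the file and column names are illustrative:

```python
import pandas as pd

# Read the ZIP column as text so leading zeros survive.
df = pd.read_csv("customers.csv", dtype={"zip": str})
df["zip"] = df["zip"].str.zfill(5)  # normalize to 5 digits
df.to_csv("customers_prepared.csv", index=False)
```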


Incorrect datetime format

Incorrect formatting of a date column results in it being encoded as a categorical column. Below is an example of an incorrectly formatted, and thus incorrectly encoded, deathDate column.

Incorrect datetime example
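
A minimal pandas sketch of normalizing such a column to an unambiguous ISO format before upload; the column name and input format here are assumptions:

```python
import pandas as pd

df = pd.DataFrame({"deathDate": ["01.05.1998", "17.11.2003", None]})
df["deathDate"] = (
    pd.to_datetime(df["deathDate"], format="%d.%m.%Y", errors="coerce")
      .dt.strftime("%Y-%m-%d")  # e.g. 1998-05-01
)
print(df)
```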


Sequential correlations are lost

If the correlations between two sequential events are lost (left example) or weakened (right example) in the synthetic data, we recommend using an L model and setting the training goal to Accuracy.

Sequential correlations example