The QA report contains univariate and bivariate plots of the target data. These may contain rare categories and extreme values that can reveal personal data of outliers. Please review the QA report before sharing it.

1. Introduction

MOSTLY AI constructs fictional characters from your data subjects. These fictional characters resemble your data subjects as closely as possible: close enough to be realistic, but with enough distance to prevent reidentification.

Once this synthetic dataset is generated, MOSTLY AI will test how accurately this dataset retains the statistical properties of the original while also testing whether the privacy of individual subjects is preserved.

Synthetic data privacy and accuracy are closely related concepts. Both measure dissimilarity to the original dataset. To better understand this relationship, let’s consider a synthetic dataset that is 100% accurate to the original.

Such a score would only be possible if the machine-learning algorithm failed to learn and generalize the original data’s features properly. The resulting dataset would be, at best, a mixed-up version of the original while still retaining information on actual individuals. A 100% accurate synthetic dataset is, therefore, not at all privacy-preserving.

For that reason, MOSTLY AI has to consider the maximum achievable accuracy for a given dataset. You can estimate this ceiling by dividing the original dataset into two equally sized halves, plotting their statistical properties, and measuring the deviation between them. You would see that these two halves somewhat deviate from each other, resulting in less than 100% accuracy.
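To make this idea concrete, here is a minimal Python sketch of such a half-versus-half comparison (the file name original.csv and the column marital_status are illustrative assumptions, not part of the product):

    import pandas as pd

    # Shuffle and split the original data into two equally sized halves.
    df = pd.read_csv("original.csv")  # hypothetical input file
    half_a = df.sample(frac=0.5, random_state=42)
    half_b = df.drop(half_a.index)

    # Compare the category frequencies of one column across the two halves.
    # Although both halves stem from the same source, their empirical
    # distributions deviate slightly; that gap is the accuracy ceiling.
    freq_a = half_a["marital_status"].value_counts(normalize=True)
    freq_b = half_b["marital_status"].value_counts(normalize=True)
    deviation = freq_a.sub(freq_b, fill_value=0).abs().sum() / 2
    print(f"half-vs-half deviation: {deviation:.3%}")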

1.1. How MOSTLY AI determines whether privacy is preserved

MOSTLY AI offers empirical evidence that the privacy of the original data subjects has been preserved.

By comparing the original dataset to the synthetic version, it evaluates whether actual records from the original dataset were retained (Identical Match Share), whether the synthetic data resembles the original dataset too closely (Distance to Closest Record), and whether outliers can be reidentified (Nearest Neighbor Distance Ratio).

The section below provides further details on these analyses.

2. Privacy

2.1. Identical Match Share (IMS)

The IMS shows the share of synthetic (or holdout) data points that occur identically in the target data.

Unless the holdout data itself produces identical matches (indicating repeated records in the target data), real data points should not appear in the synthetic data.

TEST: the IMS of the synthetic data is not significantly larger than the IMS of the holdout.
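As a minimal sketch of this check in Python (toy data; an illustration of the idea, not MOSTLY AI’s internal implementation):

    import pandas as pd

    # Toy stand-ins for the target, holdout, and synthetic tables.
    target = pd.DataFrame({"age": [25, 34, 41], "zip": ["1010", "1020", "1030"]})
    holdout = pd.DataFrame({"age": [29, 34], "zip": ["1040", "1020"]})
    synthetic = pd.DataFrame({"age": [26, 35], "zip": ["1010", "1050"]})

    def identical_match_share(candidate: pd.DataFrame, target: pd.DataFrame) -> float:
        """Share of candidate rows that occur verbatim in the target data."""
        merged = candidate.merge(target.drop_duplicates(), how="left", indicator=True)
        return float((merged["_merge"] == "both").mean())

    # The synthetic IMS should not be significantly larger than the holdout IMS.
    print(identical_match_share(synthetic, target))  # 0.0 -> no verbatim copies
    print(identical_match_share(holdout, target))    # 0.5 -> one repeated record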

Below are two example snapshots from our QA report. Left: a perfect IMS. Right: data that should be revised, as the synthetic match ratio is high even when compared to the reference.

IMS good result

IMS bad result

2.2. Distance to Closest Record (DCR)

First, we calculate the distance of each synthetic data point to the closest target data point. The DCR plots show the density estimates for these distance distributions. Our distance metric takes different data types into account (numeric, categorical, date-time). As the actual distance values are hard to interpret for high-dimensional tables with mixed column types, we transform the distances to lie between 0 and 1 in all cases by cutting extremely large outlier distances and normalizing the values.

TEST: the synthetic DCR distribution is not significantly skewed towards zero compared to the holdout.
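A simplified sketch of this computation for purely numeric data (the 95th-percentile cutoff is an assumption for illustration; the actual metric also handles categorical and date-time columns):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def dcr(candidate: np.ndarray, target: np.ndarray) -> np.ndarray:
        """Distance of each candidate point to its closest target point."""
        nn = NearestNeighbors(n_neighbors=1).fit(target)
        distances, _ = nn.kneighbors(candidate)
        return distances[:, 0]

    rng = np.random.default_rng(0)
    target = rng.normal(size=(1000, 5))     # toy stand-in for the target table
    synthetic = rng.normal(size=(1000, 5))  # toy stand-in for the synthetic table
    holdout = rng.normal(size=(500, 5))     # toy stand-in for the holdout

    dcr_syn, dcr_hold = dcr(synthetic, target), dcr(holdout, target)

    # Cut extremely large outlier distances and rescale to [0, 1], mirroring
    # the normalization described above.
    cutoff = np.percentile(np.concatenate([dcr_syn, dcr_hold]), 95)
    dcr_syn = np.clip(dcr_syn, 0, cutoff) / cutoff
    dcr_hold = np.clip(dcr_hold, 0, cutoff) / cutoff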

With accurate synthetic data, we expect the holdout and synthetic plots to lie close to each other. On the left, you’ll find the DCR distribution of good synthetic data. The spike at the right is an artifact of the normalization. On the right, you’ll find the distribution of synthetic data that was generated by adding noise to the target. There would be no exact matches, but the DCR distribution is heavily shifted towards zero, indicating bad synthetic data.

DCR good result

DCR bad result

2.3. Nearest Neighbor Distance Ratio (NNDR)

The NNDR plots show the ratio of the closest to the second-closest distance of each synthetic data point, measured against the target dataset. Synthetic data points with an NNDR close to 0 lie near target points in sparse data regions, i.e., near outlier target data points.

NNDR example

In our QA report, we plot density estimates of the NNDR distributions. With accurate synthetic data, we expect the plots based on the holdout and synthetic samples to be close to each other.

TEST: the synthetic NNDR distribution is not significantly skewed towards zero compared to the holdout.
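Under the same numeric simplification as the DCR sketch above, the ratio itself can be computed like this:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def nndr(candidate: np.ndarray, target: np.ndarray) -> np.ndarray:
        """Ratio of closest to second-closest target distance per candidate point."""
        nn = NearestNeighbors(n_neighbors=2).fit(target)
        distances, _ = nn.kneighbors(candidate)
        # A ratio near 0 means the point sits right next to an isolated
        # (outlier) target record, which is a potential privacy concern.
        return distances[:, 0] / np.maximum(distances[:, 1], 1e-12)

    rng = np.random.default_rng(0)
    target = rng.normal(size=(1000, 5))     # toy stand-in for the target table
    synthetic = rng.normal(size=(1000, 5))  # toy stand-in for the synthetic table
    print(np.quantile(nndr(synthetic, target), [0.05, 0.5, 0.95]))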

NNDR good result

NNDR bad result

2.4. Remediating datasets that failed the privacy checks

If the dataset failed one or more privacy checks, then you can try the following remedial actions:

  • In your job’s general settings, if you specified the number of training subjects for your original job, try increasing this value.

  • When configuring your tables, you can:

    • switch the Optimize for setting from Accuracy to Speed.

    • reduce the number of training epochs by specifying a value lower than the number of epochs used to train the synthetization model.

  • Super Admins can reduce the number of network parameters by reducing the number of regressor, history, and context units in the global job settings.

3. Accuracy

3.1 Univariate Distributions

The univariate plots of the target data may contain extreme values that can reveal personal data of outliers. Please review them before sharing the QA report.

Univariate distributions describe the probability of a variable taking a particular value. You can find four types of plots in this section of the QA report: plots for categorical variables, plots for continuous variables, and, if you synthesized a time-series dataset, a count plot and an inter-transaction time plot.

Each plot consists of a title, an x-axis, a y-axis, and the plot itself. Continuous variables also have a box plot below the y-axis, which shows the overall distribution and skewness of the data.

Title

The title of the plot corresponds to the column name in your table. If your dataset contains multiple tables and some of the columns have identical names, then each plot name is numbered according to the order in which the tables were uploaded.

Event table columns

Plots that describe event table columns have the number 1 in the title. For event tables, MOSTLY AI plots the univariate distribution of the first event in each subject’s sequence; the number 1 refers to this first event.

X-axis

For categorical variables, the x-axis shows, in percentages, how often each category appears in the column. For continuous variables, it sets out the range of values.

Y-axis

The y-axis only exists for categorical variables and sets out the column’s available categories. For continuous variables, the box plot indicates how the values are distributed in the column.

Categorical variables

You may find categories that are not present in the original dataset (for example, *). These categories appear as a means of rare category protection, ensuring the privacy of subjects with rare or unique features.

example subject table
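Conceptually, the mechanism resembles the following sketch (the threshold value is illustrative; section 3.5 notes that thresholds below 20 may introduce privacy risks):

    import pandas as pd

    def protect_rare_categories(col: pd.Series, threshold: int = 20) -> pd.Series:
        """Replace categories occurring fewer than `threshold` times with '*'."""
        counts = col.value_counts()
        rare = counts[counts < threshold].index
        return col.where(~col.isin(rare), other="*")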

Plot

The plot shows the distributions of the original and the synthetic dataset in green and black, respectively.

Box plot

Box plots divide the column’s data into sections that each contain approximately 25% of the data in that set. They provide a visual summary that helps you identify how the data is distributed.

example subject table

For skewed distributions, zeros or missing values are shown as interruptions in the lower whisker (the black line) of the box plot.

example subject table

3.2 Correlations

The second section of the QA report shows two correlation matrices. They provide an easy way to visually assess whether the synthetic dataset retained the correlation structure of the original data set.

Both the x-axis and the y-axis refer to the columns in your subject table, and each cell in the matrix correlates a variable pair: the more strongly two variables are correlated, the darker the cell.

correlations

Technical reference

The correlations are calculated by binning all variables into a maximum of 10 groups. For categorical columns, the 10 most common categories are used; for numerical columns, the deciles are chosen as cut-offs. Then, the φK correlation coefficient is calculated for each variable pair. φK provides a stable solution when combining numerical and categorical variables and also captures non-linear dependencies. The resulting correlation matrix is color-coded as a heatmap, scaled between 0 and 1, to indicate the strength of variable interdependencies, once for the actual and once for the synthetic data.
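Outside the platform, the open-source phik package computes the same coefficient. A rough sketch (the file name is hypothetical, and the call below is not MOSTLY AI’s exact pipeline):

    import pandas as pd
    import phik  # registers the .phik_matrix() accessor on DataFrames

    df = pd.read_csv("target.csv")  # hypothetical input file
    # Compute the phi-k correlation matrix; numeric columns are binned into
    # 10 groups, mirroring the report's 10-way binning.
    corr = df.phik_matrix(bins=10)
    print(corr.round(2))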

3.3 Bivariate distributions

The bivariate plots of the target data may contain rare categories that can reveal personal data of outliers. Please review them before sharing the QA report.

The third section of the QA report shows the bivariate distributions for all variable pairs in your dataset. It helps you understand the conditional relationship between the content of two columns in your original dataset and how it changed in the synthetic dataset.

The bivariate distribution below shows, for instance, that subjects aged forty and older are most likely to be married, while anyone below thirty is most likely to have never been married. You can see that the same holds in the synthetic dataset.

Example bivariate distributions

Technical reference

For the sake of visualization, we apply the same 10-way binning as for the correlations, and the overall number of plots is limited to 12. If there are more than 12 candidate plots, the variable pairs with the highest correlations are selected.
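A rough pandas sketch of such a conditional view (file and column names are illustrative):

    import pandas as pd

    df = pd.read_csv("target.csv")  # hypothetical input file
    # Bin age into up to 10 decile-based groups, then inspect the conditional
    # distribution of marital status within each age bin.
    df["age_bin"] = pd.qcut(df["age"], q=10, duplicates="drop")
    bivariate = pd.crosstab(df["age_bin"], df["marital_status"], normalize="index")
    print(bivariate.round(2))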

3.4 Accuracy

The last page of the report presents an accuracy chart and a measure of the synthetic data’s overall accuracy.

The accuracy chart compares the synthetic and target data, showing the similarity of the univariate and bivariate distributions. The teal-colored cells represent well-kept distributions (accuracy of 90% or more), whereas light-grey colored cells show a systematic difference between synthetic and target data.

The synthetic data’s overall accuracy is based on the L1 distance (also known as Manhattan distance or Taxicab geometry) between synthetic and target data, as measured across all empirical univariate and bivariate probability distributions. The overall accuracy is then reported as a median, together with its range, across all calculated metrics.

Please note that even a true holdout sample is not expected to reach 100% due to the nature of random sampling. However, any measure in the range of 96%-99% can already be considered of very high utility.

Example accuracy plot

Technical reference

The L1 distance (L1D) for a given combination of variables is defined as the sum of the absolute differences in relative frequencies, normalized by the maximum possible difference. The accuracy values in the matrix are then simply defined as 1 - L1D, and the overall accuracy measure is the median of all accuracies. In brackets, we show the range, i.e., the min and max across all calculated accuracies.
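As a worked example for a single binned variable, assuming the normalizer is the maximum possible L1 distance between two probability distributions (which is 2):

    import numpy as np

    def l1_accuracy(p: np.ndarray, q: np.ndarray) -> float:
        """1 - L1D, with the L1 distance normalized by its maximum of 2."""
        return 1.0 - np.abs(p - q).sum() / 2.0

    # Example: target vs. synthetic relative frequencies over four bins.
    p = np.array([0.40, 0.30, 0.20, 0.10])
    q = np.array([0.38, 0.31, 0.22, 0.09])
    print(f"accuracy = {l1_accuracy(p, q):.1%}")  # accuracy = 97.0%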

3.5 Remediating columns with poor synthetic data accuracy

The QA report’s accuracy chart (see section 3.4) highlights where the synthetic data diverges from the target: teal-colored cells represent well-kept distributions (accuracy of 90% or more), whereas light-grey cells show a systematic difference between synthetic and target data.

You may improve the accuracy of poorly covered features by selecting a better-suited encoding type or refining their configuration. To do so, you can use the following points as a guide:

  • For columns with numerical values, please consider whether they represent a fixed set of possible values, such as postal codes or clothing sizes, or whether they may vary in the synthetic data, such as weight and height. Use the Categorical or Numerical encoding type, respectively.

  • Date and time values may not have been detected by MOSTLY AI. They could thus have been synthesized using the Categorical instead of the Datetime encoding type. You can verify whether this is the case by looking for rare category protection artifacts, such as the * label, in the bivariate distributions. To mitigate this issue, consider preprocessing these values so that they adhere to the supported datetime formats.

  • If the original data has a low number of subjects, consider adjusting the rare category protection accordingly. However, please be aware that setting the threshold lower than 20 may introduce privacy risks.