MOSTLY AI constructs fictional characters from your data subjects. These fictional characters resemble your data subjects as closely as possible: close enough to be realistic, but distant enough to prevent re-identification.
Once this synthetic dataset is generated, MOSTLY AI tests how accurately it retains the statistical properties of the original and whether the privacy of individual subjects is preserved.
Synthetic data privacy and accuracy are closely related concepts. Both are measures of dissimilarity to the original dataset. To better understand this relationship, let’s consider a synthetic dataset that is 100% accurate to the original.
Such a score would only be possible if the machine-learning algorithm failed to learn and generalize the original data’s features properly. The resulting dataset would be, at best, a mixed-up version of the original while still retaining information on actual individuals. A 100% accurate synthetic dataset is, therefore, not at all privacy-preserving.
For that reason, MOSTLY AI has to consider the maximum achievable accuracy for a given dataset. You can find this figure by dividing the original dataset into two equally sized datasets, plotting out their statistical properties, and measuring the variance between them. You would see that these two halves deviate somewhat from each other, resulting in less than 100% accuracy.
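The split-half idea can be tried out on a single column. The sketch below is only an illustration, not MOSTLY AI's implementation: the function names, the sample column, and the choice of a normalized L1 distance between relative frequencies as the measure of deviation are all assumptions made for this example.

```python
import random
from collections import Counter

def l1_accuracy(a, b):
    """1 minus the normalized L1 distance between the relative frequencies of two samples."""
    freq_a, freq_b = Counter(a), Counter(b)
    categories = set(freq_a) | set(freq_b)
    l1 = sum(abs(freq_a[c] / len(a) - freq_b[c] / len(b)) for c in categories)
    return 1 - l1 / 2  # assumption: 2 is the largest possible L1 distance

def split_half_accuracy(values, seed=0):
    """Shuffle a column, cut it into two equally sized halves, and compare them."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return l1_accuracy(shuffled[:half], shuffled[half:2 * half])

# Hypothetical column: marital status of 1,000 subjects.
rng = random.Random(42)
column = rng.choices(
    ["married", "single", "divorced", "widowed"],
    weights=[50, 30, 15, 5],
    k=1000,
)
baseline = split_half_accuracy(column)
print(f"split-half accuracy: {baseline:.3f}")
```

The printed value gives a rough ceiling for this column: a synthetic column scoring close to it deviates from the original about as much as one random half of the original deviates from the other.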
MOSTLY AI offers empirical evidence that the privacy of the original data subjects has been preserved.
It performs three privacy tests to verify that the synthetic data is close, but not too close, to the original data, so that the privacy of your data subjects is preserved:
Identical Match Share (IMS): a measure of exact matches between synthetic and original data.
Distance to Closest Record (DCR): a measure of the distances between synthetic records and their closest original records.
Nearest Neighbor Distance Ratio (NNDR): a calculation of the distance ratio, that is, the DCR normalized by the distances to other neighboring records. This makes it possible to compare inliers and outliers in the population on an equal basis.
This is shown as follows in the QA report:
The DCR and NNDR tests have distribution charts. The distances for the synthetic data are displayed in green, and the distances for the original data are displayed in gray. A green line that is significantly left of the gray line in the cumulative density plots implies that the generated data is too close to the actual records, and the privacy test would fail.
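To make the three measures concrete, here is a minimal Python sketch on toy numeric records. It is not MOSTLY AI's implementation: the function names are hypothetical, Euclidean distance is an assumption, and the NNDR is taken here as the distance to the closest original record divided by the distance to the second-closest one, which is one common reading of "normalized by the distances to other neighboring records".

```python
import math

def nearest_two(record, pool):
    """Return the two smallest Euclidean distances from record to the pool."""
    distances = sorted(math.dist(record, other) for other in pool)
    return distances[0], distances[1]

def privacy_metrics(synthetic, original):
    """Compute the identical match share plus per-record DCR and NNDR values."""
    identical = 0
    dcr, nndr = [], []
    for record in synthetic:
        d1, d2 = nearest_two(record, original)
        if d1 == 0:
            identical += 1  # exact copy of an original record
        dcr.append(d1)
        # Assumption: a ratio of 1.0 when both neighbors coincide with the record.
        nndr.append(d1 / d2 if d2 > 0 else 1.0)
    return identical / len(synthetic), dcr, nndr

# Toy data, invented for this example.
original = [(1.0, 1.0), (2.0, 2.0), (5.0, 5.0)]
synthetic = [(1.0, 1.0), (4.0, 5.0)]

ims, dcr, nndr = privacy_metrics(synthetic, original)
print("identical match share:", ims)
print("DCR:", dcr)
print("NNDR:", [round(r, 3) for r in nndr])
```

Running the same function on a reference sample instead of the synthetic records would give a second set of distances to compare against, analogous to the gray curve in the charts. How that reference sample is chosen is not specified here.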
If the dataset failed one or more privacy checks, then you can try the following remedial actions:
In your job’s general settings, if you specified the number of training subjects in your original job, then you can try to increase this value.
When configuring your tables, you can:
change the Optimize for setting
reduce the number of training epochs by specifying a lower number than the number of epochs used for training the synthetization model.
The accuracy chart compares the synthetic and target data, showing the similarity of the univariate and bivariate distributions. The darker the cells, the better these distributions are preserved in the synthetic data.
The synthetic data’s overall accuracy is based on the L1 distance (also known as Manhattan distance or Taxicab geometry) between synthetic and target data, as measured across all empirical univariate and bivariate probability distributions. The overall accuracy is then reported as a median, together with its range, across all calculated metrics.
Note that even a true holdout sample is not expected to reach 100% because of the nature of random sampling. Any measure in the range of 96% to 99% can already be considered of very high utility.
The L1 distance (L1D) for a particular combination of variables is defined as the sum of the absolute differences in relative frequencies, normalized by the maximum possible difference. The accuracy values in the matrix are then simply defined as 1 - L1D, and the overall accuracy measure is the median of all accuracies. The brackets show the range, i.e., the min and max, across all calculated accuracies.
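The definition above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the product's code: the function names and toy columns are hypothetical, the columns are assumed to be binned already, and the maximum possible L1 difference between two sets of relative frequencies is taken to be 2.

```python
from collections import Counter
from itertools import combinations
from statistics import median

def accuracy(target, synthetic):
    """1 - L1D, with L1D = sum of absolute frequency differences divided by 2."""
    p, q = Counter(target), Counter(synthetic)
    keys = set(p) | set(q)
    l1 = sum(abs(p[k] / len(target) - q[k] / len(synthetic)) for k in keys)
    return 1 - l1 / 2

def overall_accuracy(target, synthetic):
    """Both arguments map a column name to a list of (binned) values."""
    scores = {}
    for col in target:  # univariate distributions
        scores[(col,)] = accuracy(target[col], synthetic[col])
    for a, b in combinations(target, 2):  # bivariate distributions
        scores[(a, b)] = accuracy(
            list(zip(target[a], target[b])),
            list(zip(synthetic[a], synthetic[b])),
        )
    values = list(scores.values())
    return median(values), min(values), max(values), scores

# Toy data, invented for this example.
target = {
    "sex": ["f", "f", "m", "m"],
    "status": ["married", "single", "married", "single"],
}
synthetic = {
    "sex": ["f", "f", "f", "m"],
    "status": ["married", "single", "married", "single"],
}

med, low, high, scores = overall_accuracy(target, synthetic)
print(f"{med:.0%} ({low:.0%} - {high:.0%})")  # prints "75% (75% - 100%)"
```

In this toy case the sex column and the sex/status pair each score 75%, while the status column is reproduced exactly, so the median is 75% with a range of 75% to 100%.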
The second section of the QA report shows three correlation matrices. They provide an easy way to visually assess whether the synthetic dataset retained the correlation structure of the original dataset.
Both the X- and Y-axes refer to the columns in your subject table, and each cell in the matrix correlates a variable pair: the more two variables are correlated, the darker the cell becomes. The third matrix shows the difference between the target and the synthetic data.
The correlations are calculated by binning all variables into a maximum of 10 groups. For categorical columns, the 10 most common categories are used, and for numerical columns, the deciles are chosen as cut-offs. Then, a correlation coefficient Φκ is calculated for each variable pair. The Φκ coefficient provides a stable solution when combining numerical and categorical variables and also captures non-linear dependencies. The resulting correlation matrix is then color-coded as a heatmap to indicate the strength of variable interdependencies, once for the actual and once for the synthetic data, with scaling between 0 and 1.
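The binning step described above might look roughly like the following Python sketch. The function names are hypothetical, and two details are assumptions because the text does not specify them: less common categories are pooled under a placeholder label, and decile cut-offs are computed with the default method of statistics.quantiles. The Φκ calculation itself is not shown.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles

def bin_categorical(values, max_bins=10, other="(other)"):
    """Keep the most common categories; pool the rest (pooling is an assumption)."""
    top = {category for category, _ in Counter(values).most_common(max_bins)}
    return [v if v in top else other for v in values]

def bin_numerical(values, max_bins=10):
    """Assign each value to a bin whose edges are the deciles of the column."""
    cuts = quantiles(values, n=max_bins)  # 9 cut-offs for 10 bins
    return [bisect_right(cuts, v) for v in values]

# Toy columns, invented for this example.
ages = list(range(1, 101))
age_bins = bin_numerical(ages)

# 12 categories with frequencies 12, 11, ..., 1: "a" is the most common.
letters = [chr(97 + i) for i in range(12) for _ in range(12 - i)]
letter_bins = bin_categorical(letters)

print(Counter(age_bins))
print(Counter(letter_bins))
```

After this step every column has a small, fixed number of groups, which is what allows numerical and categorical columns to be compared in the same matrix.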
Univariate distributions describe the probability of a variable having a particular value. You can find up to four types of plots in this section of the QA report: plots for categorical variables and for continuous variables and, if you synthesized a time-series dataset, a sequence length plot and an inter-transaction time plot.
The plots show the distributions of the original and the synthetic dataset, which are shown in green and black, respectively. The percentage next to the title shows how accurately the original column is represented by the synthetic column.
You may find categories that are not present in the original dataset (for example, RARE). These categories appear as a means of rare category protection, ensuring privacy protection of subjects with rare or unique features.
The third section of the QA report shows the bivariate distributions for all variable pairs in your dataset. It helps you understand the conditional relationship between the content of two columns in your original dataset and how it changed in the synthetic dataset.
The bivariate distribution below shows, for instance, that the age group of forty years and older is most likely to be married, and anyone below thirty is most likely to have never been married. You can see that this is the same in the synthetic dataset.
For the sake of visualization, we apply the same 10-way binning as for the correlations, and the overall number of plots is limited to 12. If there are more than 12 plots, the ones with the highest correlations are selected.
The QA report’s accuracy chart compares the synthetic and target data, showing the similarity of the univariate and bivariate distributions. The teal-colored cells represent well-kept distributions (accuracy of 90% or more), whereas light-grey colored cells show a systematic difference between synthetic and target data.
You may improve the accuracy of poorly covered features by selecting a better-suited encoding type or refining their configuration. To do so, you can use the following points as a guide:
For columns with numerical values, please consider whether they represent a fixed set of possible values, such as postal codes or clothing sizes, or whether they may vary in the synthetic data, such as weight and height. Use the Categorical or Numerical encoding type, respectively.
Date and time values may not have been detected by MOSTLY AI. They could thus have been synthesized using the Categorical instead of the Datetime encoding type. You can verify whether this is the case by looking for rare category protection artifacts, such as the * label in the bivariate distributions. To mitigate this issue, consider preprocessing these values so that they adhere to the supported datetime formats.