Generative AI can mimic data so closely that, without safeguards, synthetic records end up with a near 1:1 connection to the original records. The core concept of privacy-safe synthetic data is that no such 1:1 relationship exists between the original and the synthetic data. The real data serves only as learning material during the synthesization process: only generalizable patterns, distributions, and correlations are learned. MOSTLY AI’s platform generates synthetic data from scratch based on these patterns. Because there is no 1-to-1 link between original and synthetic records, there is no direct attack surface for re-identifying sensitive information.
However, it is essential to point out that not all synthetic data is created equal. There are open-source solutions without additional privacy mechanisms in place that can still leak personal information, because the synthesization process does not guarantee privacy by itself. One common issue is outliers or extreme values, which can easily be re-identified.
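To see why extreme values are risky, consider a hypothetical salary column where one record lies far outside the rest of the distribution. The sketch below (illustrative data and threshold logic, not any vendor's actual check) shows that if a generator memorizes such a value, the matching synthetic record points directly at one individual:

```python
from collections import Counter

# Hypothetical original data: the last salary is a unique extreme value.
original = [52_000, 61_000, 58_000, 55_000, 63_000, 950_000]

# Hypothetical synthetic output from an overfitted model that has
# memorized and reproduced the outlier verbatim.
synthetic = [54_000, 60_000, 950_000, 57_000, 62_000, 59_000]

counts = Counter(original)

# Any synthetic value that exactly reproduces a value held by a single
# original record re-identifies that record.
leaked = [v for v in synthetic if counts[v] == 1]

print(leaked)  # the memorized outlier 950_000 is exposed
```

In a real evaluation this exact-match test would be replaced by distance-based metrics, but the principle is the same: a unique extreme value that survives synthesis is a re-identification risk.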
MOSTLY AI’s platform uses three mechanisms to safeguard against privacy and re-identification risks. The first mechanism ensures that our deep learning algorithm does not overfit the original data. The second mechanism is built-in privacy protection on all levels: categories that occur for only a small number of individuals are automatically suppressed, and extreme values in other data types are protected, as both could pose a privacy risk. The third mechanism is a quality assurance report produced after generation: the model and each batch of generated synthetic data are evaluated with strict privacy metrics to detect privacy risks.
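The second mechanism can be sketched in a few lines. The function names, threshold, and quantile below are illustrative assumptions, not MOSTLY AI's actual implementation; they only show the shape of rare-category suppression and extreme-value protection:

```python
from collections import Counter

RARE_THRESHOLD = 5   # categories held by fewer records are masked (assumed value)
QUANTILE_CAP = 0.95  # numeric values above this quantile are clipped (assumed value)

def protect_categories(values, threshold=RARE_THRESHOLD):
    """Replace categories that occur in fewer than `threshold` records,
    so no rare category can single out an individual."""
    counts = Counter(values)
    return [v if counts[v] >= threshold else "_RARE_" for v in values]

def protect_extremes(values, q=QUANTILE_CAP):
    """Clip numeric values above the q-quantile to the quantile itself,
    so unique extreme values never reach the trained model."""
    ordered = sorted(values)
    cap = ordered[int(q * (len(ordered) - 1))]
    return [min(v, cap) for v in values]

# A lone "astronaut" among many "nurse" records is masked:
print(protect_categories(["nurse"] * 6 + ["astronaut"]))

# A single extreme salary is pulled back into the bulk of the distribution:
print(protect_extremes([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], q=0.9))
```

Applying such transformations before training means the generative model never sees the raw outliers, so it cannot memorize them in the first place.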