Privacy-protection mechanisms

With synthetic data, you can unlock the utility of your original data and at the same time protect the privacy of your subjects. While specific cases in the original data might increase the risk of re-identification, MOSTLY AI employs a number of privacy-protection mechanisms to avoid such risks.

Generated synthetic data has no 1:1 relation to the original data

In contrast to anonymization techniques, MOSTLY AI uses your original data only as learning material to train generative AI models. During training, the models learn the patterns, distributions, correlations, and other statistical characteristics of your original data. MOSTLY AI then uses the AI models to generate synthetic data from scratch.

As a result, the synthetic data bears no 1:1 relationship with the original data and no direct surface exists for attacks to re-identify sensitive information.

Probabilistic model with random draws

image:concepts:privacy/probabalistic-model.svg[Probabilistic model,650]

MOSTLY AI uses an unsupervised machine learning process to train a deep neural network with the original data as input. The resulting AI model contains only the statistical characteristics of the original data without any personal information. To create the synthetic data, MOSTLY AI performs random draws against the AI model.

As a deliberately simplified example, consider how random draws work for a single column of the original data. Suppose the column contains the categorical variable Gender with the distinct values Male, Female, Other, and N/A.

One of the many statistical characteristics that the model learns is the distribution of the variables. For example, the original data might have a distribution of 47% females, 45% males, 5% other, and 3% N/As. When MOSTLY AI randomly draws a synthetic data point for this column, the result is Male in about 4-5 draws out of 10.

As mentioned, this example is overly simplified. During each random draw, MOSTLY AI takes into account not only the distribution of a single column, but all statistical characteristics and the relationships that exist between the columns of the original data.

Due to the probabilistic nature of this process, it is impossible to predict the result of a random draw. In effect, the process naturally introduces some noise, which also contributes to privacy preservation.
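The single-column case described above can be sketched in a few lines of Python. This is only an illustration of a weighted random draw, not MOSTLY AI's actual generation code: the name `draw_values`, the `learned_distribution` dictionary, and the shares in it are hypothetical, and a real model conditions each draw on all other columns as well.

```python
import random
from collections import Counter

# Hypothetical learned distribution for a single "Gender" column.
# The shares are illustrative; random.choices treats them as relative weights.
learned_distribution = {"Female": 47, "Male": 45, "Other": 5, "N/A": 3}


def draw_values(distribution, n, seed=None):
    """Draw n synthetic values, each category weighted by its learned share."""
    rng = random.Random(seed)
    categories = list(distribution)
    weights = list(distribution.values())
    return rng.choices(categories, weights=weights, k=n)


sample = draw_values(learned_distribution, n=10_000, seed=42)

# Share of each category among the drawn values.
shares = {category: count / len(sample) for category, count in Counter(sample).items()}
print(shares)
```

No individual draw can be predicted, but over many draws the share of each category approaches its learned weight.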

Prevention of data overfitting

In terms of privacy protection, data overfitting can mean that, during training, the AI model learns not only general patterns but also actual information that includes privacy-sensitive data.

During training, MOSTLY AI applies a mechanism to prevent the generative AI model from memorizing specific individual properties and patterns. The MOSTLY AI loss function and validation criteria are designed to achieve generalization and avoid overfitting.
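One widely used validation criterion for avoiding overfitting is early stopping: training ends once the loss on held-out validation data stops improving, and the best model seen so far is kept. The sketch below illustrates that general idea only. Whether MOSTLY AI uses exactly this rule is an assumption, and the function name `early_stopping_epoch` and the `patience` parameter are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch whose model would be kept.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs. Returns -1 for an empty history.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best model: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # Further training would only fit noise.
    return best_epoch


# Illustrative validation losses: improvement stalls after epoch 3.
val_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.65, 0.50]
kept = early_stopping_epoch(val_losses, patience=3)
print(kept)
```

In this example, training stops after three epochs without improvement, so the later epochs are never reached and the model from epoch 3 is kept.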

Value protection mechanisms

Value protection includes a number of mechanisms that protect against re-identification in cases such as rare categories, extreme values, and extreme sequence lengths.

Before MOSTLY AI trains the generative AI model with your original data, it applies the value protection mechanisms. This step is crucial to safeguard against membership inference attacks. Although the model cannot learn from a single subject or a small group of subjects, it can still generate a subject with a rare category or extreme numerical value that could be traced back to the original dataset. The mere knowledge that a customer was part of the original dataset constitutes a privacy breach.

Rare category protection

MOSTLY AI applies Rare category protection to categorical columns. This safeguard prevents the AI model from being trained on rare values. To maintain the correlation and distribution of the original data, MOSTLY AI substitutes such values with the category value _RARE_.

You can exclude such rare categories from the generated synthetic data. However, this alters the distribution in the synthetic data.

For example, consider a job title such as President of the United States. Although the AI model cannot learn from a single individual, it would still learn that there is a non-zero probability of encountering the job title President of the United States.

When MOSTLY AI replaces this value with _RARE_ before training, it ensures that the value never appears in the generated synthetic data.
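A minimal sketch of the substitution step, assuming a simple frequency threshold. The source does not state how MOSTLY AI decides that a category is rare, so the function name `protect_rare_categories` and the `min_count` parameter are hypothetical.

```python
from collections import Counter

RARE_TOKEN = "_RARE_"


def protect_rare_categories(values, min_count=5):
    """Replace every category that occurs fewer than min_count times.

    min_count is a hypothetical threshold chosen for illustration.
    """
    counts = Counter(values)
    return [value if counts[value] >= min_count else RARE_TOKEN for value in values]


job_titles = (
    ["Engineer"] * 6
    + ["Teacher"] * 5
    + ["President of the United States"]
)
protected = protect_rare_categories(job_titles, min_count=5)
print(Counter(protected))
```

Because the rare value is replaced rather than dropped, the number of records and the share of frequent categories stay the same.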

Extreme value protection

MOSTLY AI applies Extreme value protection to numerical and date-time columns.

Before training, MOSTLY AI removes extreme values from the data distribution of such columns. This mechanism ensures that the generated synthetic data does not reveal exceptional cases, such as a 130-year-old person or an entrepreneur with a net worth of 186.5 billion USD.
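One possible way to remove extremes from a numerical column is to clamp every value to the range spanned by the k-th smallest and k-th largest values. This is a sketch under that assumption, not the documented MOSTLY AI algorithm: clamping is a swapped-in technique, and the function name `protect_extreme_values` and the parameter `k` are hypothetical.

```python
def protect_extreme_values(values, k=2):
    """Clamp values to the range [k-th smallest, k-th largest].

    k is a hypothetical parameter; the k - 1 most extreme values
    at each end are pulled in to the nearest bound.
    """
    if k < 1 or len(values) < 2 * k:
        raise ValueError("need k >= 1 and at least 2 * k values")
    ordered = sorted(values)
    lower, upper = ordered[k - 1], ordered[-k]
    return [min(max(value, lower), upper) for value in values]


ages = [23, 31, 35, 38, 42, 47, 51, 130]
protected_ages = protect_extreme_values(ages, k=2)
print(protected_ages)
```

After clamping, the 130-year-old no longer stands out, and the model never sees a value outside the bounds.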

Extreme sequence length protection

A subject with an unusually large number of linked records can stand out by that sequence length alone, so long sequences can also jeopardize privacy. The Extreme sequence length protection algorithm removes the excess linked records that lead back to such a subject in a subject table.

MOSTLY AI removes these records before the training phase.
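The idea can be sketched as capping the number of linked records kept per subject. The cap value, the function name `cap_sequence_lengths`, and the record layout below are hypothetical. The source does not describe how MOSTLY AI determines which sequence lengths count as extreme.

```python
from collections import defaultdict


def cap_sequence_lengths(linked_records, subject_key, max_length):
    """Keep at most max_length linked records per subject, in original order."""
    kept = []
    seen = defaultdict(int)  # number of records kept so far per subject
    for record in linked_records:
        subject_id = record[subject_key]
        if seen[subject_id] < max_length:
            kept.append(record)
            seen[subject_id] += 1
    return kept


# Customer 1 has 3 transactions; customer 2 has an extreme 50.
transactions = (
    [{"customer_id": 1, "amount": amount} for amount in range(3)]
    + [{"customer_id": 2, "amount": amount} for amount in range(50)]
)
capped = cap_sequence_lengths(transactions, "customer_id", max_length=10)
print(len(capped))
```

Subjects with typical sequence lengths are untouched. Only the subject with the outlying number of linked records loses its excess.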

Privacy settings by default

By default, all configuration settings for data synthesis with MOSTLY AI prioritize privacy over accuracy.

Rare category protection is on

MOSTLY AI enables the option Rare category protection by default for all categorical columns.

Extreme value protection is on

MOSTLY AI enables the option Extreme value protection by default for all numerical and date-time columns.

Extreme sequence length protection is always on

MOSTLY AI detects extreme sequence lengths before training and removes them automatically. This option cannot be controlled from the user interface.

No tables are copied "as-is"

MOSTLY AI synthesizes the data of all subject and linked tables from the original dataset.

This excludes all reference tables. During training and data synthesis, MOSTLY AI takes into account the correlations that exist with the data in reference tables, but the reference table data is never copied to the destination (file or database).
