MOSTLY AI employs a number of privacy-protection mechanisms to ensure that the private data of people and other entities is protected in the generated synthetic data.
One such mechanism is Rare category protection, which is enabled by default with the Constant method for all categorical columns. The Constant method masks rare categories with a
_RARE_ value. This default behavior helps to retain the distribution of categories from your original data.
Learn how Rare category protection works in the following sections and what you can do about unwanted
_RARE_ values in your synthetic data.
Rare category protection has two methods of operation: Constant and Sample.
With the Constant method, all rare categories in a categorical column are masked with the token
_RARE_ to protect any people or entities that fall into such categories from being re-identifiable in the synthetic data.
An example of a rare category in a
Job title column can be the category
President of the United States, which makes the person behind the title instantly re-identifiable. With the Constant method, the category is masked with the _RARE_ token.
The goals of this approach are to:
- prevent the training of the AI model with rare categories
- retain the original distribution of categories in the synthetic data
The Constant method is the default method for all categorical columns and is the reason why
_RARE_ values appear in your synthetic data.
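The Constant method described above can be illustrated with a minimal sketch. This is not MOSTLY AI's implementation: the function name mask_rare_constant and the min_count threshold are assumptions made for illustration, and the source does not state how MOSTLY AI decides which categories count as rare.

```python
from collections import Counter

RARE_TOKEN = "_RARE_"


def mask_rare_constant(values, min_count=3):
    """Replace every category seen fewer than min_count times with RARE_TOKEN.

    min_count is a hypothetical threshold chosen only for this illustration.
    """
    counts = Counter(values)
    return [v if counts[v] >= min_count else RARE_TOKEN for v in values]


job_titles = (
    ["Senior Account Manager"] * 5
    + ["Software Engineer"] * 4
    + ["President of the United States"]
)

masked = mask_rare_constant(job_titles)

# The frequent categories keep their original counts; the single rare
# title is replaced by the _RARE_ token.
print(Counter(masked))
```

In this sketch the counts of the non-rare categories are unchanged, which mirrors the goal of retaining the original distribution.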
With the Sample method, MOSTLY AI replaces each rare category with a value sampled from the non-rare categories in the same column.
For example, in a
Job title column, the rare category
President of the United States can be replaced by sampling from a non-rare category, such as
Senior Account Manager.
While this method can prevent the appearance of
_RARE_ values in your synthetic data, it comes with the trade-off of skewing the original distribution in the categorical column by boosting the non-rare categories that MOSTLY AI samples from.
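The trade-off of the Sample method can be sketched in a few lines. This is an illustration under stated assumptions, not MOSTLY AI's implementation: the function name, the min_count threshold, and the choice to sample in proportion to the frequency of the non-rare categories are all hypothetical.

```python
import random
from collections import Counter


def replace_rare_by_sampling(values, min_count=3, seed=0):
    """Replace rare categories with values drawn from the non-rare ones.

    Assumptions for illustration: a category is rare if it occurs fewer
    than min_count times, and replacements are drawn in proportion to
    how often each non-rare category occurs.
    """
    rng = random.Random(seed)
    counts = Counter(values)
    frequent = [v for v in values if counts[v] >= min_count]
    if not frequent:
        raise ValueError("no non-rare categories to sample from")
    return [v if counts[v] >= min_count else rng.choice(frequent) for v in values]


job_titles = (
    ["Senior Account Manager"] * 5
    + ["Software Engineer"] * 4
    + ["President of the United States"]
)

replaced = replace_rare_by_sampling(job_titles)

# No _RARE_ token appears, but one non-rare category now has one more
# row than in the original data.
print(Counter(replaced))
```

The rare title disappears without a placeholder, and the extra row lands in one of the non-rare categories, which is the skew that the text above refers to.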
MOSTLY AI recommends that you do not disable Rare category protection for categorical columns.
If you do so, you risk exposing your synthetic data to re-identification attacks. With the protection disabled, the Generative AI models in MOSTLY AI are trained with rare categories and can generate those rare categories in the synthetic data.
For example, in a Job title column, categories such as CFO and other C-level positions typically appear only once per company. If you are processing a dataset that includes all employees at a company, the data of all C-level executives is immediately open to re-identification if Rare category protection is not enabled.
If you use the default Constant method, MOSTLY AI replaces the C-level positions with
_RARE_ values and keeps their data private. In this case, the distribution of the remaining categories is preserved in the synthetic data.
With the Sample method, MOSTLY AI randomly replaces the C-level positions with any of the other job titles in the column. This protects the data from re-identification, but at the cost of accuracy, because the replaced values inflate the counts of the remaining job title categories in the synthetic data.
Use the Rare category protection method that makes the most sense for your categorical columns. Alternatively, you can use a different encoding type or mock data for specific categorical columns, as explained in the sections below.
In some cases, a categorical column in your synthetic data might contain only a few
_RARE_ values. In other cases, however, a categorical column might contain nothing but _RARE_ values.
It all depends on the data in your categorical columns. The sections below review specific examples of categorical data and how that can impact the number of
_RARE_ values that appear in your synthetic data.
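To get a rough feel for how many values in a column might end up masked, you can count how often each category occurs before you synthesize. The sketch below assumes a hypothetical min_count threshold; the source does not state the rule MOSTLY AI applies, so treat the result as an indication only.

```python
from collections import Counter


def rare_share(values, min_count=3):
    """Return the fraction of rows whose category occurs fewer than min_count times.

    min_count is a hypothetical threshold used only for this illustration.
    """
    if not values:
        return 0.0
    counts = Counter(values)
    rare_rows = sum(1 for v in values if counts[v] < min_count)
    return rare_rows / len(values)


# A column with a few frequent categories and one rare one.
job_titles = ["Analyst"] * 6 + ["Engineer"] * 3 + ["CFO"]

# A column in which every value is unique, such as an ID column.
customer_ids = [f"C{i:04d}" for i in range(10)]

print(rare_share(job_titles))
print(rare_share(customer_ids))
```

A share close to 1 suggests that the column behaves like an identifier or free text, in which case a different encoding type or mock data is likely a better fit than the Categorical encoding type.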
Columns that contain first or last names of your data subjects are auto-assigned the Categorical encoding type and, by default, Rare category protection with the Constant method is enabled for such columns.
Such columns contain many distinct names, which MOSTLY AI treats as rare categories. As a result, most or all values in such columns are _RARE_.
For such cases, you can use one of the three alternatives suggested below.
You can generate names with the Mock data generation method. To do so, you need to select the Person mock data type and the applicable name format:
- Full name
- First name
- Last name
Bear in mind that with Mock data for person names, MOSTLY AI generates English names.
You can use the Text encoding type to train the Generative AI models on the names in your original data, and then generate synthetic names.
Bear in mind that the Text encoding type might increase the time needed to train the Generative AI models, because the names in the original dataset are tokenized and the character sequence of each token is analyzed.
Another approach is to pre-process your original data and exclude columns with names before you go on to generate synthetic data.
This approach adds an extra step to the process of synthesizing data, and MOSTLY AI recommends it only if you are familiar with the pre-processing of data.
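As a minimal sketch of such a pre-processing step, the example below drops name columns from CSV data with the Python standard library. The column names first_name and last_name and the sample rows are hypothetical; adapt them to your own dataset.

```python
import csv
import io

# Hypothetical names of the columns to exclude before synthesis.
NAME_COLUMNS = {"first_name", "last_name"}


def drop_columns(rows, columns_to_drop):
    """Return a copy of each row without the listed columns."""
    return [
        {key: value for key, value in row.items() if key not in columns_to_drop}
        for row in rows
    ]


# In practice this would be a file opened with open(path, newline="").
raw_csv = io.StringIO(
    "first_name,last_name,job_title\n"
    "Ada,Example,Analyst\n"
    "Sam,Sample,Engineer\n"
)

rows = list(csv.DictReader(raw_csv))
cleaned = drop_columns(rows, NAME_COLUMNS)
print(cleaned)
```

The cleaned rows can then be written back with csv.DictWriter and used as the original data for synthesis.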
If a column contains alphanumeric ID codes, MOSTLY AI auto-assigns the Categorical encoding type and enables Rare category protection with the Constant method.
Because each ID is unique in the column, the synthetic data for the column contains only _RARE_ values.
If the original column contains short strings with a consistent pattern (phone numbers, license plate numbers, company IDs, and so on), you can use the Character sequence encoding type to train the Generative AI model for this table with the code patterns.
If the original column is a primary key column, you can use the Primary key generation method to generate primary key values. You can specify the type of primary key that you want to generate from the Generation format drop-down menu.
- Sequential (1, 2, 3, and so on)
- UUID
- UUID dashless
- UUID short
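For orientation, the sketch below shows what sequential, UUID, and dashless UUID keys typically look like when generated with the Python standard library. It only illustrates the formats by name and is not how MOSTLY AI generates keys; the function names are hypothetical, and the UUID short format is left out because the source does not describe it.

```python
import itertools
import uuid


def sequential_keys(n, start=1):
    """Return n consecutive integers beginning at start."""
    return list(itertools.islice(itertools.count(start), n))


def uuid_keys(n):
    """Return n random UUIDs in the canonical 36-character form with dashes."""
    return [str(uuid.uuid4()) for _ in range(n)]


def uuid_dashless_keys(n):
    """Return n random UUIDs as 32 hexadecimal characters without dashes."""
    return [uuid.uuid4().hex for _ in range(n)]


print(sequential_keys(3))
print(uuid_keys(1))
print(uuid_dashless_keys(1))
```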
If a column contains emails, MOSTLY AI again auto-assigns the Categorical encoding type and enables Rare category protection with the Constant method.
If each row in the column contains a unique email, all emails will appear as
_RARE_ values in the synthetic data.
To avoid _RARE_ values in email fields, you can use the Email mock data type.
MOSTLY AI generates mock emails and uses only popular email providers for the email domain.
Lookup tables include information about a finite number of entities. Examples of such entities can be
music_genres, and so on.
The sections below include best practices on how to handle lookup tables in your synthetic datasets.
If you can, avoid adding lookup tables to your synthetic datasets. Lookup tables can increase the training and generation times and their synthesis is not relevant to the privacy-protection of data subjects.
If you add lookup tables to your synthetic dataset in MOSTLY AI, MOSTLY AI will generate all of their categories as _RARE_ values, because all values in such tables are unique.
As a best practice, copy your lookup tables to the destination database.
If you synthesize database tables, you need to consider how Rare category protection can impact the referential integrity of your destination database.
To illustrate the scenarios below, imagine that you have a
customer subject table and a
country lookup table. A foreign key relationship exists between the two tables via the
country_id column in the customer table.
As a best practice, keep the lookup table in the destination database and only synthesize the
customer table. A side effect of this best practice is that you can no longer use the Foreign key generation method for the
country_id column and MOSTLY AI auto-assigns the Categorical encoding type to the column.
MOSTLY AI auto-assigns the encoding type Categorical to the foreign key column country_id in the customer table. With the Constant method, Rare category protection generates _RARE_ values in that column. Because the country lookup table is unlikely to have the entry _RARE_, this will most likely break the referential integrity.
In this case, to preserve the referential integrity, you could add a _RARE_ record to your country lookup table.
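The following sketch shows one way to check for broken references and add a _RARE_ record to a lookup table. The tables are represented as in-memory lists of dictionaries, and all rows, the name column, and the helper orphaned_keys are hypothetical; in a real database you would run the equivalent check and insert with SQL.

```python
RARE_TOKEN = "_RARE_"

# Hypothetical lookup table that stays unchanged in the destination database.
country_lookup = [
    {"country_id": "AT", "name": "Austria"},
    {"country_id": "DE", "name": "Germany"},
]

# Hypothetical synthetic customer rows; one country_id was masked.
synthetic_customers = [
    {"customer_id": 1, "country_id": "AT"},
    {"customer_id": 2, "country_id": RARE_TOKEN},
    {"customer_id": 3, "country_id": "DE"},
]


def orphaned_keys(child_rows, parent_rows, key):
    """Return key values used by child rows that are missing from the parent rows."""
    parent_keys = {row[key] for row in parent_rows}
    return {row[key] for row in child_rows} - parent_keys


missing_before = orphaned_keys(synthetic_customers, country_lookup, "country_id")

# Add a placeholder record so that masked values resolve to a valid parent row.
if RARE_TOKEN in missing_before:
    country_lookup.append({"country_id": RARE_TOKEN, "name": "Rare (masked)"})

missing_after = orphaned_keys(synthetic_customers, country_lookup, "country_id")
print(missing_before, missing_after)
```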
Rare category protection with Sample method
Before you synthesize, you can switch the Rare category protection method to Sample. In this case, MOSTLY AI replaces any rare categories by sampling values from non-rare categories.
With the Sample method, you can maintain the referential integrity, but this comes at the cost of skewed distributions and lower accuracy.
Numeric:Discrete encoding type for the country_id column
You can switch the country_id column from the Categorical to the Numeric:Discrete encoding type. With this encoding type, MOSTLY AI treats the numeric or alphanumeric values in the column as categories.
Rare category protection does not impact columns with Numeric:Discrete encoding type.
This way, you can preserve the referential integrity with the country lookup table.