_RARE_ values

MOSTLY AI protects the privacy of your data with several mechanisms that ensure that no private data about people or other entities appears in your generated synthetic data.

Value protection is on by default for each generator you train. One of the mechanisms included in Value protection is Rare category protection, which is also on by default and has two modes of operation. Learn about each mode and how it impacts the appearance of _RARE_ values in your synthetic datasets.

Methods of rare category protection

You can configure Rare category protection to replace rare categories in categorical columns with either the Constant or the Sample replacement method.

Constant method

With the Constant method, all rare categories in a categorical column are masked with the token _RARE_ to protect any people or entities from being re-identifiable in the synthetic data. The replacement occurs automatically before your generator starts training.

An example of a rare category in a Job title column is President of the United States, which makes the person behind the title instantly re-identifiable. With the Constant method, the category is masked with the _RARE_ token before model training.

The goals of this approach are to:

  • prevent the training of the AI model with rare categories
  • retain the original distribution of categories in the synthetic data

The Constant method is the default method for all categorical columns and is the reason why _RARE_ values appear in your synthetic data.
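
The masking step can be pictured with a minimal Python sketch. The function name `mask_rare_constant` and the `min_count` threshold are hypothetical and chosen only for illustration. The rule that MOSTLY AI actually applies to decide which categories are rare is not described on this page, so treat the threshold as an assumption.

```python
from collections import Counter

RARE_TOKEN = "_RARE_"


def mask_rare_constant(values, min_count=5):
    """Replace every category seen fewer than min_count times with _RARE_.

    min_count is an illustrative assumption, not MOSTLY AI's actual rule.
    """
    counts = Counter(values)
    return [v if counts[v] >= min_count else RARE_TOKEN for v in values]


job_titles = (
    ["Senior Account Manager"] * 6
    + ["Engineer"] * 5
    + ["President of the United States"]
)
masked = mask_rare_constant(job_titles)
print(Counter(masked))
```

In this sketch, the counts of the non-rare categories are unchanged, which mirrors the goal of retaining the original distribution.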

Sample method

With the Sample method, MOSTLY AI replaces any rare categories by sampling non-rare categories from the same column.

For example, in a Job title column, the rare category President of the United States can be replaced by sampling from a non-rare category, such as Senior Account Manager.

With this method, you can prevent _RARE_ values, but it comes with the trade-off of skewing the original distribution in the categorical column by boosting any non-rare categories that MOSTLY AI samples from.
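
The trade-off can be illustrated with a short Python sketch. The function name `replace_rare_by_sampling`, the `min_count` threshold, and the choice to sample in proportion to the observed frequencies are all assumptions made for this example. They do not describe how MOSTLY AI implements the Sample method internally.

```python
import random
from collections import Counter


def replace_rare_by_sampling(values, min_count=5, seed=0):
    """Replace rare categories with values drawn from the non-rare ones.

    min_count and frequency-proportional sampling are illustrative assumptions.
    """
    counts = Counter(values)
    # Pool of non-rare values, repeated as often as they occur in the column.
    pool = [v for v in values if counts[v] >= min_count]
    if not pool:
        raise ValueError("no non-rare categories to sample from")
    rng = random.Random(seed)
    return [v if counts[v] >= min_count else rng.choice(pool) for v in values]


job_titles = (
    ["Senior Account Manager"] * 6
    + ["Engineer"] * 5
    + ["President of the United States"]
)
sampled = replace_rare_by_sampling(job_titles)
print(Counter(sampled))
```

No _RARE_ token appears in the result, but one of the non-rare categories now has a higher count than in the original column.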

Figure: Rare category protection modes (Constant and Sample)

Do not disable Value protection

MOSTLY AI recommends that you do not disable Value protection for categorical columns.

If you do so, the Generative AI models in MOSTLY AI are trained with rare categories and can generate those rare categories in the synthetic data. This exposes your synthetic data to re-identification attacks.

For example, in a Job title column, the categories CEO, CTO, CFO, and other C-level positions typically appear only once per company. If you are processing a dataset that includes all employees at a company, the data of all C-level executives is immediately open to re-identification if Value protection is not enabled.

If you use the default Constant method, MOSTLY AI replaces the C-level positions with _RARE_ values and keeps their data private. In this case, the distribution of the remaining categories is preserved in the synthetic data.

With the Sample method, MOSTLY AI randomly replaces the C-level positions with any of the other job titles in the column. This protects the data from re-identification, but at the cost of accuracy, because the replaced values inflate the counts of the remaining job title categories in the synthetic data.

Use the Rare category replacement method that makes the most sense for your categorical columns. Alternatively, you can use a different encoding type for specific categorical columns, as explained in the sections below.

Cases with many _RARE_ values

In some cases, a categorical column in your synthetic data might contain only a few _RARE_ values. In other cases, however, a categorical column might contain nothing but _RARE_ values.

It all depends on the data in your categorical columns. The sections below review specific examples of categorical data and how that can impact the number of _RARE_ values that appear in your synthetic data.

Columns for names

Columns that contain first or last names of your data subjects are auto-assigned the Categorical encoding type and, by default, Value protection with the Constant replacement method of rare categories is enabled for such columns.

Such columns contain many distinct names which MOSTLY AI treats as rare categories. As a result, most or all values in such columns are _RARE_.

For such cases, you can use one of the alternatives suggested below.

Use the Text encoding type

You can use the Text encoding type to train the Generative AI models on the names in your original data, and then generate synthetic names.

Bear in mind that the Text encoding type might increase the time needed to train the Generative AI models, because the names in the original dataset are tokenized and the character sequence of each token is analyzed.

Pre-process your data to exclude names

Another approach is to pre-process your original data and exclude columns with names before you go on to generate synthetic data.

This approach adds a prerequisite to the process of synthesizing data, and MOSTLY AI recommends it only if you are familiar with how to pre-process your data.
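
As a rough sketch of such a pre-processing step, the following Python example drops name columns from CSV data before upload. The helper `drop_columns` and the column names `first_name`, `last_name`, and `job_title` are hypothetical. Adapt them to your own dataset and tooling.

```python
import csv
import io


def drop_columns(csv_text, columns_to_drop):
    """Return the CSV text without the listed columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [name for name in reader.fieldnames if name not in columns_to_drop]
    out = io.StringIO()
    # extrasaction="ignore" silently skips the dropped columns in each row.
    writer = csv.DictWriter(
        out, fieldnames=kept, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


original = (
    "first_name,last_name,job_title\n"
    "Ada,Lovelace,Engineer\n"
    "Alan,Turing,Engineer\n"
)
cleaned = drop_columns(original, {"first_name", "last_name"})
print(cleaned)
```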

Columns for codes

If a column contains alphanumeric ID codes, MOSTLY AI auto-assigns the Categorical encoding type and enables the Constant replacement method.

Because each ID is unique in the column, the synthetic data for the column contains only _RARE_ values.

Use Character encoding type

If the original column contains short strings with a consistent pattern (phone numbers, license plate numbers, company IDs, and so on), you can use the Character encoding type to train the Generative AI model for this table with the code patterns.
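
To judge whether a column follows a consistent pattern, you can inspect it before you configure the generator. The Python sketch below is a heuristic for your own analysis and is not a MOSTLY AI feature. The helpers `shape_of` and `dominant_shape_share` and the sample license plates are hypothetical.

```python
from collections import Counter


def shape_of(value):
    """Map digits to 9 and letters to A; keep any other character as is."""
    return "".join(
        "9" if c.isdigit() else "A" if c.isalpha() else c for c in value
    )


def dominant_shape_share(values):
    """Return the most common shape and the fraction of values that match it."""
    shapes = Counter(shape_of(v) for v in values)
    shape, count = shapes.most_common(1)[0]
    return shape, count / len(values)


plates = ["AB-1234", "CD-5678", "EF-9012", "GH3456"]
shape, share = dominant_shape_share(plates)
print(shape, share)
```

A share close to 1.0 suggests that the values follow one pattern. A low share suggests mixed formats that you might want to review first.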

Figure: Columns with codes (use the Character encoding type)

Set a primary key

If the original column is a primary key column, you can set a Primary key to generate primary key values. MOSTLY AI generates integer or UUID primary keys in the column depending on its primary key format.

Figure: Set primary key

Pre-process your data to exclude primary keys

If you do not intend to set a primary key on a table or point any foreign keys to it, you can pre-process the table to exclude the primary key column. Alternatively, deselect the Include checkbox on the Data configuration page of a generator.

Columns for emails

If a column contains emails, MOSTLY AI again auto-assigns the Categorical encoding type and enables Rare category protection with the Constant method.

If each row in the column contains a unique email, all emails will appear as _RARE_ values in the synthetic data.

Lookup tables

Lookup tables include information about a finite number of entities. Examples of such entities are countries, cities, phone codes, music genres, and so on.

The sections below include best practices on how to handle lookup tables in your synthetic datasets.

When lookup tables contain only _RARE_ values

If you can, avoid adding lookup tables to your synthetic datasets. Lookup tables can increase the training and generation times and their synthesis is not relevant to the privacy-protection of data subjects.

If you add lookup tables to your synthetic dataset in MOSTLY AI, all of their categories are generated as _RARE_, because every value in such tables is unique.

Copy lookup tables to your destination database

As a best practice, copy your lookup tables to the destination database.

Referential integrity with lookup tables

If you synthesize database tables, you might need to consider how Value protection for rare categories impacts the referential integrity of your destination database.

To illustrate the scenarios below, imagine that you have a customer subject table and a country lookup table. A foreign key relationship exists between the two tables via the country_id column in the customer table.

As a best practice, keep the lookup table in the destination database and only synthesize the customer table. A side effect of this best practice is that you can no longer set a foreign key for the country_id column and MOSTLY AI auto-assigns the Categorical encoding type to the column.

  • Default scenario

    MOSTLY AI auto-assigns the encoding type Categorical to the foreign key column country_id in the customer table. With the Constant replacement method, Value protection generates _RARE_ values.

    Because a country lookup table is unlikely to have the entry _RARE_, this will break the referential integrity.

    In this case, to preserve the referential integrity, you could add a _RARE_ record to your country lookup table.

  • Rare category protection with Sample method

    Before you synthesize, you can switch the Rare category replacement method to Sample. In this case, MOSTLY AI replaces any rare categories by sampling values from non-rare categories.

    With the Sample method, you can maintain the referential integrity, but this comes at the cost of skewed distributions and lower accuracy.

  • Numeric:Discrete column type for the country_id column

    You can switch the country_id column from the Categorical to the Numeric:Discrete column type. With this column type, MOSTLY AI treats the numeric or alphanumeric values in the column as categories.

    Rare category protection does not impact columns with Numeric:Discrete column type.

    This way, you can preserve the referential integrity with the country table.
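
The default scenario can be reproduced with a small, self-contained Python sketch that uses an in-memory SQLite database as a stand-in for the destination database. The table layout, the sample rows, and the helper `load_customers` are assumptions made for illustration. Only the customer and country tables and the country_id column come from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.execute("CREATE TABLE country (country_id TEXT PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE customer ("
    "customer_id INTEGER PRIMARY KEY, "
    "country_id TEXT REFERENCES country(country_id))"
)
conn.executemany(
    "INSERT INTO country VALUES (?, ?)",
    [("AT", "Austria"), ("US", "United States")],
)
conn.commit()

# Hypothetical synthetic rows; one country_id was masked by the Constant method.
synthetic_customers = [(1, "AT"), (2, "US"), (3, "_RARE_")]


def load_customers(rows):
    """Insert all rows in one transaction; return True if every key resolves."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.executemany("INSERT INTO customer VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False


before = load_customers(synthetic_customers)

# Add a _RARE_ record to the lookup table, then retry the load.
conn.execute("INSERT INTO country VALUES (?, ?)", ("_RARE_", "Masked rare country"))
conn.commit()
after = load_customers(synthetic_customers)
print(before, after)
```

The first load fails because _RARE_ has no matching row in the country table. After the _RARE_ record is added to the lookup table, the same rows load without a foreign key violation.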