_RARE_ values

MOSTLY AI employs a number of privacy-protection mechanisms to ensure that the private data of people and other entities is protected in the generated synthetic data.

One such mechanism is Rare category protection. It is enabled by default, with the Constant method, for all categorical columns. The Constant method masks rare categories with a _RARE_ value. This default behavior helps to retain the distribution of categories from your original data.

Learn how Rare category protection works in the following sections and what you can do about unwanted _RARE_ values in your synthetic data.

Methods of rare category protection

Rare category protection has two methods of operation: Constant and Sample.

Constant method

With the Constant method, all rare categories in a categorical column are masked with the token _RARE_ to protect any people or entities that fall into such categories from being re-identifiable in the synthetic data.

An example of a rare category in a Job title column is President of the United States, which makes the person behind the title instantly re-identifiable. With the Constant method, the category is masked with the _RARE_ token.

The goals of this approach are to:

  • prevent the training of the AI model with rare categories
  • retain the original distribution of categories in the synthetic data

The Constant method is the default method for all categorical columns and is the reason why _RARE_ values appear in your synthetic data.

Sample method

With the Sample method, MOSTLY AI replaces any rare categories by sampling values from other non-rare categories.

For example, in a Job title column, the rare category President of the United States can be replaced by sampling from a non-rare category, such as Senior Account Manager.

While this method can prevent the appearance of _RARE_ values in your synthetic data, it comes with the trade-off of skewing the original distribution in the categorical column by boosting the non-rare categories that MOSTLY AI samples from.
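The difference between the two methods can be illustrated with a minimal Python sketch. It is not MOSTLY AI's implementation: the source does not state how a category is judged to be rare, so the `min_count` threshold, the function names, and the sample column are all assumptions made for illustration.

```python
import random
from collections import Counter

RARE_TOKEN = "_RARE_"


def find_rare(values, min_count):
    """Return the categories that occur fewer than min_count times.

    min_count is a hypothetical threshold chosen for this sketch.
    """
    counts = Counter(values)
    return {value for value, count in counts.items() if count < min_count}


def mask_constant(values, min_count=3):
    """Constant-style masking: every rare category becomes the _RARE_ token."""
    rare = find_rare(values, min_count)
    return [RARE_TOKEN if value in rare else value for value in values]


def mask_sample(values, min_count=3, seed=0):
    """Sample-style masking: every rare category is replaced by a value
    drawn from the non-rare values in the same column."""
    rare = find_rare(values, min_count)
    pool = [value for value in values if value not in rare]
    if not pool:
        raise ValueError("no non-rare categories to sample from")
    rng = random.Random(seed)
    return [rng.choice(pool) if value in rare else value for value in values]


job_titles = (
    ["Senior Account Manager"] * 5
    + ["Software Engineer"] * 4
    + ["President of the United States"]
)

constant = mask_constant(job_titles)
sampled = mask_sample(job_titles)

print(Counter(constant))
print(Counter(sampled))
```

In the Constant result, the counts of the non-rare job titles are unchanged and the rare title shows up as `_RARE_`. In the Sample result, no `_RARE_` value appears, but one of the non-rare titles gains an extra row, which is the skew described above.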

Rare category protection - Constant

Disabling Rare category protection

MOSTLY AI recommends that you do not disable Rare category protection for categorical columns.

If you do so, the Generative AI models in MOSTLY AI are trained with rare categories and can generate those rare categories in the synthetic data. This exposes your synthetic data to re-identification attacks.

For example, in a Job title column, the categories CEO, CTO, CFO, and other C-level positions typically appear only once per company. If you are processing a dataset that includes all employees at a company, the data of all C-level executives is immediately open to re-identification if Rare category protection is not enabled.

If you use the default Constant method, MOSTLY AI replaces the C-level positions with _RARE_ values and keeps their data private. In this case, the distribution of the remaining categories is preserved in the synthetic data.

With the Sample method, MOSTLY AI randomly replaces the C-level positions with any of the other job titles in the column. This protects the data from re-identification, but at the cost of accuracy, because the replaced rows are redistributed among the remaining job title categories in the synthetic data.

Use the Rare category protection method that makes the most sense for your categorical columns. Alternatively, you can use a different encoding type or mock data for specific categorical columns, as explained in the sections below.

Cases with many _RARE_ values

In some cases, a categorical column in your synthetic data might contain only a few _RARE_ values. In other cases, however, a categorical column might contain nothing but _RARE_ values.

It all depends on the data in your categorical columns. The sections below review specific examples of categorical data and how that can impact the number of _RARE_ values that appear in your synthetic data.

Columns for names

Columns that contain first or last names of your data subjects are auto-assigned the Categorical encoding type and, by default, Rare category protection with the Constant method is enabled for such columns.

Such columns contain many distinct names which MOSTLY AI treats as rare categories. As a result, most or all values in such columns are _RARE_ values.

For such cases, you can use one of the three alternatives suggested below.

Use Mock data for names

You can generate names with the Mock data generation method. To do so, you need to select the Person mock data type and the applicable name format:

  • Full name
  • First name
  • Last name
Mock data - Person - Last name

Bear in mind that with Mock data for person names, MOSTLY AI generates English names.

Use the Text encoding type

You can use the Text encoding type to train the Generative AI models on the names in your original data, and then generate synthetic names.

Bear in mind that the Text encoding type might increase the time needed to train the Generative AI models, because the names in the original dataset are tokenized and the character sequence of each token is analyzed.

Pre-process your data to exclude names

Another approach is to pre-process your original data and exclude columns with names before you go on to generate synthetic data.

This approach adds an additional step to the process of synthesizing data, and MOSTLY AI recommends it only if you are familiar with the pre-processing of data.
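As a rough illustration of such a pre-processing step, the sketch below removes name columns from CSV data with Python's standard `csv` module before the data is uploaded. The column names `first_name` and `last_name`, the helper `drop_columns`, and the sample rows are hypothetical.

```python
import csv
import io


def drop_columns(src, dst, columns_to_drop):
    """Copy CSV data from src to dst, leaving out the given columns.

    Returns the list of column names that were kept.
    """
    reader = csv.DictReader(src)
    kept = [name for name in reader.fieldnames if name not in columns_to_drop]
    # extrasaction="ignore" silently skips the dropped columns in each row.
    writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return kept


# In-memory stand-ins for the original file and the cleaned output file.
original = io.StringIO(
    "first_name,last_name,job_title\n"
    "Ada,Lovelace,Analyst\n"
    "Alan,Turing,Engineer\n"
)
cleaned = io.StringIO()

kept_columns = drop_columns(original, cleaned, {"first_name", "last_name"})
print(kept_columns)
print(cleaned.getvalue())
```

With real files, `original` and `cleaned` would be file objects opened with `newline=""`, as the `csv` module documentation recommends.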

Columns for codes

If a column contains alphanumeric ID codes, MOSTLY AI auto-assigns the Categorical encoding type and enables Rare category protection with the Constant method.

Because each ID is unique in the column, the synthetic data for the column contains only _RARE_ values.
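Before you synthesize, you can estimate which columns are likely to end up this way by checking how many of their values are distinct. The following sketch is only a heuristic: the 0.9 cut-off, the helper `distinct_ratio`, and the sample columns are assumptions, and MOSTLY AI's own criteria for rare categories may differ.

```python
def distinct_ratio(values):
    """Share of distinct values among the non-empty values of a column."""
    non_empty = [value for value in values if value not in (None, "")]
    if not non_empty:
        return 0.0
    return len(set(non_empty)) / len(non_empty)


# Hypothetical sample columns.
columns = {
    "customer_code": ["A1X9", "B7K2", "C3M4", "D8P0"],
    "country": ["AT", "AT", "DE", "AT"],
}

# Columns where (almost) every value is unique are candidates for a
# different encoding type or generation method.
flagged = [
    name for name, values in columns.items() if distinct_ratio(values) >= 0.9
]
print(flagged)
```

A column flagged by this check is a candidate for one of the alternatives described in the following sections.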

Use Character sequence encoding type

If the original column contains short strings with a consistent pattern (phone numbers, license plate numbers, company IDs, and so on), you can use the Character sequence encoding type to train the Generative AI model for this table with the code patterns.

Use Primary key generation method

If the original column is a primary key column, you can use the Primary key generation method to generate primary key values. You can specify the type of primary key that you want to generate from the Generation format drop-down menu.

  • Sequential (1, 2, 3, and so on)
  • UUID (550e8400-e29b-41d4-a716-446655440000)
  • UUID dashless (550e8400e29b41d4a716446655440000)
  • UUID short (xhvVdrZD4iP5T8vBxuxm76)
Generation method - Primary key

Columns for emails

If a column contains emails, MOSTLY AI again auto-assigns the Categorical encoding type and enables Rare category protection with the Constant method.

If each row in the column contains a unique email, all emails will appear as _RARE_ values in the synthetic data.

Use Email mock data

To avoid _RARE_ values in email fields, you can use the Email mock data type.

Mock data - Email

MOSTLY AI generates mock emails and uses only popular email providers for the email domain.

Lookup tables

Lookup tables include information about a finite number of entities. Examples of such entities can be countries, cities, phone codes, music genres, and so on.

The sections below include best practices on how to handle lookup tables in your synthetic datasets.

When lookup tables contain only _RARE_ values

If you can, avoid adding lookup tables to your synthetic datasets. Lookup tables can increase the training and generation times and their synthesis is not relevant to the privacy-protection of data subjects.

If you add lookup tables to your synthetic dataset in MOSTLY AI, all values in such tables are unique, so MOSTLY AI generates all of their categories as _RARE_.

Copy lookup tables to your destination database

As a best practice, copy your lookup tables to the destination database.

Referential integrity with lookup tables

If you synthesize database tables, you need to consider how Rare category protection can impact the referential integrity of your destination database.

To illustrate the scenarios below, imagine that you have a customer subject table and a country lookup table. A foreign key relationship exists between the two tables via the country_id column in the customer table.

As a best practice, keep the lookup table in the destination database and only synthesize the customer table. A side effect of this best practice is that you can no longer use the Foreign key generation method for the country_id column and MOSTLY AI auto-assigns the Categorical encoding type to the column.

  • Default scenario

    MOSTLY AI auto-assigns the encoding type Categorical to the foreign key column country_id in the customer table. With the Constant method, Rare category protection generates _RARE_ values.

    Because a country lookup table is unlikely to have the entry _RARE_, this will most likely break the referential integrity.


    In this case, to preserve the referential integrity, you could add a _RARE_ record to your country lookup table.

  • Rare category protection with Sample method

    Before you synthesize, you can switch the Rare category protection method to Sample. In this case, MOSTLY AI replaces any rare categories by sampling values from non-rare categories.


    With the Sample method, you can maintain the referential integrity but this comes at the cost of skewed distributions and lower accuracy.

  • Numeric:Discrete encoding type for the country_id column

    You can switch the country_id column from the Categorical to the Numeric:Discrete encoding type. With this encoding type, MOSTLY AI treats the numeric or alphanumeric values in the column as categories.


    Rare category protection does not impact columns with Numeric:Discrete encoding type.

    This way, you can preserve the referential integrity with the country table.
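To check whether a synthesized table still matches its lookup table, you can compare the foreign key values against the lookup keys. The sketch below uses the customer and country tables from the scenarios above, held as in-memory lists of dictionaries. The sample rows, the helper `orphaned_keys`, and the "Unknown" placeholder name are hypothetical.

```python
RARE_TOKEN = "_RARE_"


def orphaned_keys(child_rows, fk_column, lookup_ids):
    """Return the foreign key values that have no match in the lookup table."""
    return sorted({row[fk_column] for row in child_rows} - set(lookup_ids))


# Lookup table copied as-is to the destination database.
country = [
    {"country_id": "AT", "name": "Austria"},
    {"country_id": "DE", "name": "Germany"},
]

# Synthetic subject table generated with the Constant method.
synthetic_customer = [
    {"customer_id": 1, "country_id": "AT"},
    {"customer_id": 2, "country_id": RARE_TOKEN},
    {"customer_id": 3, "country_id": "DE"},
]

before = orphaned_keys(
    synthetic_customer, "country_id", [row["country_id"] for row in country]
)
print(before)  # foreign keys that break referential integrity

# Default scenario fix: add a _RARE_ record to the lookup table.
country.append({"country_id": RARE_TOKEN, "name": "Unknown"})

after = orphaned_keys(
    synthetic_customer, "country_id", [row["country_id"] for row in country]
)
print(after)
```

Against a real destination database, the same check could be expressed as a query for country_id values in the customer table that are missing from the country table.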