_RARE_ values
MOSTLY AI employs a number of privacy-protection mechanisms to ensure that the private data of people and other entities is protected in the generated synthetic data.
One such mechanism is Rare category protection, which is enabled by default with the Constant method for all categorical columns. The Constant method masks rare categories with a _RARE_ value. This default behavior helps to retain the distribution of categories from your original data.
Learn how Rare category protection works in the following sections and what you can do about unwanted _RARE_ values in your synthetic data.
Methods of rare category protection
Rare category protection has two methods of operation: Constant and Sample.
Constant method
With the Constant method, all rare categories in a categorical column are masked with the token _RARE_ to protect any people or entities that fall into such categories from being re-identifiable in the synthetic data.
An example of a rare category in a Job title column can be the category President of the United States, which makes the person behind the title instantly re-identifiable. With the Constant method, the category is masked with the _RARE_ token.
The goals of this approach are to:
- prevent the training of the AI model with rare categories
- retain the original distribution of categories in the synthetic data
The Constant method is the default method for all categorical columns and is the reason why _RARE_ values appear in your synthetic data.
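The masking step can be illustrated with a minimal Python sketch. The function name `mask_rare_constant` and the `min_count` threshold are assumptions made for illustration only; this page does not state how MOSTLY AI decides which categories count as rare.

```python
from collections import Counter

RARE_TOKEN = "_RARE_"


def mask_rare_constant(values, min_count=3):
    """Replace categories seen fewer than min_count times with RARE_TOKEN.

    min_count is a hypothetical threshold chosen for this illustration.
    """
    counts = Counter(values)
    return [v if counts[v] >= min_count else RARE_TOKEN for v in values]


# Hypothetical Job title column with one category that appears only once.
job_titles = (
    ["Senior Account Manager"] * 4
    + ["Software Engineer"] * 3
    + ["President of the United States"]
)

masked = mask_rare_constant(job_titles)
print(Counter(masked))
```

In this sketch the frequent categories keep their original counts, and the single rare title is the only value that becomes _RARE_.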
Sample method
With the Sample method, MOSTLY AI replaces any rare categories by sampling values from other non-rare categories.
For example, in a Job title column, the rare category President of the United States can be replaced by sampling from a non-rare category, such as Senior Account Manager.
While this method can prevent the appearance of _RARE_ values in your synthetic data, it comes with the trade-off of skewing the original distribution in the categorical column by boosting the non-rare categories that MOSTLY AI samples from.
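The idea behind the Sample method can be sketched in Python as follows. The function name `replace_rare_by_sampling`, the `min_count` threshold, and the choice to sample in proportion to the frequency of the non-rare categories are all assumptions for illustration; the actual sampling strategy in MOSTLY AI is not described here.

```python
import random
from collections import Counter


def replace_rare_by_sampling(values, min_count=3, seed=0):
    """Replace rare categories with values drawn from the non-rare ones.

    min_count and frequency-proportional sampling are hypothetical choices.
    """
    counts = Counter(values)
    # Keeping duplicates means frequent categories are drawn more often.
    frequent = [v for v in values if counts[v] >= min_count]
    if not frequent:
        raise ValueError("no non-rare categories to sample from")
    rng = random.Random(seed)
    return [v if counts[v] >= min_count else rng.choice(frequent) for v in values]


job_titles = (
    ["Senior Account Manager"] * 4
    + ["Software Engineer"] * 3
    + ["President of the United States"]
)

sampled = replace_rare_by_sampling(job_titles)
print(Counter(sampled))
```

No _RARE_ token appears in the result, but one of the non-rare categories now has a higher count than in the original column, which is the skew described above.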

Disabling Rare category protection
MOSTLY AI recommends that you do not disable Rare category protection for categorical columns.
If you do so, the Generative AI models in MOSTLY AI are trained with rare categories and generate rare categories in the synthetic data, which exposes your synthetic data to re-identification attacks.
For example, in a Job title column, the categories CEO, CTO, CFO, and other C-level positions typically appear only once per company. If you are processing a dataset that includes all employees at a company, the data of all C-level executives is immediately open to re-identification if Rare category protection is not enabled.
If you use the default Constant method, MOSTLY AI replaces the C-level positions with _RARE_ values and keeps their data private. In this case, the distribution of the remaining categories is preserved in the synthetic data.
With the Sample method, MOSTLY AI randomly replaces the C-level positions with any of the other job titles in the column. This protects the data from re-identification, but at the cost of accuracy, because the distribution of the remaining job title categories is skewed in the synthetic data.
Use the Rare category protection method that makes the most sense for your categorical columns. Alternatively, you can use a different encoding type or mock data for specific categorical columns, as explained in the sections below.
Cases with many _RARE_ values
In some cases, a categorical column in your synthetic data might contain only a few _RARE_ values. In other cases, however, a categorical column might contain nothing but _RARE_ values.
It all depends on the data in your categorical columns. The sections below review specific examples of categorical data and how that can impact the number of _RARE_ values that appear in your synthetic data.
Columns for names
Columns that contain first or last names of your data subjects are auto-assigned the Categorical encoding type and, by default, Rare category protection with the Constant method is enabled for such columns.
Such columns contain many distinct names, which MOSTLY AI treats as rare categories. As a result, most or all values in such columns are _RARE_ values.
For such cases, you can use one of the three alternatives suggested below.
Use Mock data for names
You can generate names with the Mock data generation method. To do so, you need to select the Person mock data type and the applicable name format:
- Full name
- First name
- Last name

Bear in mind that with Mock data for person names, MOSTLY AI generates English names.
Use the Text encoding type
You can use the Text encoding type to train the Generative AI models on the names in your original data, and then generate synthetic names.
Bear in mind that the Text encoding type might increase the time needed to train the Generative AI models, because the names in the original dataset are tokenized and the character sequence of each token is analyzed.
Pre-process your data to exclude names
Another approach is to pre-process your original data and exclude columns with names before you go on to generate synthetic data.
This approach adds an additional step to the process of synthesizing data, and MOSTLY AI recommends it only if you are familiar with the pre-processing of data.
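As a rough sketch of such a pre-processing step, the following Python snippet drops name columns from CSV data before it is uploaded for synthesis. The helper `drop_columns`, the column names, and the sample rows are hypothetical; adapt them to your own data and tooling.

```python
import csv
import io


def drop_columns(csv_text, columns_to_drop):
    """Return the CSV text without the listed columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [name for name in reader.fieldnames if name not in columns_to_drop]
    out = io.StringIO()
    # extrasaction="ignore" silently skips the dropped columns in each row.
    writer = csv.DictWriter(
        out, fieldnames=kept, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Hypothetical original data with two name columns.
original = (
    "first_name,last_name,job_title\n"
    "Ada,Lovelace,Engineer\n"
    "Alan,Turing,Analyst\n"
)

cleaned = drop_columns(original, {"first_name", "last_name"})
print(cleaned)
```

Only the job_title column remains in `cleaned`, so no name column is left to turn into _RARE_ values.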
Columns for codes
If a column contains alphanumeric ID codes, MOSTLY AI auto-assigns the Categorical encoding type and enables Rare category protection with the Constant method.
Because each ID is unique in the column, the synthetic data for the column contains only _RARE_ values.
Use Character sequence encoding type
If the original column contains short strings with a consistent pattern (phone numbers, license plate numbers, company IDs, and so on), you can use the Character sequence encoding type to train the Generative AI model for this table with the code patterns.

Use Primary key generation method
If the original column is a primary key column, you can use the Primary key generation method to generate primary key values. You can specify the type of primary key that you want to generate from the Generation format drop-down menu.
- Sequential (1, 2, 3, and so on)
- UUID (550e8400-e29b-41d4-a716-446655440000)
- UUID dashless (550e8400e29b41d4a716446655440000)
- UUID short (xhvVdrZD4iP5T8vBxuxm76)
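To get a feel for these formats, the sketch below produces comparable values with the Python standard library. It only illustrates what the formats look like; how MOSTLY AI generates keys internally is not described here, and the use of random (version 4) UUIDs and the helper `sequential_keys` are assumptions. The UUID short format is left out because its encoding is not specified on this page.

```python
import uuid


def sequential_keys(n, start=1):
    """Return n consecutive integer keys: start, start + 1, and so on."""
    return list(range(start, start + n))


key = uuid.uuid4()      # random UUID (assumed version 4)
uuid_key = str(key)     # 36 characters, including four dashes
dashless_key = key.hex  # the same 32 hex digits without dashes

print(sequential_keys(3))
print(uuid_key)
print(dashless_key)
```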

Columns for emails
If a column contains emails, MOSTLY AI again auto-assigns the Categorical encoding type and enables Rare category protection with the Constant method.
If each row in the column contains a unique email, all emails will appear as _RARE_ values in the synthetic data.
Use Email mock data
To avoid _RARE_ values in email fields, you can use the Email mock data type.

MOSTLY AI generates mock emails and uses only popular email providers for the email domain.
Lookup tables
Lookup tables include information about a finite number of entities. Examples of such entities can be countries, cities, phone codes, music_genres, and so on.
The sections below include best practices on how to handle lookup tables in your synthetic datasets.
When lookup tables contain only _RARE_ values
If you can, avoid adding lookup tables to your synthetic datasets. Lookup tables can increase the training and generation times and their synthesis is not relevant to the privacy-protection of data subjects.
If you add lookup tables to your synthetic dataset in MOSTLY AI, all of their categories are generated as _RARE_, because all values in such tables are unique.
Copy lookup tables to your destination database
As a best practice, copy your lookup tables to the destination database.
Referential integrity with lookup tables
If you synthesize database tables, you need to consider how Rare category protection can impact the referential integrity of your destination database.
To illustrate the scenarios below, imagine that you have a customer subject table and a country lookup table. A foreign key relationship exists between the two tables via the country_id column in the customer table.
As a best practice, keep the lookup table in the destination database and only synthesize the customer table. A side effect of this best practice is that you can no longer use the Foreign key generation method for the country_id column and MOSTLY AI auto-assigns the Categorical encoding type to the column.
- Default scenario
  MOSTLY AI auto-assigns the encoding type Categorical to the foreign key column country_id in the customer table. With the Constant method, Rare category protection generates _RARE_ values. Because a country lookup table is unlikely to have the entry _RARE_, this will most likely break the referential integrity.
  💡 In this case, to preserve the referential integrity, you could add a _RARE_ record to your country lookup table.
- Rare category protection with Sample method
  Before you synthesize, you can switch the Rare category protection method to Sample. In this case, MOSTLY AI replaces any rare categories by sampling values from non-rare categories.
  💡 With the Sample method, you can maintain the referential integrity, but this comes at the cost of skewed distributions and lower accuracy.
- Numeric:Discrete encoding type for the country_id column
  You can switch the country_id column from the Categorical to the Numeric:Discrete encoding type. With this encoding type, MOSTLY AI treats the numeric or alphanumeric values in the column as categories.
  💡 Rare category protection does not impact columns with Numeric:Discrete encoding type. This way, you can preserve the referential integrity with the country table.
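The referential integrity problem in the default scenario, and the effect of adding a _RARE_ record to the lookup table, can be checked with a small Python sketch. The helper `orphaned_keys` and the sample country_id values are hypothetical and stand in for your own customer and country tables.

```python
RARE_TOKEN = "_RARE_"


def orphaned_keys(foreign_keys, lookup_keys):
    """Return foreign key values that have no match in the lookup table."""
    return sorted(set(foreign_keys) - set(lookup_keys))


# Hypothetical primary keys of the country lookup table.
country_ids = {"AT", "DE", "US"}

# Hypothetical synthetic country_id column generated with the Constant method.
synthetic_country_ids = ["AT", "DE", RARE_TOKEN, "US", RARE_TOKEN]

# The _RARE_ token has no matching record, so integrity is broken.
missing = orphaned_keys(synthetic_country_ids, country_ids)
print(missing)

# After adding a _RARE_ record to the lookup table, nothing is orphaned.
country_ids_with_rare = country_ids | {RARE_TOKEN}
missing_after_fix = orphaned_keys(synthetic_country_ids, country_ids_with_rare)
print(missing_after_fix)
```

A check of this kind can be run against the destination database after generation to confirm that every country_id value in the customer table resolves to a record in the country table.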