A categorical variable has a fixed set of possible values that are already present in the input data. An example of such a variable is T-shirt size, which consists of the following categories: 'XS, S, M, L, XL'. Categorical variables prevent random values (for instance, 'XM, A, B') from appearing in your synthetic dataset.

When you select this type, the Rare category protection settings appear below the Encoding type section. These settings help protect rare categories, which may cause re-identification of outliers among your data subjects if they’re present in the resulting synthetic data.

Two rare category protection methods are available for masking these categories:

Constant

Replaces the rare categories with a constant value. The default constant value is *.

Sample data

Replaces the rare categories with the categories that will appear in the synthetic version of this column.

With the Threshold parameter, you can specify when a category is considered rare.
A Threshold of 20 means that if a category is present in 20 or fewer subjects, it will be masked using the method you specified.
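
The following Python sketch illustrates the two methods and the Threshold rule described above. It is not MOSTLY AI's implementation: the function name `mask_rare_categories` and its parameters are hypothetical, each row is assumed to represent one subject, and the Sample data method is approximated by drawing replacements from the non-rare values in proportion to their frequency.

```python
import random
from collections import Counter


def mask_rare_categories(values, threshold=20, method="constant",
                         constant="*", seed=0):
    """Mask categories that occur in `threshold` or fewer rows.

    Illustrative only; names and behavior are assumptions, not the product's API.
    """
    counts = Counter(values)
    # A category is rare if it is present in `threshold` or fewer rows.
    rare = {cat for cat, n in counts.items() if n <= threshold}

    if method == "constant":
        # Replace every rare value with the same constant label.
        return [constant if v in rare else v for v in values]

    if method == "sample":
        # Assumption: replacements are drawn from the non-rare values,
        # in proportion to how often each one occurs.
        common = [v for v in values if v not in rare]
        if not common:
            raise ValueError("every category is rare; nothing to sample from")
        rng = random.Random(seed)
        return [rng.choice(common) if v in rare else v for v in values]

    raise ValueError(f"unknown method: {method!r}")


# Toy data: two frequent countries and two rare ones.
countries = ["USA"] * 100 + ["Canada"] * 30 + ["Greece"] * 1 + ["Belgium"] * 3
masked = mask_rare_categories(countries, threshold=20)
sampled = mask_rare_categories(countries, threshold=20, method="sample")
print(Counter(masked))
print(Counter(sampled))
```

With the constant method, the four rows for Greece and Belgium end up under the `*` label. With the sample method, they are redistributed over the categories that remain, so no extra label appears in the column.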

The two charts below demonstrate how this works using the Baseball dataset. This dataset contains the records of 19,000 professional baseball players from 57 different countries, describing their country of origin, name, weight, height, etc.

Baseball is a popular sport in the U.S.A. and some other countries in that region of the world, so it’s rare to find professional baseball players from European or Asian countries. The chart below, which depicts the original dataset, reflects this distribution: over 16,000 baseball players come from the U.S.A., whereas only a few come from Belgium, Austria, or the Philippines.

[Figure: rare category protection 1]

You can prevent these baseball players from being identified by their country of origin by masking these categories. The chart below, which depicts the synthetic dataset, shows the result of replacing them with the * label. Of the countries in the original dataset, only 16 remain visible in the synthetic dataset. The subjects from the remaining countries are in the new * category, which prevents the re-identification of the sole baseball player from Greece, Indonesia, or Singapore.

[Figure: rare category protection 2]

Another use case for the Categorical encoding type is postal codes (or ZIP codes). Treating a postal code column as Categorical rather than Numeric makes a big difference in the synthetic data generation process.

Most countries use numeric postal code systems; only a few use alphanumeric ones. MOSTLY AI automatically detects alphanumeric postal codes as categories. For numeric postal codes, however, it is likely to assign the Numeric encoding type.

We highly recommend verifying the encoding type for your postal code column. If the synthesization process uses the Numeric encoding type, unique postal codes may appear that aren’t present in the original dataset.
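
The toy Python sketch below illustrates why the encoding type matters for postal codes. It does not reproduce MOSTLY AI's generative model: uniform random draws stand in for the generator, and the postal codes are made up for the example. The point is only that a categorical view can return nothing but values seen in the original, whereas a numeric view can return any number in the observed range.

```python
import random

# Made-up postal codes standing in for an original column.
original_codes = ["1010", "1020", "1030", "8010", "8020"]
rng = random.Random(42)

# Categorical view: only values seen in the original can be drawn.
categorical_draws = [rng.choice(original_codes) for _ in range(1000)]

# Numeric view (crude stand-in): any integer between the observed
# minimum and maximum is a possible output.
lowest = min(int(code) for code in original_codes)
highest = max(int(code) for code in original_codes)
numeric_draws = [str(rng.randint(lowest, highest)) for _ in range(1000)]

# Codes that appear in the draws but not in the original column.
unseen_categorical = set(categorical_draws) - set(original_codes)
unseen_numeric = set(numeric_draws) - set(original_codes)

print("unseen codes, categorical view:", len(unseen_categorical))
print("unseen codes, numeric view:", len(unseen_numeric))
```

The categorical view yields no unseen codes, while the numeric view yields codes that never occurred in the original column.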

Setting the Threshold lower than 20 may introduce privacy risks.
Aside from protecting privacy, setting a Threshold also reduces the computational resources needed for high-cardinality columns, because fewer distinct categories remain after masking.