A categorical variable has a fixed set of possible values that are already present in the input data. An example of such a variable is T-shirt size, which consists of the following categories: 'XS, S, M, L, XL'. Categorical variables prevent random values (for instance, 'XM, A, B') from appearing in your synthetic dataset.

The Rare category protection settings appear below the Encoding type section when you select this encoding type. These settings help protect rare categories, which may cause re-identification of outliers among your data subjects if they’re present in the resulting synthetic data.

Two rare category protection methods are available for masking these categories:

Constant

Replaces rare categories with the value _RARE_.

Sample

Replaces the rare categories with the categories that will appear in the synthetic version of this column.

Select the method that’s best suited for your use case.
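
As an illustration only, the difference between the two methods can be sketched in Python. This is not the product's implementation: the function name `protect_rare`, the `min_count` threshold, and the frequency-based definition of "rare" are assumptions made for this sketch. The Sample branch here draws replacements in proportion to how often the non-rare categories occur in the column.

```python
import random
from collections import Counter

RARE_TOKEN = "_RARE_"  # the replacement value used by the Constant method


def protect_rare(values, min_count=3, method="constant", seed=0):
    """Mask categories that occur fewer than `min_count` times.

    `min_count` is a hypothetical threshold chosen for this sketch.
    """
    counts = Counter(values)
    # Pool of non-rare values; drawing from it preserves their relative frequencies.
    frequent = [v for v in values if counts[v] >= min_count]
    if method == "sample" and not frequent:
        raise ValueError("no non-rare categories to sample from")
    if method not in ("constant", "sample"):
        raise ValueError(f"unknown method: {method!r}")

    rng = random.Random(seed)
    protected = []
    for v in values:
        if counts[v] >= min_count:
            protected.append(v)                     # keep frequent categories as-is
        elif method == "constant":
            protected.append(RARE_TOKEN)            # Constant: fixed placeholder
        else:
            protected.append(rng.choice(frequent))  # Sample: draw a non-rare category
    return protected


sizes = ["S"] * 4 + ["M"] * 5 + ["XXS"]  # "XXS" occurs once, so it is rare here
constant = protect_rare(sizes, method="constant")
sampled = protect_rare(sizes, method="sample")
```

With Constant, the single `XXS` entry becomes `_RARE_`; with Sample, it becomes either `S` or `M`, so no placeholder value shows up in the column.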

For categorical columns in linked tables, there’s also the Consistency correction setting. Activating consistency correction lets you bias this column towards having less diversity in your data subjects' sequences.


It helps the model remember all the previously generated values of a categorical column, so it can learn whether it needs to boost the probability of a particular category given that it has already been generated for a given subject.

To clarify how this works, let’s consider synthesizing a dataset containing customers' grocery purchases in a supermarket over time. Supermarkets carry about 40,000 items on average, whereas customers tend to consistently buy more or less the same products each time they visit.

When synthesizing the dataset without consistency correction, the number of unique products per customer may become too large. The model will assign a small probability to each of the 40,000 products, and these small probabilities add up to an exaggerated probability of buying some new item.
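
The arithmetic behind this effect can be made concrete with a small calculation. The figures for the customer's usual basket and the per-product probability are invented for illustration; only the 40,000 product count comes from the example above.

```python
n_products = 40_000    # catalogue size from the supermarket example
n_usual = 20           # hypothetical: products this customer buys regularly
p_each_unseen = 1e-5   # hypothetical: probability the model gives each other product

# Individually negligible probabilities, summed over every product never bought before.
p_new_item = (n_products - n_usual) * p_each_unseen
print(f"Probability that the next purchase is a new product: {p_new_item:.2%}")
```

Under these assumptions, about 40% of generated purchases would be products the customer has never bought, even though each single product looks almost impossible on its own.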

By activating consistency correction, you’re shifting the model’s predictions in favor of items that previously appeared in that customer’s sequence. As a result, the probability of new items being purchased on every visit goes down.
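
A minimal sketch of this shift, assuming a fixed multiplicative boost: probabilities of categories already present in the subject's history are scaled up, and the distribution is then renormalized. The function name `apply_consistency_boost` and the `boost` factor are hypothetical. As described above, the actual model learns whether and how much to boost, so this only illustrates the direction of the effect.

```python
def apply_consistency_boost(probs, history, boost=4.0):
    """Favor categories that already appeared in this subject's sequence.

    `probs` maps category -> predicted probability; `boost` is a fixed,
    hypothetical factor standing in for what the model would learn.
    """
    seen = set(history)
    weighted = {c: p * (boost if c in seen else 1.0) for c, p in probs.items()}
    total = sum(weighted.values())
    # Renormalize so the adjusted values form a probability distribution again.
    return {c: w / total for c, w in weighted.items()}


predicted = {"milk": 0.25, "bread": 0.25, "caviar": 0.25, "truffle": 0.25}
adjusted = apply_consistency_boost(predicted, history=["milk", "bread"])
```

In this toy case, the combined probability of the two items never bought before drops from 0.5 to 0.2, while `milk` and `bread` each rise from 0.25 to 0.4.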

Consistency correction helps with high cardinality columns. In this dataset, this would be the column with purchased items. The total number of unique items in this column would probably come close to the variety carried by the supermarket. However, consistency correction can help even when a column’s cardinality is only 50 to 100, as long as the column should have high consistency over time.

Enabling consistency correction increases memory and computational requirements. We recommend turning it on only when there’s a real need for it.