A categorical variable has a fixed set of possible values that are already present in the input data. An example of such a variable is
T-shirt size, which consists of the following categories: 'XS, S, M, L, XL'. Treating a variable as categorical prevents values outside this set (for instance, 'XM, A, B') from appearing in your synthetic dataset.
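
As a rough illustration of why only known categories can appear, the sketch below draws synthetic values from the categories observed in the input, weighted by their frequency. It is a simplified stand-in, not the product's actual generation method; the function name `sample_categorical` and the example data are assumptions made for this example.

```python
import random
from collections import Counter


def sample_categorical(observed, n, seed=0):
    """Draw n values from the categories seen in `observed`.

    Hypothetical helper for illustration only: values are sampled in
    proportion to how often each category occurs in the input, so a value
    that never occurs in the input can never be produced.
    """
    counts = Counter(observed)
    categories = list(counts)
    weights = [counts[c] for c in categories]
    rng = random.Random(seed)
    return rng.choices(categories, weights=weights, k=n)


# Made-up input column for the T-shirt size example.
observed_sizes = ["XS", "S", "M", "M", "L", "XL", "M", "S"]
synthetic_sizes = sample_categorical(observed_sizes, 20)
```

Every value in `synthetic_sizes` is one of 'XS, S, M, L, XL', because the sampler has no other values to choose from.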
Rare category protection settings appear below the Encoding type section when you select this type. These settings help protect rare categories, which may cause re-identification of outliers among your data subjects if they are present in the resulting synthetic data.
Two rare category protection methods are available for masking these categories:
Replaces the rare categories with a constant value. The default constant value is
Replaces the rare categories with the categories that will appear in the synthetic version
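
The two masking approaches described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the product's implementation: the function `protect_rare`, the `min_count` threshold that decides what counts as rare, the method labels `"constant"` and `"sample"`, and the placeholder value `"OTHER"` are all hypothetical. In particular, `"OTHER"` is not claimed to be the default constant value.

```python
import random
from collections import Counter


def protect_rare(values, min_count, method, constant=None, seed=0):
    """Mask categories that occur fewer than `min_count` times.

    Hypothetical helper for illustration only.
    method="constant": replace each rare value with `constant`.
    method="sample":   replace each rare value with a non-rare category,
                       drawn in proportion to its frequency.
    """
    counts = Counter(values)
    common = [c for c in counts if counts[c] >= min_count]

    if method == "constant":
        if constant is None:
            raise ValueError("a constant value is required for this method")
        return [v if counts[v] >= min_count else constant for v in values]

    if method == "sample":
        if not common:
            raise ValueError("no non-rare categories to sample from")
        rng = random.Random(seed)
        weights = [counts[c] for c in common]
        return [
            v if counts[v] >= min_count
            else rng.choices(common, weights=weights, k=1)[0]
            for v in values
        ]

    raise ValueError(f"unknown method: {method!r}")


# Made-up column in which 'XXXL' occurs once and is therefore rare.
sizes = ["M"] * 5 + ["S"] * 4 + ["XXXL"]

masked = protect_rare(sizes, min_count=2, method="constant", constant="OTHER")
sampled = protect_rare(sizes, min_count=2, method="sample")
```

With the constant approach, the single 'XXXL' entry becomes the placeholder, so the output signals that a rare value was present. With the sampling approach, it becomes one of the non-rare categories, so the output contains only categories that are safe to release.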