MOSTLY AI allows you to configure global job settings. These settings apply to all jobs and help optimize privacy protection, model training accuracy, and the overall duration of the synthesization process. Here, you can also manage user access to various synthesization features.
|Updated job settings only impact newly started jobs. Existing jobs will remain unchanged.|
Navigate to the global job settings page, shown in the image below.
There are three sections on this page: Encoding, Training, and Generation, corresponding to the stages of a synthetic data generation job. Each section has controls for adjusting synthesization behavior and for managing user access to various synthesization features.
All settings are explained in the sections below.
For the encoding stage, you can adjust how MOSTLY AI protects rare categories and, specifically for Users-Accounts-Transactions datasets, configure how particular privacy-sensitive attributes are protected.
For users, you can specify whether they can disable rare category protection and whether they can select the ITT encoding type.
- Frequency limit
This setting applies to Users-Accounts-Transactions datasets only and deals with protecting the privacy of users who have many accounts or uncommon account types.
These datasets may only have a few users who have many accounts. Also, if the dataset specifies account types, it may be the case that some of them are rarely used. These attributes pose a privacy risk for the users who have them.
With the frequency limit, you can set the minimum number of users that an account type, or a rare number of accounts, must have to be considered for model training.
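The idea behind the frequency limit can be sketched as a simple count-and-filter step. MOSTLY AI's internal implementation is not public; the function and field names below are hypothetical and only illustrate the behavior described above:

```python
from collections import Counter

def protected_account_types(records, frequency_limit):
    """Return account types held by at least `frequency_limit` users.

    records: list of (user_id, account_type) pairs.
    Types below the limit are excluded from model training, which
    protects users with uncommon account types.
    """
    counts = Counter(account_type for _, account_type in records)
    return {t for t, n in counts.items() if n >= frequency_limit}

records = [
    (1, "checking"), (2, "checking"), (3, "checking"),
    (4, "savings"), (5, "savings"),
    (6, "offshore"),  # uncommon type, privacy-sensitive
]
print(protected_account_types(records, frequency_limit=2))
# "offshore" appears only once, so it is filtered out
```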
- Enable stochastic thresholds
To ensure that the privacy of outlier subjects is optimally protected, please keep stochastic thresholds enabled.
This feature increases protection for outliers and extreme values. It prevents potential attackers from determining the exact number of data subjects in the original dataset that share a rare attribute. To this end, a stochastic algorithm probabilistically determines, for each categorical column, which rare categories appear in the synthetic data and which do not. The algorithm draws from a probability distribution as sketched below, here with a user-defined rare category threshold of 20.
The algorithm selects rare categories as follows: the larger a category’s group size (the number of subjects in which it appears), the higher the probability that the category will occur in the resulting synthetic dataset. The user-defined threshold value determines the center of the distribution.
With stochastic thresholds in place, every new job produces a synthetic dataset with a different selection of categories close to the user-defined threshold. To better understand how this prevents potential attacks, consider a scenario with a fixed threshold:
A MOSTLY AI user synthesizes the same dataset twice. A column in this dataset contains three rare categories: A, appearing 15 times; B, 19 times; and C, 30 times. The first job was configured with a fixed threshold of 20 and the second with 19. This change makes category B appear in the resulting synthetic dataset.
Suppose both synthetic versions came into the hands of an attacker. They could then infer from this difference that category B was present in the original training dataset exactly 19 times.
Stochastic thresholds improve protection for outliers and extreme values in the following way. Consider synthesizing the dataset with the 17,000 baseball players of the American Major League Baseball that is available in the Resources section. The country column of this dataset contains about 45 rare categories, 6 of which lie within the range of the stochastic distribution depicted above.
For the first job, MOSTLY AI selects three of them to appear in the resulting synthetic dataset — Curacao, Australia, and Germany, with a respective group size of 18, 24, and 26 subjects.
But for the second job, MOSTLY AI selects another set to appear in the dataset, namely Colombia, South Korea, and Germany, with a respective group size of 20, 22, and 26 subjects.
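The document does not specify the exact shape of the probability distribution; the sketch below assumes a logistic curve centered at the user-defined threshold, which reproduces the behavior described above: categories near the threshold are included with varying probability per job, while categories far above it are almost always kept.

```python
import math
import random

def inclusion_probability(group_size, threshold, spread=3.0):
    """Probability that a rare category is kept in the synthetic data.

    Assumes a logistic curve centered at the threshold; the actual
    distribution used by MOSTLY AI is not documented here.
    """
    return 1.0 / (1.0 + math.exp(-(group_size - threshold) / spread))

def keep_category(group_size, threshold, rng=random):
    """Stochastic decision: a fresh draw for every new job."""
    return rng.random() < inclusion_probability(group_size, threshold)

# Categories from the scenario above: A appears 15 times, B 19, C 30.
for name, size in [("A", 15), ("B", 19), ("C", 30)]:
    p = inclusion_probability(size, threshold=20)
    print(f"{name} (group size {size}): kept with probability {p:.2f}")
```

Because the decision is re-drawn per job, two runs on the same data no longer reveal where a fixed threshold sits, which defeats the comparison attack described earlier.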
- Minimum rare category protection threshold
Specify a global minimum rare category threshold value. Users won’t be able to set values below it. For stochastic thresholds, this value sets the cutoff point of the probability distribution.
- Enable ITT encoding type
Ticking this checkbox allows users to select the ITT encoding type for the datetime columns of their sequence tables.
- Users can disable rare category protection
If checked, users can set the rare category protection threshold for their categorical columns to 0, effectively turning it off.
For the training stage, you can specify the resources that the training algorithm can use. Increasing them improves synthetic data accuracy, but at the cost of larger memory consumption.
For users, you can specify whether they can enable consistency correction for their job.
Here you can configure the Regressor, Context Processor, and History Encoder components of the training algorithm.
Each of these components consists of layers that contain units. For each field, you can enter a comma-separated list of values specifying how many units each layer has; the number of items in the list determines the number of layers. The list 128, 64, 32, for example, sets three layers with 128, 64, and 32 units, respectively. These values don’t have to be powers of two.
You can specify a maximum of three layers per component type. You can consider adding layers when a component needs to process highly complex relationships and distributions in a dataset.
|Only make slight changes to these parameters. Considerably increasing them could promote overfitting.|
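A minimal sketch of how such a comma-separated units list could be parsed and validated. The limits follow the text below (at most three layers; for Regressors, units between 16 and 512), but the parsing code itself is illustrative, not MOSTLY AI's:

```python
def parse_layer_units(value, max_layers=3, lo=16, hi=512):
    """Parse a comma-separated units list such as "128, 64, 32".

    Returns one integer per layer, or None for a blank field,
    in which case MOSTLY AI derives settings from the dataset.
    """
    if not value.strip():
        return None
    units = [int(item) for item in value.split(",")]
    if len(units) > max_layers:
        raise ValueError(f"at most {max_layers} layers allowed")
    for u in units:
        if not lo <= u <= hi:
            raise ValueError(f"{u} outside allowed range {lo}-{hi}")
    return units

print(parse_layer_units("128, 64, 32"))  # → [128, 64, 32]: three layers
```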
- Number of units per layer for Regressors
Regressors are the components of the synthetic data generation model that represent a dataset’s columns. By adjusting the number of units per layer, you can improve synthetic data accuracy, but at the cost of larger memory consumption. A typical number of units for a Regressor component layer would be 20.
A plausible range for the number of units would be between 20 and 256. A suitable value would depend on the categorical columns' cardinality, the complexity of the distributions of the numeric and datetime columns, and the complexity of the dependencies with other columns for all data types.
You can enter values between 16 and 512.
If you leave the fields blank, MOSTLY AI will calculate reasonable settings for each dataset, based on its structure, before synthesizing it.
- Number of units per layer for the Context Processor
The Context Processor summarizes the subject table into the context used for conditional linked table generation. By adjusting the number of units per layer, you can improve synthetic data accuracy, but at the cost of larger memory consumption. A typical number of units for a Context Processor layer would be 512.
A plausible range for the number of units would be between 32 and 2048. A suitable value would depend on the number and complexity of columns in the subject table and how strongly they influence the linked table.
You can enter values between 16 and 8192.
- Number of units per layer for the History Encoder
The History Encoder creates the historical representation of a sequence. By adjusting the number of units per layer, you can improve synthetic data accuracy, but at the cost of larger memory consumption. A typical number of units for a History Encoder layer would be 512.
A plausible range for the number of units would be between 64 and 2048. A suitable value would depend on typical sequence lengths, the number and complexity of the linked table’s columns, and the strength of its temporal dependencies.
You can enter values between 16 and 8192.
- Consistency correction
Tick this box to allow users to enable consistency correction for their synthetic data generation job. This feature biases the categorical columns of the resulting synthetic dataset toward less diversity within each data subject’s sequence.
Consider synthesizing a dataset that contains customers’ grocery purchases in a supermarket over time. Supermarkets carry about 40,000 items on average, whereas each customer tends to consistently buy more or less the same products on each visit.
When synthesizing the dataset without consistency correction, the number of unique products per customer may become too large: the model assigns a small probability to each of the 40,000 products, which sums up to an exaggerated probability of buying some new items.
Consistency correction is an additional feature that shifts the model’s predictions in favor of items that previously appeared in that customer’s sequence. It reduces the probability of new items being purchased every visit.
Consistency correction helps with high-cardinality columns. In this dataset, that would be the column with purchased items, whose number of unique values would probably come close to the variety carried by the supermarket. However, consistency correction can be helpful even when a column’s cardinality is only 50 to 100, as long as the column should stay highly consistent over time.
Enabling consistency correction increases memory and computational requirements. We recommend turning it on only when there’s a real need for it.
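The mechanism can be sketched as reweighting the model’s next-item probabilities in favor of items already seen in the subject’s sequence. The boost factor and renormalization below are illustrative assumptions, not MOSTLY AI’s actual correction:

```python
def apply_consistency_correction(probs, history, boost=5.0):
    """Shift next-item probabilities toward previously seen items.

    probs: dict mapping item -> model probability (sums to 1).
    history: set of items already in this subject's sequence.
    boost: illustrative multiplier for previously seen items.
    """
    weighted = {item: p * (boost if item in history else 1.0)
                for item, p in probs.items()}
    total = sum(weighted.values())
    return {item: w / total for item, w in weighted.items()}

base = {"milk": 0.4, "bread": 0.3, "caviar": 0.3}
corrected = apply_consistency_correction(base, history={"milk", "bread"})
# "milk" and "bread" gain probability; the never-bought "caviar" loses it.
```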
- Enable GPU training
Tick this box to allow users to train their synthetic data generation models on GPU hardware. Training on GPUs can be considerably faster for sequential datasets that only have a few columns. The choice of hardware won’t affect synthetic data quality or privacy.
- Generation Batch Size
Adjust the number of subjects that are generated simultaneously. For subject tables, this refers to the number of rows that are created at the same time. For time-series datasets, this also includes the generation of entire sequences for each subject. Increasing this value also increases memory consumption.
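The effect of the batch size can be sketched as simple chunking of the subjects to be generated; memory usage grows with the number of subjects (and, for time-series data, their entire sequences) held in one batch. A hypothetical sketch:

```python
def generation_batches(n_subjects, batch_size):
    """Split the requested subjects into generation batches.

    Each batch is generated at once. For time-series datasets, every
    subject in a batch also carries its whole sequence, so larger
    batches trade memory for speed.
    """
    return [min(batch_size, n_subjects - start)
            for start in range(0, n_subjects, batch_size)]

print(generation_batches(10, 4))  # → [4, 4, 2]
```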
- Allow users to download the synthetic data before the QA report is generated
By ticking this checkbox, you allow users to download the synthetic data at the earliest possible moment. This, however, carries the risk that the data is shared before its privacy and accuracy have been verified.