Click on the Column details tab to optionally configure the training parameters. As you can specify different training parameters for each table in your dataset, select the table you want to configure from the table list and scroll down to the Training parameters section.
With these parameters, you can:
Optimize AI model training for accuracy or speed.
Tweak the training parameters if the results of an earlier run were not of the desired accuracy or took too long to generate.
Disable the generation of detailed accuracy and privacy charts if the table’s synthetic data accuracy and privacy are in good shape.
For event tables, you can also limit the number of records per subject.
Read on below to learn more.
An epoch is one complete pass of the table forward and backward through the neural network. MOSTLY AI starts new epochs until the neural network has optimally learned your dataset’s features. Unfortunately, it’s not possible to know beforehand how many epochs are needed.
This setting allows you to limit the number of epochs to, for instance, 2, 5, or 10. A limit can significantly reduce the time to generate your synthetic dataset, but at the cost of accuracy.
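The effect of an epoch limit can be sketched with a toy training loop. This is only an illustration and not MOSTLY AI's implementation: the function `train`, its `max_epochs` argument, and the loss curve are all hypothetical.

```python
def train(max_epochs=None):
    """Run epochs until the toy loss stops improving or the cap is reached."""
    best_loss = float("inf")
    epochs_run = 0
    while max_epochs is None or epochs_run < max_epochs:
        epochs_run += 1  # one full forward and backward pass over the table
        # Hypothetical loss curve: improves each epoch, then plateaus at 0.2.
        loss = max(0.2, 1.0 / epochs_run)
        if loss >= best_loss:
            break  # no further improvement: the toy model has converged
        best_loss = loss
    return epochs_run, best_loss


print(train())              # (6, 0.2): runs until the loss plateaus
print(train(max_epochs=2))  # (2, 0.5): stops early with a higher loss
```

With the cap, training finishes after two epochs instead of six, but the final loss is higher, which mirrors the trade-off between time and accuracy described above.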
MOSTLY AI won’t pass the entire dataset into the neural net at once. Instead, it divides your dataset into batches and updates the neural network’s parameters after each batch.
Setting the batch size to 1 will update these parameters after processing each training example. This results in the longest training time but allows you to process the largest possible models.
Setting a large batch size can significantly speed up the training but at the cost of memory.
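As a rough illustration of how the batch size changes the number of parameter updates in one epoch, consider the sketch below. The helper names `batches` and `updates_per_epoch` are hypothetical and do not come from MOSTLY AI.

```python
import math


def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def updates_per_epoch(num_examples, batch_size):
    """One parameter update per batch, so this is the number of batches."""
    return math.ceil(num_examples / batch_size)


print(list(batches([0, 1, 2, 3, 4], 2)))  # [[0, 1], [2, 3], [4]]
print(updates_per_epoch(10_000, 1))       # 10000 updates per epoch
print(updates_per_epoch(10_000, 256))     # 40 updates per epoch
```

A batch size of 1 needs 10,000 updates to get through 10,000 training examples, while a batch size of 256 needs only 40, but each of those batches has to fit into memory at once.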
If you get
The learning rate controls the extent to which the neural network learns from its mistakes after each batch of training subjects. Suitable values lie between 0 and 1 and are spread on an exponential scale.
0.1, for instance, is a high value, and for each step lower, you respectively have
A learning rate of 1 would mean that the neural net would very quickly process the training subjects, but it would fail to learn your dataset correctly.
Conversely, a very low learning rate, such as 0.00000001, would learn your dataset precisely but would take forever to complete the training.
|The learning rate is automatically optimized during the training process.|
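The behavior at the two extremes can be reproduced on a deliberately simple problem. The sketch below uses plain gradient descent to minimize w², whose optimum is w = 0. It is a stand-in for a neural network, not the optimizer MOSTLY AI uses, and the function name is hypothetical.

```python
def gradient_descent(learning_rate, steps=100, start=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w             # derivative of w**2
        w -= learning_rate * gradient  # step against the gradient
    return w


for rate in (1.0, 0.1, 0.00000001):
    print(rate, gradient_descent(rate))
```

With a rate of 1, every step overshoots the optimum and w flips between 1 and -1 without ever getting closer. With 0.1, w shrinks by a fifth per step and is practically 0 after 100 steps. With 0.00000001, w has barely moved from its starting value after the same number of steps.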
This toggle allows you to select a suitable training stopping condition for your synthetic data generation task.
By default, MOSTLY AI will aim to achieve the highest attainable synthetic data accuracy. It stops the training when the validation loss stops improving.
If you switch the toggle to speed, MOSTLY AI stops the training as soon as the rate of improvement decreases. This significantly reduces the training time, but at the cost of accuracy.
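The difference between the two stopping conditions can be sketched on a recorded list of validation losses. The exact criteria MOSTLY AI applies are not stated here, so this sketch makes two assumptions: "accuracy" stops at the first epoch without any improvement, and "speed" stops once the relative improvement per epoch falls below a hypothetical threshold `min_rel_improvement`.

```python
def stop_epoch(val_losses, mode="accuracy", min_rel_improvement=0.05):
    """Return the 1-based epoch at which training would stop.

    Assumes all losses are positive. Both criteria are illustrative only.
    """
    if mode not in ("accuracy", "speed"):
        raise ValueError(f"unknown mode: {mode}")
    for i in range(1, len(val_losses)):
        previous, current = val_losses[i - 1], val_losses[i]
        improvement = (previous - current) / previous
        if mode == "accuracy" and improvement <= 0:
            return i + 1  # validation loss no longer improves
        if mode == "speed" and improvement < min_rel_improvement:
            return i + 1  # improvement has slowed down
    return len(val_losses)


losses = [1.0, 0.6, 0.4, 0.39, 0.385, 0.384, 0.384, 0.39]
print(stop_epoch(losses, mode="accuracy"))  # 7
print(stop_epoch(losses, mode="speed"))     # 4
```

On this made-up loss curve, the speed setting stops three epochs earlier and accepts a validation loss of 0.39 instead of 0.384.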
The Generate full QA report toggle allows you to disable the generation of detailed accuracy and privacy charts for this table. An executive summary stating the synthetic data accuracy and whether the privacy tests passed or failed will always be available.
This option allows you to speed up the analysis of the synthetic data if its accuracy and privacy are in good shape. However, if the accuracy is lower than 90% or the privacy tests fail, the detailed accuracy and privacy charts will be generated anyway.
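The rule described above can be written as a single condition. The function and parameter names below are hypothetical. Only the 90% threshold and the privacy-test fallback come from the description.

```python
def generates_detailed_charts(full_report_enabled, accuracy, privacy_passed):
    """Return True if the detailed accuracy and privacy charts are generated.

    accuracy is a fraction between 0 and 1 (0.90 corresponds to 90%).
    """
    return full_report_enabled or accuracy < 0.90 or not privacy_passed


# Toggle off and results in good shape: only the executive summary.
print(generates_detailed_charts(False, 0.96, True))   # False
# Toggle off, but accuracy below 90%: charts are generated anyway.
print(generates_detailed_charts(False, 0.85, True))   # True
# Toggle off, but privacy tests failed: charts are generated anyway.
print(generates_detailed_charts(False, 0.96, False))  # True
```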
With subject table - event table datasets, you might have many records per subject that might not be relevant from a statistical point of view. By limiting the number of records, you can reduce the computational resources required to process your dataset.
For instance, bank transaction datasets often have a very skewed distribution of the number of events per customer. Customers have 150 transactions per account on average, but there are also outliers with up to 1000 transactions.
MOSTLY AI allows you to limit the number of records per subject or drop the subject from the dataset entirely if they exceed this limit.
To limit the number of records per subject, specify the Max records per subject and select from the dropdown menu how a subject that exceeds this limit is treated. Here, you can choose either Yes, limit records or Yes, drop subjects.
|If the sequence lengths in your dataset are unevenly distributed, we recommend that you do not turn off Limit Records Per Subject. This feature reduces the privacy risk for outlier subjects.|
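The two ways of treating a subject that exceeds the limit can be sketched as follows. This is an illustration with hypothetical names. In particular, keeping the first records of an oversized subject is an assumption, as which records are retained is not specified here.

```python
from collections import defaultdict


def limit_records(events, max_records, mode="limit"):
    """Group (subject_id, record) pairs by subject and apply the limit.

    mode="limit": keep at most max_records per subject (assumption: the first ones).
    mode="drop":  remove every subject that has more than max_records.
    """
    if mode not in ("limit", "drop"):
        raise ValueError(f"unknown mode: {mode}")
    by_subject = defaultdict(list)
    for subject_id, record in events:
        by_subject[subject_id].append(record)

    result = {}
    for subject_id, records in by_subject.items():
        if len(records) <= max_records:
            result[subject_id] = records
        elif mode == "limit":
            result[subject_id] = records[:max_records]
        # mode == "drop": the subject is left out entirely
    return result


events = [("a", 1), ("a", 2), ("a", 3), ("b", 10)]
print(limit_records(events, 2, mode="limit"))  # {'a': [1, 2], 'b': [10]}
print(limit_records(events, 2, mode="drop"))   # {'b': [10]}
```

Limiting keeps every subject in the dataset but truncates long sequences, whereas dropping removes outlier subjects altogether.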