To generate a synthetic database from a data catalog, click on Data catalog
in the left side menu and select the data catalog you want to synthesize.

The same job settings tabs appear as when the data catalog was created. Let’s walk through the steps below to launch your job.
-
Click on the Settings tab to optionally adjust the number of training and generated subjects and select a destination for the synthetic data.
-
If you want to review the job settings, click on the Column details tab and browse the settings. You cannot change them here, however. To edit them, select Data catalogs from the left side main menu, and select the data catalog you want to edit.
-
Lastly, click Launch job to start the synthetic data generation job.
Observing the synthesization job
A page appears that informs you about the status of your job.
Once MOSTLY AI completes the synthetization job, the selected data destination will contain the synthetic version of your dataset.
There are two sections on this page that inform you about the synthetic data generation process:
-
The top section provides general information:
Job name: The name of the job as it was specified during its configuration.
Job type: There are four job types:
-
Ad hoc synthesizes a dataset uploaded using the web UI.
-
Data catalog synthesizes a database or dataset stored in a cloud bucket or local server.
-
Generate with subject count creates a specified number of new synthetic subjects from a previous job’s already trained AI model.
-
Generate with seed creates a linked table for an uploaded subject table using a previous job’s already trained AI model.
Uploaded: This field indicates when the original dataset was uploaded.
Tasks completed: The number of tasks in this job that have been completed.
Data catalog: The name of the data catalog that is being synthesized.
Data destination: The destination where the synthetic data will be written to.
-
The Job summary section informs you about the synthesization tasks currently being performed. It shows which tables are being synthesized, the current tasks, their status, and the total duration. In addition, you can click on the kebab icon on the right side of each entry to see a detailed task list or an overview of the columns' generation methods and encoding types.
A task list appears when you choose View tasks from the kebab menu. The table below provides an overview of all the tasks and steps you will see in this list.
Task | Step | Description
Synthetizing table / Generating text | Organizing data | Ensures that very large tables can be processed regardless of system memory size.
Synthetizing table / Generating text | Data analysis | The table is analyzed for its data types and unique values.
Synthetizing table / Generating text | Transforming data | The table is transformed for efficient processing.
Synthetizing table / Generating text | AI training | Using generative neural networks, a model is trained to retain your dataset’s granularity, statistical correlations, structures, and time-dependencies.
Synthetizing table / Generating text | Generating synthetic data | The resulting AI model is used to create a synthetic version of the table.
Packaging synthetic data | Creating zip archive | Creates a ZIP archive with the synthetic version of the dataset.
Creating the quality assurance report | Analyzing synthetic data for quality and accuracy | The resulting synthetic table is tested against the original for accuracy and privacy. It checks for identical information matches and whether the synthetic subjects are dissimilar enough to the original subjects to prevent re-identification.
To learn more about the AI training step, you can click on View training logs to see how the AI model training is progressing. Here you can find a chart depicting the training and validation loss per epoch. If you consider the model to be sufficiently trained, or you want to speed up the synthetization process, you can click on Stop training to skip to the synthetic data generation step.
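One way to judge whether a model is "sufficiently trained" is to check whether the validation loss has stopped improving. The sketch below is only an illustration of that idea, not how MOSTLY AI decides anything internally: the function name, the `patience` and `min_delta` parameters, and the loss values are all hypothetical, and the numbers would have to be read off the training log chart by hand.

```python
def has_plateaued(val_losses, patience=5, min_delta=1e-3):
    """Return True if the last `patience` epochs did not improve on the
    best earlier validation loss by at least `min_delta`."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return best_recent > best_before - min_delta


# Hypothetical validation losses per epoch, e.g. read off the chart.
flat_run = [2.10, 1.60, 1.31, 1.20, 1.15, 1.150, 1.151, 1.150, 1.152, 1.150]
improving_run = [2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.8, 0.6]

print(has_plateaued(flat_run, patience=5))       # loss is flat
print(has_plateaued(improving_run, patience=3))  # loss is still falling
```

If the check reports a plateau, stopping the training early is unlikely to cost much accuracy. If the loss is still falling, letting the training continue is usually the safer choice.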
After the job is completed
The synthetic data’s QA report appears once the synthetization job is completed. It consists of an executive summary and the report’s detailed privacy and accuracy charts, which are accessible by clicking on View full report at the top-right of the summary.
The View full report option may not be available if the dataset’s privacy and accuracy are in good shape and Generate detailed QA report was disabled when configuring the job or data catalog.

The executive summary provides general information about the synthetic dataset’s accuracy, whether it passed the privacy tests, the number and type of columns, and the number of generated subjects.
The green Accuracy and Privacy tests checkmarks indicate that the synthetization was successful and that you can share the synthetic data across your business and partnerships.
Synthetic datasets that failed the privacy tests are not privacy-safe. If you have a failed dataset, please rerun your job using the remediation steps below.
Viewing the privacy and accuracy charts
The privacy and accuracy charts are organized by the following tabs:
-
Privacy
-
Univariate distributions
-
Correlations
-
Bivariate distributions
-
Accuracy

You can learn more about specific data points by hovering your mouse pointer over a chart. Each of the univariate and bivariate charts also has an expand button at its top-right corner, which you can click to see a larger version of the chart.
You can find a detailed explanation of our privacy and accuracy metrics in the Reading the QA report guide.
Viewing the linked table accuracy charts
If your dataset contains a linked table, you’ll find some charts in the Univariate distributions and Bivariate distributions tabs that depict its accuracy.
You can recognize these univariate distributions as follows:
<linked table>:count | This chart depicts the distribution of linked table rows per subject.
Linked table column charts | These charts have a

You can see a similar naming convention for the bivariate charts. The linked table columns have a 1
in their name, and the charts depict analyses between a subject table and a linked table column, or two linked table columns.
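To make the count chart more concrete, the sketch below computes a distribution of linked table rows per subject for a tiny example. It is a minimal illustration under assumed data: the subject IDs, the foreign-key list, and both function names are hypothetical and are not part of the product.

```python
from collections import Counter


def rows_per_subject(subject_ids, linked_foreign_keys):
    """Count linked-table rows for each subject, including subjects with none."""
    counts = Counter(linked_foreign_keys)
    return {sid: counts.get(sid, 0) for sid in subject_ids}


def count_distribution(per_subject_counts):
    """Map 'number of linked rows' to 'number of subjects with that many rows'."""
    return dict(sorted(Counter(per_subject_counts.values()).items()))


# Hypothetical subject table IDs and the foreign keys of a linked table.
subjects = ["a", "b", "c", "d"]
linked_fk = ["a", "a", "b", "a", "d"]

per_subject = rows_per_subject(subjects, linked_fk)
print(count_distribution(per_subject))
```

Comparing this distribution for the original and the synthetic data shows whether the synthetic subjects have a realistic number of linked rows.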
Downloading the QA report
You can also download a PDF version of the QA report. There are two options to do so:
-
In the job details at the top, click on the kebab icon and select Download QA report from the menu.
-
In the Job summary, scroll down to the Analyzing section and click on the Download QA report button.
Sharing your job
You can share your job, including its synthetic data and the QA report, with other user groups. The sharing options also let you grant read access to all authenticated users or transfer the job’s ownership to another user.
To do so, go to the job details at the top, click on the kebab icon and select Sharing options from the menu.

A dialog box appears where you can select the groups you want to share the job with, grant read access to all authenticated users, or transfer the job’s ownership to another user:

After clicking Save, you’ll be asked to confirm your choices. A transfer of ownership or change of groups may cause you to lose access to this job. Please review whether this is the case and, if so, whether it’s intended.
Remediating datasets that failed the privacy checks
If the dataset failed one or more privacy checks, then you can try the following remedial actions:
-
In your job’s general settings, if you specified the number of training subjects in your original job, then you can try to increase this value.
-
When configuring your tables, you can:
-
switch the Optimize for setting from Accuracy to Speed.
-
reduce the number of training epochs by specifying a lower number than the number of epochs used for training the synthetization model.
-
Super Admins can reduce the number of network parameters by reducing the number of regressor, history, and context units in the global job settings.
Remediating columns with poor synthetic data accuracy
The QA report’s accuracy chart compares the synthetic and target data, showing the similarity of the univariate and bivariate distributions. The teal-colored cells represent well-kept distributions (accuracy of 90% or more), whereas light-grey colored cells show a systematic difference between synthetic and target data.
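As a rough intuition for what an accuracy percentage can express, the sketch below scores one categorical column as 1 minus the total variation distance between the target and synthetic category frequencies. This is an assumption made for illustration only. The QA report's exact formula is not described in this section, and the function name and sample data are hypothetical.

```python
from collections import Counter


def univariate_accuracy(target, synthetic):
    """Return 1 minus the total variation distance between the category
    frequencies of two columns (1.0 means identical distributions)."""
    if not target or not synthetic:
        raise ValueError("both columns must contain at least one value")
    t_counts, s_counts = Counter(target), Counter(synthetic)
    categories = set(t_counts) | set(s_counts)
    tvd = 0.5 * sum(
        abs(t_counts[c] / len(target) - s_counts[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - tvd


# Hypothetical clothing-size column in the target and synthetic data.
target_col = ["S"] * 30 + ["M"] * 50 + ["L"] * 20
synthetic_col = ["S"] * 28 + ["M"] * 49 + ["L"] * 23

accuracy = univariate_accuracy(target_col, synthetic_col)
print(round(accuracy, 2))  # compare against the 90% mark mentioned above
```

Under this assumed measure, a column whose synthetic frequencies drift far from the target would fall below the 90% mark and would be a candidate for the remediation steps in this section.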
You may improve the accuracy of poorly covered features by selecting a better-suited encoding type or refining their configuration. To do so, you can use the following points as a guide:
-
For columns with numerical values, please consider whether they represent a fixed set of possible values, such as postal codes or clothing sizes, or whether they may vary in the synthetic data, such as weight and height. Use the Categorical or Numerical encoding type, respectively.
-
Date and time values may not have been detected by MOSTLY AI. They could thus have been synthesized using the Categorical instead of the Datetime encoding type. You can verify whether this is the case by looking for rare category protection artifacts, such as the * label in the bivariate distributions. To mitigate this issue, consider preprocessing these values so that they adhere to the supported datetime formats.
-
If the original data has a low number of subjects, consider adjusting the rare category protection accordingly. However, please be aware that setting the threshold lower than 20 may introduce privacy risks.
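The datetime preprocessing mentioned in the list above could look like the following sketch, which rewrites values from a few input formats into a single `YYYY-MM-DD HH:MM:SS` form before upload. Both the list of input formats and the choice of output format are assumptions for illustration. Please check the list of supported datetime formats before settling on a target format.

```python
from datetime import datetime

# Assumed input formats found in the raw column; extend as needed.
CANDIDATE_FORMATS = ["%d/%m/%Y %H:%M", "%d.%m.%Y", "%Y%m%d"]


def to_iso(value, formats=CANDIDATE_FORMATS):
    """Return `value` as 'YYYY-MM-DD HH:MM:SS', or None if no format matches."""
    for fmt in formats:
        try:
            parsed = datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue  # try the next candidate format
        return parsed.strftime("%Y-%m-%d %H:%M:%S")
    return None


# Hypothetical raw values, including one that cannot be parsed.
raw_values = ["31/12/2021 23:59", "01.02.2020", "20190704", "n/a"]
cleaned = [to_iso(v) for v in raw_values]
print(cleaned)
```

Values that come back as `None` did not match any candidate format and are worth reviewing manually rather than silently dropping.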
Getting support from MOSTLY AI
If you have persistent privacy, security, or other data quality issues, or your tables keep failing despite remedial actions by your system administrator, you’re welcome to raise the issue with your MOSTLY AI account manager.
To do so, describe the issue and attach the logs and settings of the poorly performing job to your message. You can download these documents by following the steps below:
-
Click on Jobs in the left side main menu and scroll down to the Previous jobs section.
-
Look up your job and click on the three-dot menu icon on the rightmost side of the row.
-
Next, click on Download job settings and then on Download job logs. The documents will download to the default download location of your browser.