Before downloading and sharing the resulting synthetic data, it’s important to evaluate whether the data is actually privacy-safe.

The synthetic data’s QA report appears once the synthetization job is completed. It consists of an executive summary and the report’s detailed privacy and accuracy charts, which are accessible by clicking on View full report at the top-right of the summary.

QA report

Use the Quality Assurance Report for .. dropdown menu to select the table whose executive summary you want to see. The summary provides general information about the table’s synthetic data accuracy, whether it passed the privacy tests, the number and type of columns, and the number of generated subjects.

Green Accuracy and Privacy tests checkmarks indicate that the synthetization was successful and that you can share the synthetic data across your business and partnerships.

Synthetic datasets that failed the privacy tests are not privacy-safe. If you have a failed dataset, please rerun your job using the remediation steps below.

Viewing the privacy and accuracy charts

The privacy and accuracy charts are organized by the following tabs:

  • Privacy

  • Univariate distributions

  • Correlations

  • Bivariate distributions

  • Accuracy

QA report

You can learn more about specific data points by hovering your mouse pointer over a chart. Each of the univariate and bivariate charts also has an expand button at its top-right corner, which you can click to see a larger version of the chart.

You can find a detailed explanation of our privacy and accuracy metrics in the Reading the QA report guide.

Viewing the linked table accuracy charts

The linked table’s QA report contains charts in the Univariate distributions and Bivariate distributions tabs that can help you assess the accuracy of the sequence length and of the correlations between the context table’s columns and the linked table’s columns.

To look up the sequence length distribution, go to the Univariate distributions tab and type Sequence length in the search bar. The chart will appear automatically.

Linked table univariate distributions

To look up the correlations between context table columns and linked table columns, go to the Bivariate distributions tab, type context in the search bar, and select the context table columns whose distributions you want to see.

Linked table bivariate distribution

Downloading the QA report

You can also download an HTML version of the entire QA report.
To do so, go to the job details at the top, click on the kebab icon on the right, and select Download QA report from the menu.


Downloading your synthetic data

The following buttons appear in the job summary once MOSTLY AI has finished the synthetization job:

  • View QA report in the Analysis pane

  • Download synthetic data and Generate more data in the Generating pane.

Buttons overview

  1. First, click on View QA report to learn about the privacy and accuracy of the resulting synthetic dataset.

  2. Next, click on Download synthetic data. Pressing this button will download the synthetic data to your computer.

    The synthetic dataset will be in the same format as the original, either CSV or Parquet. If the synthetic data is in Parquet format, it may be partitioned into different files. This is a feature of the Parquet format that will help with efficient processing in downstream tasks.

Please read the QA report and check whether the resulting synthetic dataset passed all privacy checks before sharing it.
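If the download is a Parquet dataset that is partitioned into several files, the part files need to be collected before you can work with the table as a whole. The following is a minimal sketch, assuming the files were saved or extracted into a local folder. The helper names list_parquet_parts and load_synthetic_table are hypothetical and not part of MOSTLY AI, and the loading step assumes that pandas and a Parquet engine such as pyarrow are installed.

```python
from pathlib import Path


def list_parquet_parts(folder):
    """Return all Parquet part files below `folder`, sorted by path."""
    return sorted(Path(folder).rglob("*.parquet"))


def load_synthetic_table(folder):
    """Concatenate all part files into a single DataFrame.

    Assumption: pandas and a Parquet engine (for example pyarrow)
    are installed in the environment.
    """
    import pandas as pd  # imported lazily so that listing works without pandas

    parts = list_parquet_parts(folder)
    if not parts:
        raise FileNotFoundError(f"no Parquet files found in {folder}")
    frames = [pd.read_parquet(part) for part in parts]
    return pd.concat(frames, ignore_index=True)
```

Calling load_synthetic_table on the download folder would then return one table regardless of how many part files it contains.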

Sharing your job

You can share your job, including its synthetic data and the QA report, with other user groups. The sharing options also let you grant read access to all authenticated users or transfer the job’s ownership to another user.

To do so, go to the job details at the top, click on the kebab icon and select Sharing options from the menu.

Sharing options kebab menu

A dialog box appears where you can select the groups you want to share the job with, grant read access to all authenticated users, or transfer the job’s ownership to another user:

Sharing options kebab menu

After clicking Save, you’ll be asked to confirm your choices. A transfer of ownership or change of groups may cause you to lose access to this job. Please review whether this is the case and, if so, whether it’s intended.

Remediating datasets that failed the privacy checks

If the dataset failed one or more privacy checks, then you can try the following remedial actions:

  • In your run’s general settings, if you specified the number of training subjects in your original job, then you can try to increase this value.

  • When configuring your tables, you can:

    • switch the Optimize for setting from Accuracy to Speed.

    • reduce the number of training epochs by specifying a lower number than the number of epochs used for training the synthetization model.

Remediating columns with poor synthetic data accuracy

The QA report’s accuracy chart compares the synthetic and target data, showing the similarity of the univariate and bivariate distributions. The teal-colored cells represent well-kept distributions (accuracy of 90% or more), whereas light-grey colored cells show a systematic difference between synthetic and target data.
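The 90% threshold can also be applied outside the chart, for example to keep a worklist of columns to revisit. The sketch below assumes that you noted the accuracy values by hand from the report. The function name, the dictionary layout, and the sample values are hypothetical and do not come from a real job.

```python
WELL_KEPT_THRESHOLD = 0.90  # "accuracy of 90% or more" in the accuracy chart


def columns_needing_attention(accuracy_by_column, threshold=WELL_KEPT_THRESHOLD):
    """Return the names of columns below the threshold, lowest accuracy first."""
    low = [(name, acc) for name, acc in accuracy_by_column.items() if acc < threshold]
    return [name for name, _ in sorted(low, key=lambda item: item[1])]


# Hypothetical values noted by hand from a report; not real results.
noted_accuracy = {"age": 0.97, "signup_date": 0.71, "postal_code": 0.88}
print(columns_needing_attention(noted_accuracy))
```

Columns returned by this helper are the candidates for the encoding type changes described in this section.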

You may improve the accuracy of poorly covered features by selecting a better-suited encoding type or refining their configuration. To do so, you can use the following points as a guide:

  • For columns with numerical values, please consider whether they represent a fixed set of possible values, such as postal codes or clothing sizes, or whether they may vary in the synthetic data, such as weight and height. Use the Categorical or Numerical encoding type, respectively.

  • Date and time values may not have been detected by MOSTLY AI. They could thus have been synthesized using the Categorical instead of the Datetime encoding type. You can verify whether this is the case by looking for rare category protection artifacts, such as the * label, in the bivariate distributions. To mitigate this issue, consider preprocessing these values so that they adhere to the supported datetime formats.
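The datetime preprocessing mentioned in the last point could look like the following minimal sketch. It assumes that the ISO 8601 form YYYY-MM-DD is among the supported datetime formats, which you should confirm in the platform documentation. The list of input formats and the function name to_iso_date are hypothetical and need to be adapted to the values in your data.

```python
from datetime import datetime

# Hypothetical formats found in the original data; adjust to your own columns.
CANDIDATE_FORMATS = ("%d.%m.%Y", "%m/%d/%Y", "%Y%m%d")


def to_iso_date(value, formats=CANDIDATE_FORMATS):
    """Return `value` as YYYY-MM-DD, or None if no candidate format matches."""
    text = value.strip()
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None
```

Returning None for unparseable values keeps them visible, so you can inspect them before rerunning the job instead of silently passing them on.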

Other actions

Further below, you’ll find an Actions section. Here, you can choose to delete the run or generate more data from the trained synthetic data generation model.

Generate more data

Getting support from MOSTLY AI

If you have persistent privacy, security, or other data quality issues, or your tables keep failing despite your system administrator’s remedial actions, you’re more than welcome to raise the issue with your MOSTLY AI account manager.

To do so, describe the issue and attach the logs and settings of the poorly performing job to your message. You can download these documents by following the steps below:

  1. Click on Jobs in the left side main menu and scroll down to the Previous jobs section.

  2. Look up your job and click on the three-dot menu icon on the rightmost side of the row.

  3. Next, click on Download job settings and then on Download job logs. The documents will download to the default download location of your browser.

    Download job logs and settings