To generate synthetic data from a data catalog, click Jobs in the left-side menu, choose Data catalog, and select the data catalog you want to synthesize.
The same job settings tabs appear as when the data catalog was created. Let’s walk through the steps below to launch your job.
Click on the Settings tab to optionally adjust the number of training and generated subjects and select a destination for the synthetic data.
If you want to review the job settings, click on the Column details tab and browse the settings. However, you cannot make any changes here. To edit them, select Data catalogs from the left-side main menu and select the data catalog you want to edit.
Click Launch job to start the synthetic data generation job.
A page appears that informs you about the status of your job.
Once MOSTLY AI completes the synthetization job, the selected data destination will contain the synthetic version of your dataset.
There are two sections on this page that inform you about the synthetic data generation process:
The top section provides general information:
The name of the job as it was specified during its configuration.
There are four job types:
Ad hoc synthesizes a dataset uploaded using the web UI.
Data catalog synthesizes a database or a dataset stored in a cloud bucket or on a local server.
Generate with subject count creates a specified number of new synthetic subjects from a previous job's already-trained AI model.
Generate with seed creates a linked table for an uploaded subject table using a previous job's already-trained AI model.
This field indicates when the original dataset was uploaded.
The Job summary section tells you which stage the synthesization process is in.
Your dataset goes through the following six stages before you can download the synthetic version:
MOSTLY AI received your dataset and run configuration.
Compute resources are being allocated to your run.
Your dataset is analyzed for its data types and unique values and transformed for efficient processing.
Using generative neural networks, a model is trained to retain your dataset’s granularity, statistical correlations, structures, and time-dependencies.
Without having access to your dataset, MOSTLY AI uses the resulting model to create a synthetic version of your dataset.
The resulting synthetic copy is tested against the original data for accuracy and privacy. This stage checks for identical information matches and whether the synthetic subjects are dissimilar enough from the original subjects to prevent re-identification. MOSTLY AI discards the original dataset once this stage is completed.
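The two checks described above can be pictured with a simplified sketch. This is not MOSTLY AI's actual implementation; the function names, the exact-match criterion, and the field-difference proxy below are illustrative assumptions:

```python
# Illustrative sketch of two privacy checks a QA stage might run.
# NOTE: these criteria are simplified assumptions for explanation only,
# not MOSTLY AI's actual tests.

def has_identical_matches(original, synthetic):
    """Flag synthetic records that exactly replicate an original record."""
    original_set = {tuple(row) for row in original}
    return any(tuple(row) in original_set for row in synthetic)

def min_field_distance(record, originals):
    """Smallest number of differing fields between a synthetic record and
    any original record -- a crude proxy for re-identification risk."""
    return min(sum(a != b for a, b in zip(record, orig)) for orig in originals)

original = [("alice", 34, "NY"), ("bob", 41, "TX")]
synthetic = [("carol", 29, "NY"), ("dave", 52, "CA")]

assert not has_identical_matches(original, synthetic)
assert min_field_distance(synthetic[0], original) >= 1  # dissimilar enough
```

A real test suite would also account for quasi-identifiers and near-matches rather than only exact field equality.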
The synthetic data’s QA report appears once the synthetization job is completed. It consists of an executive summary and the report’s detailed privacy and accuracy charts, which are accessible by clicking on
View full report at the top-right of the summary.
The executive summary provides general information about the synthetic dataset’s accuracy, whether it passed the privacy tests, the number and type of columns, and the number of generated subjects.
Checkmarks next to the privacy tests indicate that the synthetization was successful and that you can share the synthetic data across your business and with your partners.
Note: Synthetic datasets that failed the privacy tests are not privacy-safe. If you have a failed dataset, rerun your job using the remediation steps below.
The privacy and accuracy charts are organized by the following tabs:
You can learn more about specific data points by hovering your mouse pointer over a chart. Each univariate and bivariate chart also has an expand button at its top-right corner, which you can click to see a larger version.
Note: You can find a detailed explanation of our privacy and accuracy metrics in the Reading the QA report guide.
If your dataset contains a linked table, you’ll find some charts in the
Univariate distributions tab that depict its accuracy. You can recognize them as follows:
This chart depicts the distribution of linked table rows per subject.
Linked table column charts:
These charts have a
You can also download a PDF version of the QA report. There are two options to do so:
In the job details at the top, click on the kebab icon and select Download QA report from the menu.
In the Job summary, scroll down to the Analyzing section and click on the Download QA report button.
If the dataset failed one or more privacy checks, then you can try the following remedial actions:
In your job's general settings, if you specified the number of training subjects in your original job, try increasing this value.
When configuring your tables, you can:
Change the Optimize for setting from
Reduce the number of training epochs by specifying a lower value than the number used to train the synthetization model.
Super Admins can reduce the number of network parameters by reducing the number of regressor, history, and context units in the global job settings.
The QA report's accuracy chart compares the synthetic and target data, showing the similarity of the univariate and bivariate distributions. The teal-colored cells represent well-kept distributions (accuracy of 90% or more), whereas light-grey cells show a systematic difference between the synthetic and target data.
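As a rough intuition for what such a per-column accuracy score captures, here is a minimal sketch that scores similarity as 1 minus the total variation distance between the two empirical distributions. The formula is an illustrative assumption, not MOSTLY AI's exact metric (see the Reading the QA report guide for that):

```python
from collections import Counter

def univariate_accuracy(target_col, synthetic_col):
    """Illustrative accuracy score: 1 minus the total variation distance
    between the two columns' empirical distributions. A score near 1.0
    means a well-kept distribution; lower scores indicate a systematic
    difference. (Assumption for illustration, not the product's metric.)"""
    p, q = Counter(target_col), Counter(synthetic_col)
    n_p, n_q = len(target_col), len(synthetic_col)
    categories = set(p) | set(q)
    tvd = 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)
    return 1.0 - tvd

# A slightly shifted distribution still scores close to 1.0:
score = univariate_accuracy(["a"] * 90 + ["b"] * 10, ["a"] * 88 + ["b"] * 12)
print(round(score, 2))  # approximately 0.98
```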
You may improve the accuracy of poorly covered features by selecting a better-suited encoding type or refining their configuration. To do so, you can use the following points as a guide:
For columns with numerical values, please consider whether they represent a fixed set of possible values, such as postal codes or clothing sizes, or whether they may vary in the synthetic data, such as weight and height. Use the Categorical or Numerical encoding type, respectively.
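A rough heuristic for that decision can be sketched as follows. The 5% uniqueness cutoff is an arbitrary illustrative assumption; as noted above, the real deciding factor is whether the column represents a fixed set of values or a freely varying measure:

```python
def suggest_encoding(values, uniqueness_cutoff=0.05):
    """Suggest 'Categorical' for numeric columns drawn from a small fixed
    set (postal codes, clothing sizes) and 'Numerical' for freely varying
    measures (weight, height). Illustrative heuristic only; the cutoff
    is an assumption, not a MOSTLY AI rule."""
    unique_ratio = len(set(values)) / len(values)
    return "Categorical" if unique_ratio <= uniqueness_cutoff else "Numerical"

postal_codes = [1010, 1020, 1030] * 100          # few distinct values
weights = [60.0 + i * 0.37 for i in range(300)]  # many distinct values
assert suggest_encoding(postal_codes) == "Categorical"
assert suggest_encoding(weights) == "Numerical"
```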
Date and time values may not have been detected by MOSTLY AI. They could thus have been synthesized using the Categorical instead of the Datetime encoding type. You can verify whether this is the case by looking for rare category protection artifacts, such as the * label, in the bivariate distributions. To mitigate this issue, consider preprocessing these values so that they adhere to the supported datetime formats.
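Such preprocessing might look like the sketch below, which normalizes mixed date strings to ISO 8601 before upload. The candidate input formats and the assumption that ISO 8601 is among the supported datetime formats are illustrative; check the formats supported by your deployment:

```python
from datetime import datetime

# Candidate input formats assumed to appear in the raw data
# (an illustrative list -- extend it for your own dataset).
CANDIDATE_FORMATS = ["%d/%m/%Y", "%m-%d-%Y", "%Y.%m.%d", "%Y-%m-%d"]

def normalize_date(value):
    """Try each candidate format and re-emit the value as ISO 8601
    (YYYY-MM-DD) so it can be detected as a datetime, not a category."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return value  # leave unparseable values untouched for manual review

print(normalize_date("31/12/2023"))  # -> 2023-12-31
```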
If the original data has a low number of subjects, consider adjusting the rare category protection accordingly. However, please be aware that setting the threshold lower than 20 may introduce privacy risks.
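Rare category protection itself can be pictured with this simplified sketch. The threshold of 20 echoes the guidance above, but the replacement label and the suppression logic are illustrative assumptions, not MOSTLY AI's actual algorithm:

```python
from collections import Counter

def protect_rare_categories(values, threshold=20):
    """Replace categories occurring fewer than `threshold` times with a
    generic '*' label, so rare (potentially identifying) values never
    appear verbatim in the output. Illustrative sketch only."""
    counts = Counter(values)
    return ["*" if counts[v] < threshold else v for v in values]

data = ["red"] * 30 + ["blue"] * 25 + ["chartreuse"] * 2
protected = protect_rare_categories(data)
assert "chartreuse" not in protected  # rare value suppressed
assert protected.count("*") == 2
```

This also shows why lowering the threshold trades privacy for fidelity: with a smaller threshold, rarer (and more identifying) values survive into the output.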
If you have persistent privacy, security, or other data quality issues, or your tables keep failing despite your system administrator's remedial actions, you're welcome to raise the issue with your MOSTLY AI account manager.
To do so, describe the issue and attach the logs and settings of the poorly performing job. You can download these documents by following the steps below:
Click Jobs in the left-side main menu and scroll down to the
Look up your job and click on the three-dot menu icon on the rightmost side of the row.
Next, click on Download job settings and then on Download job logs. The documents will download to the default download location of your browser.