To generate a synthetic database from a data catalog, click on
Data catalog in the left side menu and select the data catalog you want to synthesize.
The same job settings tabs appear as when the data catalog was created. Let’s walk through the steps below to launch your job.
Click on the
Settings tab to optionally adjust the number of training and generated subjects for each of the database’s subject tables and select a destination for the synthetic data.
Your job won’t start if the destination already contains the tables you want to synthesize.
These tables need to be manually removed to proceed with this destination.
If you want to review the job settings, click on the
Table details tab and browse the settings.
However, you cannot make any changes to them here. To edit the settings, select
Data catalogs from the left side main menu, and select the data catalog you want to edit.
Click on Launch job to start the synthetic data generation job.
A page appears that informs you about the status of your job.
Once MOSTLY AI completes the synthetization job, the selected data destination will contain the synthetic version of your database.
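Because a job won’t start if the destination already contains the tables you want to synthesize, it can help to check for name clashes before launching. The following is a minimal sketch of such a check, not part of MOSTLY AI: the function name and the table names are hypothetical, and the case-insensitive comparison is an assumption that may not match how your destination database treats identifiers.

```python
def find_conflicting_tables(tables_to_synthesize, destination_tables):
    """Return the tables to synthesize that already exist in the destination."""
    # Assumption: table names are compared case-insensitively.
    existing = {name.casefold() for name in destination_tables}
    return sorted(t for t in tables_to_synthesize if t.casefold() in existing)


# Hypothetical table names, for illustration only.
to_synthesize = ["customers", "orders", "payments"]
in_destination = ["Orders", "audit_log"]

conflicts = find_conflicting_tables(to_synthesize, in_destination)
if conflicts:
    print("Remove these tables from the destination first:", ", ".join(conflicts))
else:
    print("No name clashes found.")
```

How you obtain the list of tables in the destination depends on the database or storage system you use.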
There are two sections on this page that inform you about the synthetic data generation process:
The top section provides a summary of the run configuration with the following details:
There are four job types:
Ad hoc synthesizes a dataset uploaded using the web UI.
Data catalog synthesizes a database or dataset stored in a cloud bucket or local server.
Generate with subject count creates a specified number of new synthetic subjects from a previous job’s readily trained AI model.
Generate with seed creates a linked table for an uploaded subject table using a previous job’s readily trained AI model.
This field indicates when the original dataset was uploaded.
This field shows the number of completed tasks out of the total number of tasks in this synthetization job.
Job summary informs you about the synthetization tasks that are currently being performed. The table below provides an overview of all the tasks and steps that you will see in this summary.
Task Step Description
Retrieves the table from the database.
Ensures that very large tables can be processed regardless of system memory size.
The table is analyzed for its data types and unique values and transformed for efficient processing.
Using generative neural networks, a model is trained to retain your dataset’s granularity, statistical correlations, structures, and time-dependencies.
Loading parent table
To synthesize a linked table, its parent table needs to be loaded to memory.
Generating synthetic data
The resulting AI model is used to create a synthetic version of the table.
Analyzing synthetic data for quality and accuracy
The resulting synthetic table is tested against the original for accuracy and privacy. It checks for identical information matches and whether the synthetic subjects are dissimilar enough to the original subjects to prevent re-identification.
Storing table locally
Retrieves the reference table from the database, so that it can be copied to the destination once synthetic data generation is completed.
Assigning foreign keys
Assigning foreign keys using Smart Select
Maintains the referential integrity of the original database in this synthetic version.
Exporting to destination
Exports the resulting synthetic database to the destination.
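The Assigning foreign keys step is described as maintaining the referential integrity of the original database. If you want to verify this yourself on the exported data, a check along the following lines could be used. This is only an illustrative sketch: it does not show how MOSTLY AI assigns keys, and the function, table, and column names are hypothetical.

```python
def find_orphaned_keys(parent_rows, child_rows, primary_key, foreign_key):
    """Return foreign key values in child_rows that have no matching parent row."""
    parent_keys = {row[primary_key] for row in parent_rows}
    orphans = {row[foreign_key] for row in child_rows
               if row[foreign_key] not in parent_keys}
    return sorted(orphans)


# Hypothetical synthetic tables, represented as lists of dictionaries.
customers = [{"id": 1}, {"id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},  # refers to a customer that does not exist
]

orphans = find_orphaned_keys(customers, orders, "id", "customer_id")
print("Orphaned foreign key values:", orphans)
```

An empty result means every row in the linked table points to an existing subject in its parent table.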
The synthetic data’s QA report appears once the synthetization job is completed. It consists of an executive summary and the report’s detailed privacy and accuracy charts, which are accessible by clicking on
View full report at the top-right of the summary.
Use the Quality Assurance Report for .. dropdown menu to select the table whose executive summary you want to see. The summary provides general information about the table’s synthetic data accuracy, whether it passed the privacy tests, the number and type of columns, and the number of generated subjects.
The Overall accuracy and
Privacy tests checkmarks indicate whether the synthetization was successful and whether you can share the synthetic data across your business and partnerships.
|Please check whether all subject tables passed the privacy tests before sharing the synthetic data.|
The privacy and accuracy charts are organized by the following tabs:
You can learn more about specific data points by hovering your mouse pointer over a chart. Each of the univariate and bivariate charts also has an
expand button at its top-right corner, which you can click to see a larger version of the chart.
|You can find a detailed explanation of our privacy and accuracy metrics in the Reading the QA report guide.|
In the View QA report for .. dropdown menu, you can see that the QA reports for linked tables are shown as
<linked table> with
Clicking on it shows the executive summary of the linked table.
The linked table’s full report contains some charts in the
Univariate distributions and
Bivariate distributions tabs that can help you assess the accuracy of the sequence length and the correlations between the context table’s columns and the linked table’s columns.
To look up the sequence length distribution, go to the
Univariate distributions tab and type
Sequence length in the search bar. The chart will appear automatically.
To look up the context table-linked table column correlations, go to the
Bivariate distributions tab and type
context in the search bar, and select the context table columns you want to see the distributions of.
You can share your job, including its synthetic data and the QA report, with other user groups. The sharing options also let you grant read access to all authenticated users or transfer the job’s ownership to another user.
To do so, go to the job details at the top, click on the
kebab icon and select
Sharing options from the menu.
A dialog box appears where you can select the groups you want to share the job with, grant read access to all authenticated users, or transfer the job’s ownership to another user:
After you click Save, you’ll be asked to confirm your choices. A transfer of ownership or change of groups may cause you to lose access to this job. Please review whether this is the case and, if so, whether it’s intended.
If a table failed the privacy tests, then you can try the following remedial actions:
If you specified the number of training subjects for this table, then you can try to increase this value.
In the table’s training parameters, you can:
Optimize for setting from
reduce the number of training epochs by specifying a lower number than the number of epochs used for training the synthetization model.
The QA report’s accuracy chart compares the synthetic and target data, showing the similarity of the univariate and bivariate distributions. The teal-colored cells represent well-kept distributions (accuracy of 90% or more), whereas light-grey colored cells show a systematic difference between synthetic and target data.
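To make the 90% threshold more tangible, the sketch below scores a single categorical column and labels it the way the chart colors its cells. The text above does not state how the accuracy percentage is computed, so the formula used here (one minus the total variation distance between the category frequencies) is an assumption for illustration only, and the function names and sample values are hypothetical.

```python
from collections import Counter


def univariate_accuracy(target_values, synthetic_values):
    """Assumed metric: 1 minus the total variation distance of category shares."""
    if not target_values or not synthetic_values:
        raise ValueError("both columns must contain at least one value")
    target_freq = Counter(target_values)
    synthetic_freq = Counter(synthetic_values)
    categories = set(target_freq) | set(synthetic_freq)
    distance = 0.5 * sum(
        abs(target_freq[c] / len(target_values)
            - synthetic_freq[c] / len(synthetic_values))
        for c in categories
    )
    return 1.0 - distance


def classify(accuracy, threshold=0.9):
    """Label a cell as described above: 90% or more counts as well kept."""
    return "well kept" if accuracy >= threshold else "systematic difference"


# Hypothetical clothing-size columns.
target = ["S", "S", "M", "M", "M", "L", "L", "L", "L", "XL"]
good_synthetic = ["L", "M", "S", "L", "XL", "M", "L", "S", "M", "L"]
poor_synthetic = ["S"] * 5 + ["M"] * 5

good_score = univariate_accuracy(target, good_synthetic)
poor_score = univariate_accuracy(target, poor_synthetic)
print(classify(good_score), classify(poor_score))
```

The first synthetic column reproduces the target shares exactly, while the second drops two categories entirely, which is the kind of gap the light-grey cells point to.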
You may improve the accuracy of poorly covered features by selecting a better-suited encoding type or refining their configuration. To do so, you can use the following points as a guide:
For columns with numerical values, please consider whether they represent a fixed set of possible values, such as postal codes or clothing sizes, or whether they may vary in the synthetic data, such as weight and height. Use the Categorical or Numerical encoding type, respectively.
Date and time values may not have been detected by MOSTLY AI. They could thus have been synthesized using the
Categorical instead of the
Datetime encoding type. You can verify whether this is the case by looking for rare category protection artifacts, such as the
* label in the bivariate distributions. To mitigate this issue, consider preprocessing these values so that they adhere to the supported datetime formats.
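The datetime preprocessing mentioned above can be sketched as follows. The input formats and the target format are assumptions chosen for illustration; check the list of supported datetime formats before settling on a target, and replace the hypothetical input formats with those that actually occur in your data.

```python
from datetime import datetime

# Assumed formats occurring in the raw data (hypothetical examples).
INPUT_FORMATS = ["%d/%m/%Y %H:%M", "%d.%m.%Y", "%Y%m%d"]
# Assumed target format; verify it against the supported datetime formats.
TARGET_FORMAT = "%Y-%m-%d %H:%M:%S"


def normalize_datetime(value):
    """Return value rewritten in TARGET_FORMAT, or None if no input format matches."""
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime(TARGET_FORMAT)
        except ValueError:
            continue  # try the next candidate format
    return None


raw = ["31/12/2021 23:59", "01.02.2020", "20190715", "not a date"]
cleaned = [normalize_datetime(v) for v in raw]
print(cleaned)
```

Values that come back as None did not match any of the listed formats and need to be reviewed manually before the data is synthesized.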
If privacy or other data quality issues persist, or your tables keep failing despite remedial actions by your system administrator, please raise the issue with your MOSTLY AI account manager.
To do so, describe the issue and attach the logs and settings of the poorly performing job to your message. You can download these documents by following the steps below:
Click on Jobs in the left side main menu and scroll down to the
Look up your job and click on the three-dot menu icon on the rightmost side of the row.
Next, click on
Download job settings and then on
Download job logs. The documents will download to the default download location of your browser.