You can use this guide as a reference when configuring a catalog for your database.
To synthesize a database, you need a data source and a data destination, each with its own data connector. Create both connectors before starting this guide by following the Connect to your data guide.
It takes about 30 minutes to complete this guide.
Afterwards, you’ll be ready to share your synthetic data across your business and with your partners.
Select a database connector
When creating a new data catalog, you’ll need to specify which data source it’s for.
Click on the Database tab in the left side menu, select an available database, and click Proceed.
Select, classify, and rank tables
At this point, MOSTLY AI has analyzed the database’s schema and will walk you through a four-step wizard. In it, you specify which tables and views you want to include in the synthetic database and which of them are subject tables, i.e., tables that contain the data subjects whose privacy you want to protect.
Select tables
Select the tables and views you want to include in the synthetic database. Tick the checkbox next to Tables and views to select all tables. You can use the search box at the top to filter the list for specific tables.
Select subject tables
To privacy-secure the database, classify the tables that contain profile information about your data subjects as subject tables. This will privacy-protect them as well as their linked tables, which are tables that contain references to these subjects.
To do so, tick the checkboxes of the tables you want to classify as subject tables, and click Next.
Rank the subject tables
Next, rank the subject tables by their order of importance in the database.
MOSTLY AI will use this ranking to determine how referential integrity will be maintained when a linked table references multiple subject tables in the database. The relationship to the subject table that is ranked higher will take precedence over those further down the list, resulting in a more accurate rendition of that relationship.
Drag the tables up or down to rank them.
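The precedence rule described above can be pictured with a small sketch. It is an illustration only, not MOSTLY AI's implementation: the function name `primary_subject` and the table names are hypothetical, and the sketch assumes the ranking is simply an ordered list with the most important subject table first.

```python
def primary_subject(referenced_subjects, ranking):
    """Return the referenced subject table that ranks highest.

    ranking is an ordered list, most important subject table first.
    Returns None if none of the referenced tables is ranked.
    """
    # Map each ranked table to its position (0 = highest rank).
    position = {name: index for index, name in enumerate(ranking)}
    candidates = [name for name in referenced_subjects if name in position]
    if not candidates:
        return None
    # The smallest position wins, i.e. the table highest in the list.
    return min(candidates, key=position.__getitem__)


# Hypothetical example: a linked table references both subject tables.
ranking = ["customers", "merchants"]
winner = primary_subject({"merchants", "customers"}, ranking)
print(winner)
```

In this example, the relationship to `customers` would take precedence because it sits above `merchants` in the ranking.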
Confirm your choices
Review the tables you selected and classified as subject tables. The non-classified tables will automatically be classified as linked or reference tables once you click Proceed.
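As a rough mental model of the automatic classification, the sketch below labels tables from a set of foreign key relationships. It rests on two assumptions that this guide does not confirm: that a table counts as linked if it references a subject table either directly or through another linked table, and that every remaining table is a reference table. The function name and table names are hypothetical.

```python
from collections import deque


def classify_tables(tables, foreign_keys, subject_tables):
    """Label each table as 'subject', 'linked', or 'reference'.

    foreign_keys maps a table name to the set of tables it references.
    """
    subjects = set(subject_tables)
    labels = {name: "subject" for name in subjects}
    # Breadth-first walk: a table is 'linked' if it references a subject
    # table or another linked table (transitivity is an assumption).
    queue = deque(subjects)
    while queue:
        parent = queue.popleft()
        for table in tables:
            if table in labels:
                continue
            if parent in foreign_keys.get(table, set()):
                labels[table] = "linked"
                queue.append(table)
    # Everything left over is treated as a reference table (assumption).
    for table in tables:
        labels.setdefault(table, "reference")
    return labels


# Hypothetical schema with one subject table.
tables = ["customers", "orders", "order_items", "countries"]
foreign_keys = {
    "customers": {"countries"},
    "orders": {"customers"},
    "order_items": {"orders"},
}
labels = classify_tables(tables, foreign_keys, ["customers"])
print(labels)
```

Here `orders` and `order_items` end up as linked tables, while `countries`, which is only referenced, ends up as a reference table.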
Relationships
You can use this tab to review your database’s relationships, or to specify them if the database has none or only partially defined ones.
Let’s take a look at the options that are available to manage the relationships:
Table list
This list shows all the tables that will appear in your synthetic data.
They’re sorted by table type: the subject tables are at the top, the linked tables in the middle, and the reference tables at the bottom.
Click on a table to open the relationship drawer and edit its primary and foreign keys.
Referenced tables
This part shows which tables are referenced by the tables in the table list.
Clicking on the row opens the relationship drawer of the table in the table list.
Filter
Filter the relationships view by subject tables, linked tables, reference tables, or tables without relationships.
Add, modify, or delete relationships
Hovering over a row reveals the following options:
You can use the bulk editor to configure multiple columns at once.
Tick the checkboxes of the columns you want to configure and use the settings fields in the top row to adjust them.
Training settings
Use these settings to specify whether AI model training needs to be done quickly or accurately.
You can also optimize training performance if the results of an earlier job were not of the desired accuracy or took too long to generate.
Let’s take a look at the options on this page:
Table list
This list shows all the tables that will appear in your synthetic data.
Click on a table to view or modify its training settings.
Training settings
The following training settings are available:
Training goal
Select Accuracy to achieve the highest attainable synthetic data accuracy, or Speed to produce accurate synthetic data with significantly shorter training times.
Maximum epochs
This setting allows you to limit the number of epochs to, for instance, 2, 5, or 10. This can significantly reduce training time, but comes at the cost of accuracy.
Model size
Adjust the model size if the synthesization job runs into memory issues, takes too long to complete, or produces synthetic data with less than the desired accuracy. Smaller sizes require less memory, run faster, and reduce synthetic data accuracy, whereas bigger sizes increase accuracy, require more memory, and take more time to complete.
Batch size
Batch size refers to the number of records used for each training step. Selecting a larger batch size can speed up training, but consumes more memory and can decrease accuracy.
If you get out-of-memory errors during training, you can try to resolve them by decreasing the batch size.
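To see how batch size and maximum epochs interact, the following sketch works through the arithmetic. It assumes that one epoch is a single pass over all training records and that each training step consumes one batch; the record count and setting values are made up for illustration, and the total is an upper bound because the epoch setting is a limit.

```python
import math


def training_steps(num_records, batch_size, max_epochs):
    """Return (steps per epoch, upper bound on total training steps)."""
    # The last batch of an epoch may be only partially filled,
    # hence the ceiling.
    steps_per_epoch = math.ceil(num_records / batch_size)
    return steps_per_epoch, steps_per_epoch * max_epochs


# Hypothetical table with 100,000 records, limited to 10 epochs.
small = training_steps(100_000, 256, 10)
large = training_steps(100_000, 1024, 10)
print(small)  # (391, 3910)
print(large)  # (98, 980)
```

Quadrupling the batch size cuts the number of steps to roughly a quarter, which is why larger batches can shorten training, while each step holds four times as many records in memory.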
Click to open the bulk editor.
You can use the bulk editor to configure multiple tables at once.
Tick the checkboxes of the tables you want to configure and use the settings fields in the top row to adjust them.
Click to improve synthetic data accuracy of databases.
To maintain referential integrity, MOSTLY AI needs to match the entries of the referenced and referring tables in Smart Select relationships. By default, these are linked at random: the foreign key column is populated with IDs drawn randomly from the primary key.
You can change this behavior by designating one or more columns of the parent table in a relationship as Smart Select columns. MOSTLY AI can then use these attributes to find appropriate matches with the entries in the referring table of a relationship. This will result in a more accurate rendering of these relationships in the synthetic database.
Click and select a suitable column from the drop-down menu.
Drag to rank the columns by importance.
Click to complete the configuration. Your selections will be applied to the Smart Select foreign keys of the referring tables.
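The difference between the default random linking and attribute-based matching can be sketched as follows. This guide does not describe how MOSTLY AI performs the matching, so the equality match in `link_by_attribute` is only a stand-in chosen for illustration. All function names, column names, and rows are hypothetical.

```python
import random


def link_randomly(child_rows, parent_ids, seed=0):
    """Default behavior: draw each foreign key at random from the primary key."""
    rng = random.Random(seed)
    return [dict(row, parent_id=rng.choice(parent_ids)) for row in child_rows]


def link_by_attribute(child_rows, parents, column, seed=0):
    """Stand-in for attribute-based matching: prefer parents whose `column`
    value equals the child's value, falling back to a random parent."""
    rng = random.Random(seed)
    all_ids = [parent["id"] for parent in parents]
    linked = []
    for row in child_rows:
        matches = [p["id"] for p in parents if p[column] == row.get(column)]
        linked.append(dict(row, parent_id=rng.choice(matches or all_ids)))
    return linked


# Hypothetical parent and child rows.
parents = [{"id": 1, "country": "AT"}, {"id": 2, "country": "DE"}]
children = [{"country": "DE"}, {"country": "AT"}, {"country": "FR"}]

random_links = link_randomly(children, [p["id"] for p in parents])
matched_links = link_by_attribute(children, parents, "country")
print(matched_links)
```

Both variants keep referential integrity, since every foreign key points at an existing parent ID. Only the second uses a parent attribute to decide which parent a child row is attached to.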
Output
Optionally specify the size of the synthetic data
Specifying the number of generated subjects will determine the size of the synthetic data.
If you leave these fields blank, MOSTLY AI will use the same number of subjects during training and generation as the original dataset.
Click to switch to the job settings.
The job settings contain the same settings you configured in the catalog.
Additionally, in the output settings, you can choose a destination for your synthetic data.
You can always download the synthetic data as CSV or Parquet files.
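If you download the synthetic data as CSV files, a quick check like the one below can confirm that every foreign key still points at an existing primary key. This is a minimal sketch using only Python's standard library. The table and column names are hypothetical, and the in-memory sample data stands in for downloaded files.

```python
import csv
import io


def orphaned_keys(parent_csv, child_csv, primary_key, foreign_key):
    """Return foreign key values in the child file that have no matching
    primary key in the parent file."""
    parent_ids = {row[primary_key] for row in csv.DictReader(parent_csv)}
    return [
        row[foreign_key]
        for row in csv.DictReader(child_csv)
        if row[foreign_key] not in parent_ids
    ]


# Hypothetical sample data in place of downloaded files.
customers = io.StringIO("id,name\n1,A\n2,B\n")
orders = io.StringIO("order_id,customer_id\n10,1\n11,2\n12,2\n")

orphans = orphaned_keys(customers, orders, "id", "customer_id")
print(orphans)  # [] means every order references an existing customer
```

With real downloads, you would pass file objects opened with `open(path, newline="")` instead of the `io.StringIO` samples.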