You can use this guide as a reference when configuring a catalog for your data that is stored on cloud buckets.
To create a catalog for your dataset, you’ll need to have a cloud storage bucket that contains the data you want to synthesize. Follow the Connect to your data guide to connect your bucket to MOSTLY AI.
This guide takes about 30 minutes to complete.
Afterward, you’ll be ready to share your synthetic data across your business and with your partners.
Select a data source
When creating a new data catalog, you’ll need to specify which data source it’s for.
Choose On the server to create a catalog for data that’s stored on MOSTLY AI’s server.
Or choose Cloud storage, select a connector, and click Proceed.
Specify the location of your tables
MOSTLY AI will now ask you to specify the location of your tables.
In the Subject table field, enter the directory where the subject table is stored.
You can also specify the location of a linked table. This allows you to process lists, sequential data, or time-series data, such as online shopping carts, buyer journeys, purchase histories, or financial transactions.
Use the Alias fields to optionally change the table names.
Relationships
If your data contains one or more subject tables and one or more linked tables, you can use this tab to specify how they’re linked.
When synthesizing uploaded files or a file-based catalog, MOSTLY AI will automatically link them if the subject table contains a column called id and the first column of the linked table contains _id in its name (for instance, players_id).
Please make sure your tables are correctly linked before proceeding.
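Before uploading, you can check locally whether your files follow the naming rule for automatic linking. The following is a minimal sketch, not MOSTLY AI's implementation: the function names and the sample data are hypothetical, and exact, case-sensitive matching of the column names is an assumption.

```python
import csv
import io


def read_header(csv_text):
    """Return the column names from the first row of CSV text."""
    return next(csv.reader(io.StringIO(csv_text)), [])


def can_auto_link(subject_header, linked_header):
    """Check the naming rule: the subject table has a column called
    'id', and the first column of the linked table contains '_id'.
    Case-sensitive matching is an assumption of this sketch."""
    has_id = "id" in subject_header
    first_col_is_fk = bool(linked_header) and "_id" in linked_header[0]
    return has_id and first_col_is_fk


# Hypothetical sample files: a subject table and a linked table.
players_csv = "id,name,country\n1,Ann,AT\n"
seasons_csv = "players_id,year,goals\n1,2020,12\n"

print(can_auto_link(read_header(players_csv), read_header(seasons_csv)))
```

If the check fails, rename the columns in your files or set the primary and foreign keys manually on this tab.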
Let’s take a look at the options that are available to manage the relationships:
Table list
This list shows all the tables that will appear in your synthetic data.
They’re sorted by table type. The subject tables are at the top, the linked tables in the middle, and the reference tables at the bottom.
Click on a table to open the relationship drawer and edit its primary and foreign keys.
Referenced tables
This part shows which tables are referenced by the tables in the table list.
Clicking on the row opens the relationship drawer of the table in the table list.
Filter
Filter the relationships view by subject tables, linked tables, reference tables or tables without relations.
Add, modify, or delete relationships
Hovering over a row reveals the following options:
You can use the bulk editor to configure multiple columns at once.
Tick the checkboxes of the columns you want to configure and use the settings fields in the top row to adjust them.
Training settings
Use these settings to specify whether AI model training needs to be done quickly or accurately.
You can also optimize training performance if the results of an earlier job were not of the desired accuracy or took too long to generate.
Let’s take a look at the options on this page:
Table list
This list shows all the tables that will appear in your synthetic data.
Click on a table to view or modify its training settings.
Training settings
The following training settings are available:
Training goal
Select Accuracy to achieve the highest attainable synthetic data accuracy.
Or select Speed to get accurate synthetic data with significantly shorter training times.
Maximum epochs
This setting allows you to limit the number of epochs to, for instance, 2, 5, or 10. This can significantly reduce training time, but comes at the cost of accuracy.
Model size
Adjust the model size if the synthesization job runs into memory issues, takes too long to complete, or produces synthetic data with less than the desired accuracy. Smaller sizes require less memory, run faster, and reduce synthetic data accuracy, whereas bigger sizes increase accuracy, require more memory, and take more time to complete.
Batch size
Batch size refers to the number of records used for each training step. Selecting a larger batch size can speed up training, but consumes more memory and can decrease accuracy.
If you get out-of-memory errors during training, you can try to resolve them by decreasing the batch size.
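To get a feel for how batch size affects the amount of work per epoch, the relationship can be sketched as follows. This assumes the usual definition that one epoch is a full pass over the training records and each training step consumes one batch. The function name and the record count are hypothetical, and the sketch says nothing about how MOSTLY AI schedules training internally.

```python
import math


def steps_per_epoch(num_records, batch_size):
    """Number of training steps needed to see every record once,
    assuming each step processes one batch of `batch_size` records."""
    if num_records < 0 or batch_size <= 0:
        raise ValueError("num_records must be >= 0 and batch_size > 0")
    return math.ceil(num_records / batch_size)


# Hypothetical table with 10,000 records.
for batch_size in (32, 64, 128, 256):
    print(batch_size, steps_per_epoch(10_000, batch_size))
```

Larger batches mean fewer steps per epoch, but each step holds more records in memory at once, which is why decreasing the batch size can help with memory errors.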
Click to open the bulk editor.
You can use the bulk editor to configure multiple tables at once.
Tick the checkboxes of the tables you want to configure and use the settings fields in the top row to adjust them.
Click to improve synthetic data accuracy of databases.
To maintain referential integrity, MOSTLY AI needs to make matches between the entries of the referenced and referring tables of Smart Select relationships. By default, these are linked randomly: the foreign key column is populated with IDs drawn at random from the primary key.
You can change this behavior by designating one or more columns of the parent table in a relationship as Smart Select columns. MOSTLY AI can then use these attributes to find appropriate matches with the entries in the referring table of a relationship. This will result in a more accurate rendering of these relationships in the synthetic database.
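The default random linking can be pictured with a small sketch. This is an illustration of the idea only, not MOSTLY AI's code; the function name, the seed parameter, and the sample IDs are hypothetical.

```python
import random


def assign_random_foreign_keys(primary_keys, num_referring_rows, seed=None):
    """Fill a foreign key column by drawing IDs at random (with
    replacement) from the referenced table's primary key column."""
    rng = random.Random(seed)
    return [rng.choice(primary_keys) for _ in range(num_referring_rows)]


# Hypothetical referenced table with four primary key values.
customer_ids = [101, 102, 103, 104]

# Six rows in the referring table each receive a randomly drawn ID.
order_customer_ids = assign_random_foreign_keys(customer_ids, 6, seed=42)

# Referential integrity holds: every foreign key exists as a primary key.
print(all(fk in customer_ids for fk in order_customer_ids))
```

Every foreign key is valid, but the draw ignores the attributes of both tables, so any correlation between a referring row and the row it points to is lost. Smart Select columns are meant to give the matching something to work with.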
Click and select a suitable column from the drop-down menu.
Drag to rank the columns by importance.
Click to complete the configuration. The settings will be applied to the Smart Select foreign keys of the referring tables.
Output
Optionally specify the size of the synthetic data
Specifying the number of generated subjects will determine the size of the synthetic data.
If you leave these fields blank, MOSTLY AI will use the same number of subjects during training and generation as the original dataset.
Click to switch to the job settings.
The job settings contain the same settings you configured in the catalog.
Additionally, in the output settings, you can choose a destination for your synthetic data.
You can always download the synthetic data as CSV or Parquet files.