You can use this guide as a reference when configuring a catalog for data that is stored in cloud storage buckets.

To create a catalog for your dataset, you’ll need to have a cloud storage bucket that contains the data you want to synthesize. Follow the Connect to your data guide to connect your bucket to MOSTLY AI.

This guide takes about 30 minutes to complete.
Afterwards, you’ll be ready to share your synthetic data across your business and partnerships.

Select a data source

When creating a new data catalog, you’ll need to specify which data source it’s for.
Choose On the server to create a catalog for data that’s stored on MOSTLY AI’s server.
Or choose Cloud storage, select a connector, and click Proceed.

Select tables

Specify the location of your tables

MOSTLY AI will now ask you to specify the location of your tables.

  1. In the Subject table field, enter the directory where the subject table is stored.

  2. Optionally, specify the location of a linked table. This allows you to process lists, sequential data, or time-series data, such as online shopping carts, buyer journeys, purchase histories, or financial transactions.

  3. Use the Alias fields to optionally change the table names.

Relationships

If your data contains one or more subject tables and one or more linked tables, you can use this tab to specify how they’re linked.

When synthesizing uploaded files or a file-based catalog, MOSTLY AI will automatically link them if the subject table contains a column called id and the first column of the linked table contains _id in its name (for instance, players_id).
Please make sure your tables are correctly linked before proceeding.
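
If you want to check your files against the automatic linking rule before uploading, the rule described above can be sketched in a few lines of Python. This is only an illustration of the naming convention, not MOSTLY AI's actual implementation; the function name `can_auto_link` and the example column names are hypothetical.

```python
def can_auto_link(subject_columns, linked_columns):
    """Check the naming rule: the subject table has a column called 'id'
    and the first column of the linked table contains '_id' in its name."""
    subject_has_id = "id" in subject_columns
    first_linked_is_key = bool(linked_columns) and "_id" in linked_columns[0]
    return subject_has_id and first_linked_is_key


# Hypothetical column lists for a subject table and a linked table.
players = ["id", "name", "country"]
seasons = ["players_id", "year", "goals"]

print(can_auto_link(players, seasons))
```

If the check fails, you can still link the tables manually in the Relationships tab.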

Let’s take a look at the options that are available to manage the relationships:

Relationship manager

(1) Table list

This list shows all the tables that will appear in your synthetic data.
They’re sorted by table type: subject tables at the top, linked tables in the middle, and reference tables at the bottom.

Click on a table to open the relationship drawer and edit its primary and foreign keys.


(2) Referenced tables

This part shows which tables are referenced by the tables in the table list.
Clicking on the row opens the relationship drawer of the table in the table list.


(2) Filter

Filter the relationships view by subject tables, linked tables, reference tables or tables without relations.


(3) Add, modify, or delete relationships

Hovering over a row reveals the following options:

Plus icon

Opens a wizard for adding referenced tables.

Cog icon

Opens the relationship drawer.

Relationship drawer

Primary key and referring tables

  • View, modify or specify the table’s primary key.

  • Click Show referring table to see which tables refer to it.

Foreign keys and referenced tables

  • Clicking add foreign key adds a new row to the list of foreign keys.

  • Use the Foreign key and Referenced table drop-down menus to specify or modify relationships.

Trashcan icon

Removes the relationship.

Arrow icon

Lets you set the primary key of the referenced table.

Data settings

Use this tab to configure how MOSTLY AI processes your database columns during synthesization:

Data settings


(1) Table list

This list shows all the tables that will appear in your synthetic data.
Click on a table to view its columns and modify the column settings.


(2) Column list

This list shows the following details:

Include

Include the column in the synthetic data.

Column name

The name of the column.

Generation method

The way a column will be rendered to the synthetic dataset.


(3) Click the cog icon to open a column’s settings drawer

Generation method

  • The way a column will be rendered to the synthetic dataset.

  • All generation methods are fixed, except for AI-powered generation, which you can change to Mock data.

  • Check out the Generation method overview below to learn more about the available configuration options.

Generation mood

Select the degree to which the synthetic version of the column will adhere to the detected distributions and correlations in the original data.

For a list of granular generation moods, see the Generation mood tutorial.

Smart imputation

If the original column contains missing values, they will be imputed in the synthetic data.

Use this column to sort the table
Only available for linked tables

  • Lets you sort a table by the column of your choice in ascending or descending order.

  • Helps preserve sequential information during the synthesization process.
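
To see what sorting a linked table by a column looks like in practice, here is a minimal Python sketch. It only illustrates the ordering itself, not how MOSTLY AI uses it during training; the function `sort_linked_rows` and the example purchase records are hypothetical.

```python
from operator import itemgetter


def sort_linked_rows(rows, column, descending=False):
    """Return the rows ordered by the chosen column.

    sorted() is stable, so rows with equal values keep their
    original relative order.
    """
    return sorted(rows, key=itemgetter(column), reverse=descending)


# Hypothetical linked table: purchases that belong to players.
purchases = [
    {"players_id": 1, "date": "2023-03-01", "amount": 20},
    {"players_id": 1, "date": "2023-01-15", "amount": 5},
    {"players_id": 2, "date": "2023-02-10", "amount": 12},
]

for row in sort_linked_rows(purchases, "date"):
    print(row["date"], row["players_id"], row["amount"])
```

Sorting by a date or timestamp column like this puts each subject's events in chronological order, which is the kind of sequential information the setting is meant to preserve.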

Generation method overview

Each entry lists the generation method, its behavior, the table roles it applies to, and its configuration options.

AI-powered generation

  • Behavior: Uses the column for AI-powered synthetic data generation.

  • Roles: Subject, Linked

Context foreign key

  • Behavior: Links the entries in this table to their corresponding entries in the subject table.

  • Roles: Linked

  • Configuration options: No available options

Smart Select foreign key

  • Behavior: Links the entries in the synthetic version of this table to their entries in the synthetic version of the referenced table.

  • Roles: Subject, Linked

Reference foreign key

  • Behavior: Links the entries in this table to their corresponding entries in a reference table.

  • Roles: Subject, Linked

  • Configuration options: No available options

Mock data

  • Behavior: Generates random data within the constraints of the configured data type and format.

  • Roles: Subject, Linked

Primary key ID

  • Behavior: Generates new primary key IDs for the synthetic version of the table.

  • Roles: Subject

  • Configuration options: Sequential, UUID, UUID no hyphen, Hash
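
To give a feel for how the four Primary key ID options differ, here is a rough Python sketch. The exact formats MOSTLY AI produces are not specified in this guide, so everything below is an assumption: sequential keys are shown as counting up from 1, UUIDs as random version-4 UUIDs, and the hash as a SHA-256 digest of the row number. The function `make_primary_keys` is hypothetical.

```python
import hashlib
import uuid


def make_primary_keys(n, style="sequential"):
    """Generate n illustrative primary keys in the requested style.

    The formats are assumptions for illustration only.
    """
    keys = []
    for i in range(1, n + 1):
        if style == "sequential":
            keys.append(str(i))                  # 1, 2, 3, ...
        elif style == "uuid":
            keys.append(str(uuid.uuid4()))       # 36 characters, with hyphens
        elif style == "uuid_no_hyphen":
            keys.append(uuid.uuid4().hex)        # 32 hex characters
        elif style == "hash":
            digest = hashlib.sha256(str(i).encode("utf-8")).hexdigest()
            keys.append(digest)                  # 64 hex characters
        else:
            raise ValueError(f"unknown style: {style}")
    return keys


for style in ("sequential", "uuid", "uuid_no_hyphen", "hash"):
    print(style, make_primary_keys(2, style))
```

Whichever option you pick, the new keys replace the original ones, so choose the format that your downstream systems expect.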


(4) Click edit multiple columns to open the bulk editor.

You can use the bulk editor to configure multiple columns at once.
Tick the checkboxes of the columns you want to configure and use the settings fields in the top row to adjust them.

Column bulk editor

Training settings

Use these settings to specify whether AI model training needs to be done quickly or accurately.
You can also optimize training performance if the results of an earlier job were not of the desired accuracy or took too long to generate.

Let’s take a look at the options on this page:


(1) Table list

This list shows all the tables that will appear in your synthetic data.
Click on a table to view or modify its training settings.


(2) Training settings

The following training settings are available:

Training goal

Select Accuracy to achieve the highest attainable synthetic data accuracy, or select Speed to deliver accurate synthetic data with significantly shorter training times.

Maximum epochs

This setting allows you to limit the number of epochs to, for instance, 2, 5, or 10. This can significantly reduce training time, but comes at the cost of accuracy.

Model size

Adjust the model size if the synthesization job runs into memory issues, takes too long to complete, or produces synthetic data with less than the desired accuracy. Smaller sizes require less memory, run faster, and reduce synthetic data accuracy, whereas bigger sizes increase accuracy, require more memory, and take more time to complete.

Batch size

Batch size refers to the number of records used for each training step. Selecting a larger batch size can speed up training, but consumes more memory and can decrease accuracy.

If you get out-of-memory errors during training, try decreasing the batch size.
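
To get a feel for the trade-off, the relationship between batch size and the number of training steps can be worked out with simple arithmetic. The sketch below uses the generic mini-batch formula and assumes each record is used once per epoch; it does not describe MOSTLY AI's internal training loop, and the function `steps_per_epoch` and the record count are hypothetical.

```python
import math


def steps_per_epoch(num_records, batch_size):
    """Number of training steps needed to go through all records once."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(num_records / batch_size)


# Hypothetical table with 10,000 records.
for batch_size in (64, 512):
    print(batch_size, steps_per_epoch(10_000, batch_size))
```

With 10,000 records, a batch size of 64 needs 157 steps per epoch and a batch size of 512 needs 20. Fewer steps usually means faster epochs, but each step has to hold more records in memory at once.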


(3) Click edit multiple tables to open the bulk editor.

You can use the bulk editor to configure multiple tables at once.
Tick the checkboxes of the tables you want to configure and use the settings fields in the top row to adjust them.

Column bulk editor


(4) Click edit smart select to improve synthetic data accuracy of databases.

To maintain referential integrity, MOSTLY AI needs to make matches between the entries of the referenced and referring tables of Smart Select relationships. By default, these are linked randomly: the foreign key column is populated with IDs drawn at random from the primary key.

You can change this behavior by designating one or more columns of the parent table in a relationship as Smart Select columns. MOSTLY AI can then use these attributes to find appropriate matches with the entries in the referring table of a relationship. This will result in a more accurate rendering of these relationships in the synthetic database.

Smart Select drawer
  1. Click add smart select column and select a suitable column from the drop-down menu.

  2. Drag to rank the columns by importance.

  3. Click apply to referring tables to complete the configuration. The selected columns will be applied to the Smart Select foreign keys of the referring tables.
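
For comparison, the default random linking described above can be sketched as follows. This is a simplified illustration under the assumption that each foreign key is drawn uniformly at random from the referenced table's primary keys; it is not MOSTLY AI's implementation, and the function `assign_random_foreign_keys` and the example IDs are hypothetical.

```python
import random


def assign_random_foreign_keys(primary_keys, num_referring_rows, seed=None):
    """Fill a foreign key column with randomly drawn primary key IDs."""
    rng = random.Random(seed)
    return [rng.choice(primary_keys) for _ in range(num_referring_rows)]


# Hypothetical referenced table with three primary keys.
store_ids = ["s1", "s2", "s3"]
foreign_keys = assign_random_foreign_keys(store_ids, 6, seed=0)

# Every foreign key points at an existing primary key, so referential
# integrity holds, but the matches ignore the attributes of both tables.
print(foreign_keys)
print(all(fk in store_ids for fk in foreign_keys))
```

Random matches like these keep the database consistent but carry no information about which entries belong together, which is the gap that Smart Select columns are meant to close.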

Output


(1) Optionally specify the size of the synthetic data

Specifying the number of generated subjects will determine the size of the synthetic data.
If you leave these fields blank, MOSTLY AI will use the same number of subjects during training and generation as the original dataset.


(2) Click start job to switch to the job settings.

The job settings contain the same settings you configured in the catalog.
Additionally, in the output settings, you can choose a destination for your synthetic data.
You can always download the synthetic data as CSV or Parquet files.