You can use this guide as a reference when configuring a job.
Check out the View the job progress guide when running a job.

You will need a dataset or a readily configured catalog to complete this guide.
Feel free to download a ready-to-use dataset if you don’t have anything at hand.

It will take 30 mins to complete this guide.
You’ll promptly be sharing your synthetic data across your business and partnerships.

Upload a dataset or select a catalog

On the Jobs page, click Create synthetic data to begin.

Create synthetic data

A new page appears where you can upload your dataset or select a catalog with preconfigured data sources. Take the following actions to do so:

Upload a dataset or select a catalog

1. Select the job type

Ad hoc jobs

Lets you upload a dataset from your computer.

Database catalog

Lets you select a catalog with a database data source.

Cloud storage catalog

Lets you select a catalog with a cloud storage data source.

On the server

Lets you select a catalog with data that’s on the server running MOSTLY AI.


2. Specify what to synthesize and continue

Depending on whether you selected Ad hoc jobs or one of the catalog options, you can upload your dataset or select a catalog, respectively.

Ad hoc job

Ad hoc job file input

There are two upload areas, one for a subject table and another for its corresponding linked table. If you want to learn more about these table formats, check out the data formatting requirements.

You can also upload tables that are partitioned over different files as long as they have the same schema.

  1. Drag your subject table file(s) to the respective upload area or click the plus icon to use your computer’s file browser. The linked table upload area becomes available once you’ve specified the subject table files.

  2. Optionally, use the Table name fields to rename your tables.

  3. Click proceed to upload your files to the MOSTLY AI server and continue to the job settings.
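
Before uploading a table that is partitioned over several files, you can check locally that the files share a schema. The following is a minimal sketch, assuming CSV files and assuming that "same schema" means the same column names in the same order. The file names and contents are hypothetical, and this is not the check MOSTLY AI itself performs.

```python
import csv
import io


def read_header(csv_text: str) -> list[str]:
    """Return the column names from the first row of a CSV document."""
    return next(csv.reader(io.StringIO(csv_text)))


def same_schema(parts: dict[str, str]) -> bool:
    """True if every partition has the same column names in the same order."""
    headers = [read_header(text) for text in parts.values()]
    return all(header == headers[0] for header in headers)


# Hypothetical partitions, inlined as strings so the sketch is self-contained.
partitions = {
    "players_part1.csv": "id,name,country\n1,Ann,AT\n",
    "players_part2.csv": "id,name,country\n2,Bob,DE\n",
}

# A third partition that is missing the "country" column.
mismatched = dict(partitions)
mismatched["players_part3.csv"] = "id,name\n3,Cy\n"

print(same_schema(partitions))
print(same_schema(mismatched))
```

For real files, you could read each file's first line from disk instead of using inlined strings.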

Catalog

Catalog selection
  1. Select the catalog you want to synthesize.

  2. Click start job to select a data destination and start the job. Or, click edit catalog to review the catalog before starting the job.

Relationships

If your data contains one or more subject tables and one or more linked tables, you can use this tab to specify how they’re linked.

When synthesizing uploaded files or a file-based catalog, MOSTLY AI will automatically link them if the subject table contains a column called id and the first column of the linked table contains _id in its name (for instance, players_id).
Please make sure your tables are correctly linked before proceeding.
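
The automatic linking rule for file-based data can be checked before you upload. The sketch below encodes the rule as stated above. The function name is hypothetical, and the exact, case-sensitive string matching is an assumption rather than documented behavior.

```python
def auto_linkable(subject_columns: list[str], linked_columns: list[str]) -> bool:
    """Check the naming rule for automatic linking of file-based tables.

    The subject table needs a column called "id", and the first column of
    the linked table needs "_id" somewhere in its name.
    Assumption: matching is exact and case-sensitive.
    """
    subject_has_id = "id" in subject_columns
    first_linked_has_id = bool(linked_columns) and "_id" in linked_columns[0]
    return subject_has_id and first_linked_has_id


# Hypothetical column lists for a players table and a linked seasons table.
print(auto_linkable(["id", "name"], ["players_id", "season"]))
print(auto_linkable(["id", "name"], ["season", "players_id"]))
```

The second call fails the rule because the column containing _id is not the first column of the linked table.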

If you selected a database catalog in the previous step whose relationships are undefined or only partially defined, you can use this tab to define them.

Let’s take a look at the options that are available to manage the relationships:

Relationship manager

1. Table list

This list shows all the tables that will appear in your synthetic data.
They’re sorted by table type. Subject tables are at the top, linked tables in the middle, and reference tables at the bottom.

Click on a table to open the relationship drawer and edit its primary and foreign keys.


2. Referenced tables

This part shows which tables are referenced by the tables in the table list.
Clicking a row opens the relationship drawer of the corresponding table in the table list.


2. Filter

Filter the relationships view by subject tables, linked tables, reference tables or tables without relations.


3. Add, modify, or delete relationships

Hovering over a row reveals the following options:

Plus icon

Opens a wizard for adding referenced tables.

Cog icon

Opens the relationship drawer.

Relationship drawer

Primary key and referring tables

  • View, modify or specify the table’s primary key.

  • Click Show referring table to see which tables refer to it.

Foreign keys and referenced tables

  • Clicking add foreign key adds a new row to the list of foreign keys.

  • Use the Foreign key and Referenced table drop-down menus to specify or modify relationships.

Trashcan icon

Removes the relationship.

Arrow icon

Lets you set the primary key of the referenced table.

Data settings

Use this tab to configure how MOSTLY AI processes your database columns during synthesization:

Data settings


1. Table list

This list shows all the tables that will appear in your synthetic data.
Click on a table to view its columns and modify the column settings.


2. Column list

This list shows the following details:

Include

Include the column in the synthetic data.

Column name

The name of the column.

Generation method

The way a column will be rendered to the synthetic dataset.


3. Click the cog icon to open a column’s settings drawer

Generation method

  • The way a column will be rendered to the synthetic dataset.

  • All generation methods are fixed, except for AI-powered generation, which you can change to Mock data.

  • Check out the Generation method overview below to learn more about the available configuration options.

Generation mood

Select the degree to which the synthetic version of the column will adhere to the detected distributions and correlations in the original data.

For a list of granular generation moods, see the Generation mood tutorial.

Smart imputation

If the original column contains missing values, they will be imputed in the synthetic data.

Use this column to sort the table
Only available for linked tables

  • Lets you sort a table by the column of your choice in ascending or descending order.

  • Helps preserve sequential information during the synthesization process.
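
To see why sorting matters for sequential data, consider a linked table of events. The sketch below orders the events by a date column within each subject. The column names and rows are hypothetical, and grouping by the foreign key is an assumption about what "sequential information" means here, not a description of MOSTLY AI internals.

```python
from operator import itemgetter

# Hypothetical linked table: each row belongs to a subject via players_id.
events = [
    {"players_id": 2, "date": "2021-03-01", "points": 7},
    {"players_id": 1, "date": "2021-02-10", "points": 3},
    {"players_id": 1, "date": "2021-01-05", "points": 5},
]


def sort_linked(rows, key_column, sort_column, descending=False):
    """Order rows by sort_column within each subject.

    Uses two stable sorts: first by the sort column, then by the foreign
    key, so the order inside each subject's group is preserved.
    """
    ordered = sorted(rows, key=itemgetter(sort_column), reverse=descending)
    return sorted(ordered, key=itemgetter(key_column))


for row in sort_linked(events, "players_id", "date"):
    print(row["players_id"], row["date"], row["points"])
```

With ascending order, each subject's events appear from oldest to newest, which is the sequence a model would need to see to learn time-dependent patterns.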

Generation method overview

AI-powered generation

Behavior: Uses the column for AI-powered synthetic data generation.
Roles: Subject, Linked

Context foreign key

Behavior: Links the entries in this table to their corresponding entries in the subject table.
Roles: Linked
Configuration options: No available options

Smart Select foreign key

Behavior: Links the entries in the synthetic version of this table to their entries in the synthetic version of the referenced table.
Roles: Subject, Linked

Reference foreign key

Behavior: Links the entries in this table to their corresponding entries in a reference table.
Roles: Subject, Linked
Configuration options: No available options

Mock data

Behavior: Generates random data within the constraints of the configured data type and format.
Roles: Subject, Linked

Primary key ID

Behavior: Generates new primary key IDs for the synthetic version of the table.
Roles: Subject
Configuration options: Sequential, UUID, UUID no hyphen, Hash
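
To get a feel for the four primary key ID options, the sketch below generates example values for each. It is illustrative only: the starting number for sequential IDs, the UUID version, and the hash algorithm (SHA-256 here) are assumptions, and the function name is hypothetical.

```python
import hashlib
import uuid


def make_ids(n: int, fmt: str) -> list[str]:
    """Generate n example primary key IDs in the given format."""
    if fmt == "sequential":
        # Assumption: counting starts at 1.
        return [str(i) for i in range(1, n + 1)]
    if fmt == "uuid":
        # Random (version 4) UUIDs in the 8-4-4-4-12 hyphenated form.
        return [str(uuid.uuid4()) for _ in range(n)]
    if fmt == "uuid_no_hyphen":
        # The same kind of UUID as 32 hexadecimal characters.
        return [uuid.uuid4().hex for _ in range(n)]
    if fmt == "hash":
        # Assumption: SHA-256 of a running counter, as a hex digest.
        return [hashlib.sha256(str(i).encode()).hexdigest() for i in range(1, n + 1)]
    raise ValueError(f"unknown format: {fmt}")


for fmt in ("sequential", "uuid", "uuid_no_hyphen", "hash"):
    print(fmt, make_ids(1, fmt)[0])
```

Whichever option you choose, pick one that matches the key format expected by the systems that will consume the synthetic data.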


4. Click edit multiple columns to open the bulk editor.

You can use the bulk editor to configure multiple columns at once.
Tick the checkboxes of the columns you want to configure and use the settings fields in the top row to adjust them.

Column bulk editor

Training settings

Use these settings to specify whether AI model training needs to be done quickly or accurately.
You can also optimize training performance if the results of an earlier job were not of the desired accuracy or took too long to generate.

Let’s take a look at the options on this page:


1. Table list

This list shows all the tables that will appear in your synthetic data.
Click on a table to view or modify its training settings.


2. Training settings

The following training settings are available:

Training goal

Select Accuracy to achieve the highest attainable synthetic data accuracy.
Select Speed to produce accurate synthetic data with significantly shorter training times.

Maximum epochs

This setting lets you limit the number of epochs to, for instance, 2, 5, or 10. This can significantly reduce training time, but comes at the cost of accuracy.

Model size

Adjust the model size if the synthesization job runs into memory issues, takes too long to complete, or produces synthetic data with less than the desired accuracy. Smaller sizes require less memory, run faster, and reduce synthetic data accuracy, whereas bigger sizes increase accuracy, require more memory, and take more time to complete.

Batch size

Batch size refers to the number of records used for each training step. Selecting a larger batch size can speed up training, but consumes more memory and can decrease accuracy.

If you get out-of-memory errors during training, try decreasing the batch size.
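
The effect of batch size on the amount of work per epoch can be estimated with simple arithmetic. The sketch below assumes one training step per batch and that a final, partially filled batch still counts as a step. The record count is hypothetical.

```python
import math


def steps_per_epoch(n_records: int, batch_size: int) -> int:
    """Number of training steps needed to pass over all records once."""
    # Assumption: the last, smaller batch is processed rather than dropped.
    return math.ceil(n_records / batch_size)


n_records = 10_000  # hypothetical number of training records
for batch_size in (64, 128, 256, 512):
    print(batch_size, steps_per_epoch(n_records, batch_size))
```

Larger batches mean fewer steps per epoch, but each step holds more records in memory at once, which is why reducing the batch size can help with memory errors.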


3. Click edit multiple tables to open the bulk editor.

You can use the bulk editor to configure multiple tables at once.
Tick the checkboxes of the tables you want to configure and use the settings fields in the top row to adjust them.


4. Click edit smart select to improve synthetic data accuracy of databases.

To maintain referential integrity, MOSTLY AI needs to make matches between the entries of the referenced and referring tables of Smart Select relationships. By default, these are randomly linked: the foreign key column is populated with IDs drawn at random from the primary key column of the referenced table.

You can change this behavior by designating one or more columns of the parent table in a relationship as Smart Select columns. MOSTLY AI can then use these attributes to find appropriate matches with the entries in the referring table of a relationship. This will result in a more accurate rendering of these relationships in the synthetic database.
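
The default random linking can be pictured with a few lines of Python. This is a minimal sketch of the idea only, with hypothetical names and data. It does not show how Smart Select columns are used for matching, which this guide does not describe in detail.

```python
import random


def assign_random_foreign_keys(primary_keys, n_referring_rows, seed=None):
    """Fill a foreign key column with IDs drawn at random from the primary key.

    Every drawn value exists in the referenced table, so referential
    integrity holds, but the pairing carries no information from the data.
    """
    rng = random.Random(seed)
    return [rng.choice(primary_keys) for _ in range(n_referring_rows)]


referenced_ids = [10, 11, 12]  # hypothetical primary keys of the referenced table
foreign_keys = assign_random_foreign_keys(referenced_ids, 8, seed=0)

print(foreign_keys)
print(set(foreign_keys) <= set(referenced_ids))
```

Because the draw ignores every other attribute, correlations between the two tables are lost, which is the gap that designating Smart Select columns is meant to close.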

Smart Select drawer
  1. Click add smart select column and select a suitable column from the drop-down menu.

  2. Drag to rank the columns by importance.

  3. Click apply to referring tables to complete the configuration. The selected columns will be applied to the Smart Select foreign keys of the referring tables.

Output


1. Select a data destination

Choose a destination from the drop-down menu. You can always download the synthetic data as CSV or Parquet files.


2. Optionally specify the size of the synthetic data

Specifying the number of generated subjects will determine the size of the synthetic data.
If you leave these fields blank, MOSTLY AI will use the same number of subjects during training and generation as the original dataset.


3. Click start job to start your job

Check out the View the job progress guide to learn more.