Configure a synthetic dataset

Before you click Create a synthetic dataset to start synthesizing data, you can configure the synthetic dataset settings.

For each synthetic dataset, the settings are available in the Tables, Data settings, and Output settings tabs.

For reference, the table below lists all available settings in each tab.

Tab: Settings and available actions

Tables
• Add tables to the synthetic dataset
• Remove tables
• Training settings for each table

Data settings
• List of tables in the synthetic dataset
• List of columns in each table
• Edit multiple columns
• Include / exclude columns from the synthetic dataset
• Generation method for each column
    • AI -> Encoding type
    • Primary key -> Primary key type
    • Foreign key -> Foreign key type
    • Mock data -> Data type
• Privacy protection options
    • Rare category protection
    • Extreme value protection
• Data augmentation options
    • Generation mood
    • Data rebalancing
    • Smart imputation

Output settings
• Data destination
• Specify the number of records to generate for each subject table

Training settings

MOSTLY AI trains a separate AI model for each table in your synthetic dataset. For each table, you can define the training settings for each AI model.

The training settings for each table are available in the Tables tab for a synthetic dataset. Click a table in the Tables tab to open the Training settings drawer which contains all training settings.

Setting: Description

Name
The table name as it will appear in the generated synthetic data.

Training goal
• Accuracy: Select Accuracy to achieve the highest attainable synthetic data accuracy. When you select Accuracy, MOSTLY AI sets Maximum training epochs to 100.
• Speed: Select Speed to generate accurate synthetic data with significantly shorter training times. When you select Speed, MOSTLY AI auto-updates the following training settings:
    • Maximum training epochs is set to 10
    • Training size is set to 100,000
• Turbo: Select Turbo if you need to generate a synthetic dataset quickly and accuracy is not a concern. With Turbo, the model trains for only one epoch, that is, one cycle of training the machine learning model with all data. When you select Turbo, MOSTLY AI auto-updates the following training settings:
    • Maximum training epochs is set to 1
    • Training size is set to 10,000

Training size
The number of records from the original table to use for training the AI model for this table. If you leave Training size blank, all records in the original table are used. If you specify fewer records than are available in the original table, training is faster, but the accuracy of the synthetic data can decrease.

Maximum training epochs
A training epoch is a cycle of training during which the model updates its internal parameters with the entire dataset. After each epoch, the model evaluates its error, and the parameters from the epoch with the lowest error are selected as the final model parameters. You can limit the maximum number of training epochs to reduce training time. This comes at the cost of accuracy.

Model size
The model size determines the overall number of parameters that the AI model uses during training. A larger model uses more parameters. You can adjust the model size if the synthetic dataset runs into memory issues, takes too long to complete, or produces synthetic data with less than the desired accuracy. A smaller model requires less memory and runs faster but reduces synthetic data accuracy. A bigger model increases accuracy but requires more memory and takes more time to complete.

Batch size
The number of records used for each training step. A larger batch size can speed up training, but it consumes more memory and can decrease accuracy.
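
To make the effect of the Training goal presets concrete, the following is a minimal sketch in plain Python. It is not MOSTLY AI code: the `TrainingSettings` class and both function names are hypothetical, and only the preset values come from the descriptions above. The rule that a Training size larger than the table simply uses all available records is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class TrainingSettings:
    """Hypothetical container for two of the settings in the drawer."""
    max_epochs: int = 100
    training_size: Optional[int] = None  # None stands for a blank field: use all records


# Preset values as listed in the Training goal descriptions.
# Accuracy only changes Maximum training epochs; Speed and Turbo also set Training size.
GOAL_PRESETS = {
    "Accuracy": {"max_epochs": 100},
    "Speed": {"max_epochs": 10, "training_size": 100_000},
    "Turbo": {"max_epochs": 1, "training_size": 10_000},
}


def apply_training_goal(settings: TrainingSettings, goal: str) -> TrainingSettings:
    """Return a copy of the settings with the preset for the given goal applied."""
    if goal not in GOAL_PRESETS:
        raise ValueError(f"Unknown training goal: {goal!r}")
    return replace(settings, **GOAL_PRESETS[goal])


def effective_training_size(settings: TrainingSettings, available_records: int) -> int:
    """Number of records used for training (assumed to be capped at the table size)."""
    if settings.training_size is None:
        return available_records
    return min(settings.training_size, available_records)
```

For example, `apply_training_goal(TrainingSettings(), "Turbo")` yields one epoch and a training size of 10,000, so a table with 50,000 records would train on 10,000 of them under these assumptions.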

Relationship settings

Configure relationships between two tables

A two-table dataset typically has a subject table and a linked table. To create a synthetic version of such a dataset, define the foreign key relationship in the synthetic dataset settings so that MOSTLY AI can identify the linked table.

Prerequisites

  • Prepare a two-table dataset in which one of the tables is a subject table and the other is a linked table with time series data. For example, you can download the Baseball dataset.
  • Upload the Baseball (or another two-table dataset) to a cloud storage bucket.
  • Create a connector to the cloud storage bucket.

Steps

  1. On the MOSTLY AI Home page, upload the players.csv table in the Upload files area.
  2. Click Proceed.
  3. On the Synthetic datasets / Start job screen, click Add table.
  4. In the Add table drawer, upload the seasons.csv table.
  5. Click Proceed.
  6. On the left, select Data settings.
  7. (Optional) Set the primary key on the players table.

    This step is optional because you also select the primary key of the subject table when you set the foreign key in step 8. Completing it first gives you a reference for step 8.

    1. Select the players table.
    2. Click the id row.
    3. In the Settings for `id` drawer, for Generation method select Primary key.
    4. For Generation format, select Sequential.
    5. Click Save. Step result: The primary key for the players table is now set on the id column.
  8. Set a foreign key in the seasons table.
    1. Select the seasons table.
    2. Click the players_id row.
    3. In the Settings for `players_id` drawer, for Generation method select Foreign key.
    4. For Foreign key type, select Context.
    5. For Parent table, select players.
    6. For Parent primary key, MOSTLY AI automatically selects the configured primary key id.

Result

The foreign key set on the players_id column defines the seasons table as the linked table.

If you generate a synthetic dataset with this configuration, it fully retains the referential integrity and the correlations between the players and seasons tables.
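
Referential integrity here means that every players_id value in the seasons table matches an id value in the players table. As an illustration only, the sketch below checks that property on two small CSV snippets using the Python standard library. The `find_orphans` helper and the sample rows are made up for this example; only the column names id and players_id come from the steps above.

```python
import csv
import io


def find_orphans(parent_rows, child_rows, parent_key="id", foreign_key="players_id"):
    """Return the child rows whose foreign key has no matching parent primary key."""
    parent_ids = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] not in parent_ids]


# Made-up stand-ins for players.csv and seasons.csv.
players_csv = "id,name\n1,player_a\n2,player_b\n"
seasons_csv = "players_id,year\n1,2001\n1,2002\n2,2001\n3,1999\n"

players = list(csv.DictReader(io.StringIO(players_csv)))
seasons = list(csv.DictReader(io.StringIO(seasons_csv)))

# The last seasons row points at players_id 3, which does not exist in players.
orphans = find_orphans(players, seasons)
```

An empty `orphans` list indicates that the linked table references only existing subject records. You could run the same check on generated files by reading them with `csv.DictReader` instead of the inline strings.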

What's next

You can now click Create a synthetic dataset to generate a synthetic version of the baseball dataset.

Configure relationships between multiple tables

Relational databases often contain many tables with complex relationships. The steps below illustrate how to configure the relationships of a linked table that has foreign keys to multiple subject tables, and how that configuration affects the referential integrity and correlations between the tables.

About this task

The steps below demonstrate the relationship configuration with a fictional order management database. In the example, one of the tables is a linked table, and the steps show how to create foreign keys from it to multiple subject tables.

Steps

  1. Open a database or a cloud storage catalog with multiple tables.
  2. Select Data settings.
  3. Select a table. For this example, select the OrderLineItem table. Three of the column names in OrderLineItem suggest that they are foreign keys to other tables: order_id, product_id, and order_status_id.

    Note
    When a linked table has multiple foreign keys, you can configure only one of them as Context. The rest must be Smart Select.

    A Context foreign key completely preserves the referential integrity and the correlations between the linked table and the subject table. You can have only one Context foreign key per table.

    A Smart Select foreign key maintains the referential integrity between two tables but does not guarantee that the synthetic data fully retains the correlations between them.

    If a table has foreign keys to multiple other tables, choose which foreign key to set as Context and set the remaining ones as Smart Select.

    For this example, the next steps demonstrate how to set order_id as the Context and product_id as a Smart Select foreign key.

  4. Set a Context foreign key for the order_id column.
    1. Click the order_id column. Step result: The Settings for `order_id` drawer opens.
    2. For Generation method, select Foreign key. Step result: The UI controls below change to support the foreign key configuration.
    3. For Foreign key type, select Context.
    4. For Parent table, select the subject table. In this case, select Order.
    5. For Parent primary key, select the subject table primary key. In this case, select id.
    6. Click Save.
  5. Set a Smart Select foreign key for the product_id column.
    1. Click the product_id column. Step result: The Settings for `product_id` drawer opens.
    2. For Generation method, select Foreign key.
    3. For Foreign key type, select Smart Select.
      💡

      Because you already set the Context foreign key on order_id, you can now only select Smart Select.

    4. For Parent table, select the subject table. In this case, select Product.
    5. For Parent primary key, select the subject table primary key. In this case, select id.
    6. Click Save.

Result

When MOSTLY AI generates the synthetic data, it fully retains both the correlations and referential integrity between the OrderLineItem table and the Order table because of the Context foreign key configuration.

MOSTLY AI also fully maintains the referential integrity between the OrderLineItem and Product tables. However, it retains the correlations between OrderLineItem and Product only on a best-effort basis; full retention is not possible.

Review the table below for a summary of the relationship configuration and its impact on the generated synthetic data.

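OrderLineItem column: order_id
• Foreign key type: Context
• Parent table: Order
• Retention of referential integrity between linked and subject tables: Yes
• Retention of correlations between linked and subject tables: Yes

OrderLineItem column: product_id
• Foreign key type: Smart Select
• Parent table: Product
• Retention of referential integrity between linked and subject tables: Yes
• Retention of correlations between linked and subject tables: Best effort, but not fully retained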
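
The rule that a table can have only one Context foreign key is easy to check before you configure a larger schema. The sketch below is a tool-independent illustration: the list-of-dictionaries layout and the function name are hypothetical and do not reflect how MOSTLY AI stores its configuration. The two entries mirror the example above.

```python
from collections import Counter

# Hypothetical description of the foreign keys configured in this example.
FOREIGN_KEYS = [
    {"table": "OrderLineItem", "column": "order_id", "type": "Context", "parent": "Order"},
    {"table": "OrderLineItem", "column": "product_id", "type": "Smart Select", "parent": "Product"},
]


def tables_with_multiple_context_keys(foreign_keys):
    """Return the names of tables that declare more than one Context foreign key."""
    counts = Counter(fk["table"] for fk in foreign_keys if fk["type"] == "Context")
    return sorted(table for table, count in counts.items() if count > 1)


# The example configuration is valid, so no table is reported.
violations = tables_with_multiple_context_keys(FOREIGN_KEYS)
```

If `product_id` were also marked as Context in this list, the function would report OrderLineItem, which corresponds to the UI offering only Smart Select once a Context foreign key is set.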

View relationship diagram

While you are configuring relationships between tables, you can review the overall relationship diagram from the Tables page.

Steps

On the Tables page, click the Relationship diagram button.

Result

The relationship diagram appears in a modal window. The relationship diagram shows all Context and Smart Select foreign keys that you set between tables.

Output settings

Configure a data destination

To start a new synthetic dataset, you need to define a destination in the Output settings > Data destination drop-down menu.

Prerequisites

  • Database destinations
    • Prepare a database to deliver the generated synthetic data.
    • Create a connector that points to the database.
    • Your credentials for the database must have permissions to create new tables.
    • Make sure that your destination database does not have tables with the same name as the tables in your catalog. Otherwise, the delivery to the database will fail for any existing tables.
  • Cloud storage destinations
    • Prepare a cloud storage bucket to deliver the generated synthetic data.
    • Create a connector that points to the cloud bucket.
    • Your credentials for the cloud bucket must have permissions to create new folders and files.
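
Before you pick a database destination, you may want to confirm that none of the tables in your synthetic dataset already exist there. The sketch below shows one way to do that, using an in-memory SQLite database from the Python standard library as a stand-in for the destination. The `existing_tables` helper is hypothetical, and the catalog query is specific to SQLite; other databases expose table names through different system views.

```python
import sqlite3


def existing_tables(conn, table_names):
    """Return the subset of table_names that already exist in the SQLite database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    present = {name for (name,) in rows}
    return sorted(name for name in table_names if name in present)


# Stand-in destination that already contains a players table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (id INTEGER PRIMARY KEY)")

# Table names from the synthetic dataset that would collide with existing tables.
conflicts = existing_tables(conn, ["players", "seasons"])
```

A non-empty `conflicts` list names the tables for which delivery would fail, according to the prerequisite above.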

Steps

  1. Start a new synthetic dataset.
  2. Select Output settings.
  3. From Data destination, select an existing connector that points to the database or cloud bucket destination where you want to deliver the synthetic data.

What's next

You can click Create a synthetic dataset to start. As part of the generation of the synthetic dataset, MOSTLY AI delivers the synthetic data to the selected destination.

Drop tables in the destination

If you set a database connector as the destination for your synthetic data, MOSTLY AI delivers the synthetic data only if no tables with the same names exist. If such tables already exist, MOSTLY AI cannot complete the delivery of data.

In such cases, you can enable the Drop tables in the destination checkbox in Output settings. MOSTLY AI will drop only tables that match the table names from the current synthetic dataset configuration.

Tables are dropped before running the AI model training and data generation steps. If the synthetic dataset creation fails during these steps, the dropped tables will no longer be available in your destination database.
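
To illustrate what dropping only matching tables means, the sketch below reproduces the idea against an in-memory SQLite database from the Python standard library. It is an assumption-laden illustration, not MOSTLY AI's implementation: the `drop_matching_tables` helper and the table names are hypothetical.

```python
import sqlite3


def drop_matching_tables(conn, table_names):
    """Drop only the tables whose names appear in table_names; return the names dropped."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    present = {name for (name,) in rows}
    dropped = []
    for name in table_names:
        if name in present:
            # Identifiers cannot be bound as parameters, so quote the name explicitly.
            quoted = '"{}"'.format(name.replace('"', '""'))
            conn.execute("DROP TABLE " + quoted)
            dropped.append(name)
    conn.commit()
    return dropped


# Stand-in destination with one table that matches the dataset and one unrelated table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE audit_log (id INTEGER PRIMARY KEY)")

dropped = drop_matching_tables(conn, ["players", "seasons"])
remaining = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
```

In this sketch the unrelated `audit_log` table survives, while `players` is gone as soon as the function returns. That mirrors the warning above: once a table is dropped, a later failure does not bring it back, so back up any data you still need.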

Prerequisites

Make sure that the credentials defined in the database connector have the permissions to drop tables from the database. If necessary, check with your Database Administrator.

Steps

  1. From Data destination, select an existing database connector.
    💡

    The Drop tables in the destination checkbox is not available for cloud storage destinations.

  2. Select the Drop tables in the destination checkbox.

Result

When you click Create a synthetic dataset, MOSTLY AI connects to the database and checks for tables whose names match the tables in the current synthetic dataset. If such tables exist, MOSTLY AI drops them and then proceeds to generate the synthetic data.

If the synthetic dataset creation is successful, MOSTLY AI delivers the synthetic data to the destination database.