Synthetic datasets

To generate synthetic data in MOSTLY AI, you start a new synthetic dataset. You can view all finished, canceled, failed, and in-progress synthetic datasets on the Synthetic datasets tab.

For each synthetic dataset, you have a list of actions that you can take.

What is a synthetic dataset?

A synthetic dataset goes through all necessary tasks to generate synthetic data from your original data.

When a synthetic dataset runs, it completes the same tasks for each table in the original data. You can review all tasks per table in the list below.

  • Fetch your original data from your data source
  • Analyze your original data
  • Train an AI model on your original data, its correlations, distributions, and other statistical properties
  • Generate a Model QA report (for the trained AI model)
  • Generate synthetic data
  • Generate a QA report for the synthetic data
  • Save synthetic dataset logs
  • Save synthetic dataset configuration settings
  • Post-process data
  • Export data
  • Deliver the generated synthetic data to the configured destination

When a synthetic dataset completes, it includes the following list of artifacts.

  • Generated synthetic data
  • QA report
  • Training and generation logs
  • Configuration settings logs

Methods to create synthetic datasets

You can create a new synthetic dataset with one of two methods: Upload a file or Use a catalog.

Upload a file

With the Upload a file method, you can synthesize multiple uploaded tables. A single table can span multiple files.

The supported file formats for your uploaded tables are CSV, TSV, and Parquet.

MOSTLY AI also provides experimental support for JSON Lines, Feather, and ORC formats.
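As a rough illustration of the format rules above, the following sketch sorts file names into supported, experimental, and unsupported groups before upload. The function name `classify_upload` and the extension-to-format mapping are assumptions made for this example; the guide names the formats but not the file extensions, and this is not part of any MOSTLY AI API.

```python
from pathlib import Path

# Assumed file extensions for the formats named in this guide.
SUPPORTED = {".csv": "CSV", ".tsv": "TSV", ".parquet": "Parquet"}
EXPERIMENTAL = {".jsonl": "JSON Lines", ".feather": "Feather", ".orc": "ORC"}


def classify_upload(filename: str) -> str:
    """Return 'supported', 'experimental', or 'unsupported' for a file name."""
    suffix = Path(filename).suffix.lower()  # compare case-insensitively
    if suffix in SUPPORTED:
        return "supported"
    if suffix in EXPERIMENTAL:
        return "experimental"
    return "unsupported"


if __name__ == "__main__":
    for name in ["customers.csv", "events.jsonl", "notes.docx"]:
        print(name, "->", classify_upload(name))
```

A check like this can help catch unsupported files locally before you start the upload.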

Use a catalog

With the Use a catalog method, you can synthesize multi-table data from databases and multi-table datasets that you host on cloud object storage providers.

Download synthetic data

See Download a synthetic dataset.

Download QA report, logs, or settings logs

To download the QA report, logs, or configuration settings for a synthetic dataset, use the Download button.

💡

Tip
You can download the logs and settings for failed, canceled, finished, and in-progress synthetic datasets.

The QA report is only available for successfully generated synthetic datasets.
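The availability rules above can be summarized in a short sketch. The status strings and the function name `downloadable_artifacts` are hypothetical and chosen only for illustration; the sketch also assumes that "successfully generated" corresponds to the finished status.

```python
from typing import List

# Status names follow the wording in this guide; the exact string values are assumed.
KNOWN_STATUSES = {"finished", "canceled", "failed", "in-progress"}


def downloadable_artifacts(status: str) -> List[str]:
    """List the artifacts that can be downloaded for a given dataset status."""
    if status not in KNOWN_STATUSES:
        raise ValueError(f"Unknown status: {status!r}")
    # Logs and configuration settings are available for every status.
    artifacts = ["logs", "configuration settings"]
    # The QA report exists only for successfully generated datasets.
    if status == "finished":
        artifacts.insert(0, "QA report")
    return artifacts


if __name__ == "__main__":
    for status in sorted(KNOWN_STATUSES):
        print(status, "->", downloadable_artifacts(status))
```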

Steps

  1. On the Synthetic datasets screen, click the Download button for a synthetic dataset.
  2. In the pop-up menu, select a synthetic dataset artifact.

Share a synthetic dataset

You can generate a public URL of a synthetic dataset and share it with anyone.

Steps

  1. Click Share from the Synthetic datasets tab or after you open a synthetic dataset.
  2. In the drawer, click Copy under Share job using link.

Result

MOSTLY AI generates a new publicly available URL for the synthetic dataset. The URL is copied to your clipboard.

What's next

You can now paste and send the URL to the people you want to share the synthetic dataset with. When they open the URL, they can perform only the following actions.

  • Preview synthetic data
  • Download synthetic data
  • Download QA report
  • Review the QA report contents

People who open a publicly shared synthetic dataset link cannot perform the following actions:

  • Generate more data
  • Download logs
  • Download configuration settings
  • Share synthetic data
    ⚠️

    Important
    Even though the Share option is disabled and people cannot generate a new URL from the Share drawer, they can still send the URL you shared with them.
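The two lists above amount to a small permission table for shared links. The sketch below encodes it as plain data; the names `ALLOWED_VIA_SHARED_LINK`, `BLOCKED_VIA_SHARED_LINK`, and `can_perform` are hypothetical and only mirror the actions listed in this guide.

```python
# Actions available to anyone who opens a publicly shared link.
ALLOWED_VIA_SHARED_LINK = frozenset({
    "Preview synthetic data",
    "Download synthetic data",
    "Download QA report",
    "Review the QA report contents",
})

# Actions that stay restricted for people using a shared link.
BLOCKED_VIA_SHARED_LINK = frozenset({
    "Generate more data",
    "Download logs",
    "Download configuration settings",
    "Share synthetic data",
})


def can_perform(action: str) -> bool:
    """Return True if a shared-link visitor can perform the action."""
    if action in ALLOWED_VIA_SHARED_LINK:
        return True
    if action in BLOCKED_VIA_SHARED_LINK:
        return False
    raise ValueError(f"Action not covered by this guide: {action!r}")


if __name__ == "__main__":
    print(can_perform("Download QA report"))
    print(can_perform("Download logs"))
```

Note that blocking the Share action does not stop a recipient from forwarding the URL itself.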

Delete a synthetic dataset

Use the Delete button to delete an existing synthetic dataset.

💡

Important
When you delete a synthetic dataset, the following related artifacts are also deleted and cannot be recovered.

  • Synthetic data
  • QA report
  • Logs and configuration settings

Steps

  1. Click Delete job from the Synthetic datasets tab or after you open a synthetic dataset.
  2. In the confirmation pop-up, click Delete job.

Open a synthetic dataset

Click a synthetic dataset in the Synthetic datasets tab to open it. You can do so for synthetic datasets of any status. After you open a synthetic dataset, you can view the log and progress of the completed tasks, and review the synthetic dataset configuration.

Preview synthetic data

With a synthetic dataset open, on the Sample data section you can preview up to 100 samples from each generated synthetic table.

Hover over the currently selected table name to switch between tables (if your synthetic dataset includes more than one table). The number of all synthetic data samples in the currently selected table is available in the All samples field.
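To make the relationship between the preview and the All samples field concrete, here is a minimal sketch. It assumes a table is held as a list of row dictionaries and that the preview simply takes the first rows; the guide does not state which rows are selected, and `preview_table` is a hypothetical helper, not a MOSTLY AI function.

```python
from typing import Dict, List, Tuple

PREVIEW_LIMIT = 100  # the guide states that previews show up to 100 samples per table

Row = Dict[str, object]


def preview_table(rows: List[Row], limit: int = PREVIEW_LIMIT) -> Tuple[List[Row], int]:
    """Return (preview rows, total sample count) for one synthetic table."""
    # Slicing never fails on short tables: it returns whatever rows exist.
    return rows[:limit], len(rows)


if __name__ == "__main__":
    table = [{"id": i} for i in range(250)]
    preview, all_samples = preview_table(table)
    print(len(preview), "previewed of", all_samples)
```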

View the QA report

Click the QA report section to review the QA reports for a synthetic dataset.

For more information, see Read the QA report.

View and track the synthetic dataset progress

For each table that you synthesize, you can track the tasks that MOSTLY AI completes to analyze and encode the table data, train the AI model, and then generate synthetic data with the trained AI model.

Steps

  1. After you open a synthetic dataset, select the Logs section to track the synthetic dataset tasks as they progress.
  2. (Optional) For a Train AI model task, click the View the training logs button.
     Step result: You can review the training progress as each epoch completes and, finally, see the selected epoch, which is the one with the lowest validation loss.
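The epoch selection rule described in the step result (pick the epoch with the lowest validation loss) can be sketched as follows. The function name `select_epoch`, the 1-based epoch numbering, and the tie-breaking toward the earliest epoch are assumptions of this example, and the loss values are made up for illustration.

```python
from typing import List, Tuple


def select_epoch(validation_losses: List[float]) -> Tuple[int, float]:
    """Return (1-based epoch number, loss) for the lowest validation loss."""
    if not validation_losses:
        raise ValueError("At least one epoch is required")
    # min() returns the first minimum, so ties resolve to the earliest epoch.
    best_index = min(range(len(validation_losses)), key=validation_losses.__getitem__)
    return best_index + 1, validation_losses[best_index]


if __name__ == "__main__":
    losses = [0.91, 0.74, 0.69, 0.72]  # illustrative values only
    epoch, loss = select_epoch(losses)
    print(f"Selected epoch {epoch} with validation loss {loss}")
```

In this example, the loss rises again after the third epoch, so the third epoch is the one selected.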

View column configuration

On the Summary screen, click Configuration to review the data settings (Generation method and Encoding type) for each column of a table.

Steps

  1. After you open a synthetic dataset, select the Configuration section.
  2. For a table, click the expand button.

Result

The configuration of the table expands and shows the column name and the configured Generation method, Generation mood, as well as whether a column is Rebalanced or Imputed.

View column details