To generate synthetic data in MOSTLY AI, you start a new synthetic dataset. You can view all finished, canceled, failed, and in-progress synthetic datasets on the Synthetic datasets tab.
For each synthetic dataset, the following actions are available.
- Download synthetic data
- Download QA report, logs, or settings
- Share a synthetic dataset
- Delete a synthetic dataset
- Open a synthetic dataset
A synthetic dataset goes through all necessary tasks to generate synthetic data from your original data.
When a synthetic dataset runs, it completes the same tasks for each table in the original data. You can review all tasks per table in the list below.
- Fetch your original data from your data source
- Analyze your original data
- Train an AI model on your original data, its correlations, distributions, and other statistical properties
- Generate a Model QA report (for the trained AI model)
- Generate synthetic data
- Generate a QA report for the synthetic data
- Save synthetic dataset logs
- Save synthetic dataset configuration settings
- Post-process data
- Export data
- Deliver the generated synthetic data to the configured destination
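As a rough illustration of "the same tasks for each table", the sketch below enumerates the task list for every table in a dataset. It is not MOSTLY AI code: the task labels paraphrase the list above, the table names are made up, and visiting tables one after another is an assumption made only to keep the example simple.

```python
# Illustrative only: task labels paraphrase the documented list,
# and sequential per-table ordering is an assumption.
TASKS = [
    "fetch original data",
    "analyze original data",
    "train AI model",
    "generate Model QA report",
    "generate synthetic data",
    "generate QA report",
    "save logs",
    "save configuration settings",
    "post-process data",
    "export data",
    "deliver to destination",
]


def plan_run(tables):
    """Return (table, task) pairs: every table gets the same task list."""
    return [(table, task) for table in tables for task in TASKS]


# Hypothetical table names.
plan = plan_run(["customers", "orders"])
for table, task in plan:
    print(f"{table}: {task}")
```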
When a synthetic dataset completes, it includes the following list of artifacts.
- Generated synthetic data
- QA report
- Training and generation logs
- Configuration settings
You can create a new synthetic dataset with one of two methods: Upload a file or Use a catalog.
With the method Upload a file, you can synthesize multiple uploaded tables. A table of data can span multiple files.
The supported file formats for your uploaded tables are CSV, TSV, and Parquet.
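Before uploading, it can help to check your files locally against the supported formats. The following is a minimal sketch that assumes the formats map to the file extensions `.csv`, `.tsv`, and `.parquet`; the function name and the file names are hypothetical, and compressed variants such as `.csv.gz` would be flagged by this simple check.

```python
from pathlib import Path

# Assumption: the supported formats correspond to these file extensions.
SUPPORTED_EXTENSIONS = {".csv", ".tsv", ".parquet"}


def unsupported_files(paths):
    """Return the paths whose extension is not in the supported set."""
    return [
        p for p in paths
        if Path(p).suffix.lower() not in SUPPORTED_EXTENSIONS
    ]


# Hypothetical file names.
files = ["customers.csv", "orders.TSV", "events.parquet", "notes.xlsx"]
print(unsupported_files(files))
```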
To download the QA report, logs, or configuration settings for a synthetic dataset, use the Download button.
You can download the logs and settings for failed, canceled, finished, and in-progress synthetic datasets.
The QA report is only available for successfully generated synthetic datasets.
- On the Synthetic datasets screen, click the Download button for a synthetic dataset.
- In the pop-up menu, select a synthetic dataset artifact.
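The availability rules above can be summarized as a small lookup. This is an illustrative sketch, not part of the product: the status strings and artifact names are assumptions chosen to mirror the wording in this section.

```python
# Assumed status labels, mirroring the statuses named in this section.
KNOWN_STATUSES = {"finished", "canceled", "failed", "in-progress"}


def downloadable_artifacts(status):
    """Return the artifacts you can download for a given dataset status."""
    if status not in KNOWN_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    # Logs and settings are available for every status.
    artifacts = ["logs", "settings"]
    # The QA report exists only for successfully generated datasets.
    if status == "finished":
        artifacts.insert(0, "qa_report")
    return artifacts


print(downloadable_artifacts("finished"))
print(downloadable_artifacts("failed"))
```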
You can generate a public URL of a synthetic dataset and share it with anyone.
- Click Share from the Synthetic datasets tab or after you open a synthetic dataset.
- In the drawer, click Copy under Share job using link.
MOSTLY AI generates a new publicly available URL for the synthetic dataset. The URL is copied to your clipboard.
You can now paste and send the URL to people you want to share the synthetic dataset with. When they open the URL, they have access only to specific actions.
- Preview synthetic data
- Download synthetic data
- Download QA report
- Review the QA report contents
People who open a publicly shared synthetic dataset link cannot perform the following actions:
- Generate more data
- Download logs
- Download configuration settings
- Share synthetic data
Even though the Share option is disabled and people cannot generate a new URL from the Share drawer, they can still send the URL you shared with them.
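The access rules for shared links can be written down as two sets. The sketch below is only a way to reason about them; the action identifiers are hypothetical names derived from the lists above and do not correspond to any MOSTLY AI API.

```python
# Hypothetical action identifiers derived from the lists above.
ALLOWED_VIA_LINK = frozenset({
    "preview_synthetic_data",
    "download_synthetic_data",
    "download_qa_report",
    "review_qa_report",
})
BLOCKED_VIA_LINK = frozenset({
    "generate_more_data",
    "download_logs",
    "download_configuration_settings",
    "share_synthetic_data",
})


def can_perform_via_link(action):
    """Return True if a person who opened the shared URL can do the action."""
    if action in ALLOWED_VIA_LINK:
        return True
    if action in BLOCKED_VIA_LINK:
        return False
    raise ValueError(f"unknown action: {action!r}")


print(can_perform_via_link("download_qa_report"))
print(can_perform_via_link("download_logs"))
```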
Use the Delete button to delete an existing synthetic dataset.
When you delete a synthetic dataset, all related artifacts are also deleted and cannot be recovered.
- Synthetic data
- QA report
- Logs and configuration settings
- Click Delete job from the Synthetic datasets tab or after you open a synthetic dataset.
- In the confirmation pop-up, click Delete job.
Click a synthetic dataset in the Synthetic datasets tab to open it. You can do so for synthetic datasets of any status. With a synthetic dataset open, you can view the log and progress of the completed tasks, and review the synthetic dataset configuration.
With a synthetic dataset open, in the Sample data section you can preview up to 100 samples from each generated synthetic table.
Hover over the currently selected table name to switch between tables (if your synthetic dataset includes more than one table). The number of all synthetic data samples in the currently selected table is available in the All samples field.
Click the QA report section to review the QA reports for a synthetic dataset.
For more information, see Read the QA report.
For each table that you synthesize, you can track the tasks as MOSTLY AI completes them: it analyzes and encodes the table data, trains the AI model on it, and then uses the trained model to generate synthetic data.
- After you open a synthetic dataset, select the Logs section to track the synthetic dataset tasks as they make progress.
- (Optional) For a Train AI model task, click the View the training logs button. Step result: You can review the training progress as each epoch completes and, finally, see the selected epoch, which is the one with the lowest validation loss.
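To make the "selected epoch" rule concrete, here is a minimal sketch of picking the epoch with the lowest validation loss from a list of per-epoch values. The function name and the loss values are made up for illustration; how MOSTLY AI computes and records the loss internally is not covered here.

```python
def select_epoch(val_losses):
    """Return the 1-based number of the epoch with the lowest validation loss.

    If several epochs tie for the lowest loss, the earliest one is returned.
    """
    if not val_losses:
        raise ValueError("no epochs to select from")
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1


# Made-up validation losses for epochs 1 to 4.
losses = [0.92, 0.71, 0.64, 0.66]
print(select_epoch(losses))  # prints 3
```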
On the Summary screen, after you click Configuration, you can review the data settings (Generation method and Encoding type) for each column of the table.
- After you open a synthetic dataset, select the Configuration section.
- For a table, click the expand button.
The configuration of the table expands and shows the column name and the configured Generation method, Generation mood, as well as whether a column is Rebalanced or Imputed.