Synthetic datasets

To generate synthetic data in MOSTLY AI, you create a new synthetic dataset. The Synthetic datasets page lists all of your synthetic datasets, whether finished, canceled, failed, or in progress.

What is a synthetic dataset?

A synthetic dataset contains the generated data (single- or multi-table) as well as a number of additional artifacts:

- Generated synthetic data (available for download in CSV, Parquet, and XLSX formats)
- Usage statistics
  - Generated data points
  - Credits used
- Data insights
  - Generator quality: Overall, Univariate, Bivariate, Coherence
  - Distances
  - Model report for the quality of the generator
  - Data report for the quality of the synthetic dataset
- Data samples: 10 samples from the generated data, which you can resample as needed
- Configuration
  - JSON dictionary of the synthetic dataset configuration
  - Python SDK code to access the synthetic data via Python or Jupyter Notebook
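
As a rough illustration of what a configuration dictionary could look like, the sketch below builds one in plain Python and serializes it to JSON. Every key name and value here (`generator_id`, `tables`, `sample_size`, `sampling_temperature`, the table name `customers`) is an assumption made for illustration and may not match the actual schema. Treat the JSON shown under Configuration for your own synthetic dataset as the authoritative version.

```python
import json

# Hypothetical configuration for illustration only. The key names are
# assumptions and may differ from the JSON shown in the Configuration artifact.
synthetic_dataset_config = {
    "name": "Example synthetic dataset",
    "generator_id": "<your-generator-id>",  # placeholder, not a real ID
    "tables": [
        {
            "name": "customers",  # hypothetical table name
            "configuration": {
                "sample_size": 1000,          # number of records to generate
                "sampling_temperature": 1.0,  # illustrative temperature value
            },
        }
    ],
}

# Serialize to JSON text, then parse it back to confirm a lossless round trip.
config_json = json.dumps(synthetic_dataset_config, indent=2)
restored = json.loads(config_json)

print(config_json)
```

Keeping the configuration as a plain dictionary makes it easy to store alongside the generated data, compare between runs, or adapt before starting another synthetic dataset.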

Create a synthetic dataset

For more information, see Generate single- and multi-table synthetic datasets.

Configure a synthetic dataset


- Select a compute environment
- Set sample size and temperature
- Rebalance columns
- Impute data
- Generate fair synthetic data
- Use a seed dataset for conditional generation
- Evaluate quality
- Deliver to databases and cloud buckets