Synthetic datasets

To generate synthetic data in MOSTLY AI, create a new synthetic dataset. The Synthetic datasets page lists all of your synthetic datasets, whether finished, canceled, failed, or in progress.

What is a synthetic dataset?

A synthetic dataset consists of the generated data (single- or multi-table) and a number of additional artifacts:

  • Generated synthetic data (available to download in CSV, Parquet, or XLSX format)
  • Usage statistics
    • Generated data points
    • Credits used
  • Data insights
    • Generator quality - Overall, Univariate, Bivariate, Coherence
    • Distances
    • Model report for the quality of the generator
    • Data report for the quality of the synthetic dataset
  • Data samples - 10 samples drawn from the generated data (you can resample as needed)
  • Configuration
    • JSON dictionary of the synthetic dataset configuration
    • Python client code to access the synthetic data via Python or Jupyter Notebook
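Once the generated data has been downloaded as CSV, the idea behind the "Data samples" artifact (a small set of rows that can be redrawn on demand) can be approximated locally. The following is a minimal sketch using only the Python standard library. It is not MOSTLY AI's implementation or client API; the function name `sample_rows` and the demo columns `id` and `age` are hypothetical and chosen only for illustration.

```python
import csv
import io
import random


def sample_rows(csv_file, n=10, seed=None):
    """Return up to n randomly chosen rows from an open CSV file as dicts.

    Hypothetical helper for illustration; not part of any MOSTLY AI API.
    """
    rows = list(csv.DictReader(csv_file))
    rng = random.Random(seed)
    # Never ask for more rows than the file contains.
    return rng.sample(rows, k=min(n, len(rows)))


# In-memory stand-in for a downloaded CSV export (made-up data, 25 rows).
demo_csv = io.StringIO(
    "id,age\n" + "\n".join(f"{i},{20 + i}" for i in range(25))
)

samples = sample_rows(demo_csv, n=10, seed=0)
print(len(samples))
```

To "resample", call `sample_rows` again on a freshly opened file with a different `seed` (or with `seed=None`). For a real export, pass a file opened with `open(path, newline="")` instead of the in-memory buffer.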

Create a synthetic dataset

For more information, see Generate single- and multi-table synthetic datasets.

Configure a synthetic dataset

  • Set sample size and temperature
  • Rebalance columns
  • Impute data
  • Evaluate quality
  • Deliver to databases and cloud buckets
  • Use a seed dataset for conditional generation
  • Fair synthetic data