Evaluate synthetic dataset quality

MOSTLY AI calculates quality metrics for each synthetic dataset. After generation completes, the metrics are available on the page of the synthetic dataset. Detailed charts and metrics are also available in the Data report for each table.

You can also find the Model report for each table in the generator. All metrics are described in Evaluate generator quality.

The main difference from the Model report is that the Data report shows how rebalancing, temperature, and top P affect the distributions of the affected columns.

Data report

After a synthetic dataset is generated, the Model report and the Data report are available under Data insights for each model in the generator that was used.

MOSTLY AI - Synthetic datasets - open Data report

The Data report for synthetic datasets contains the following sections, which are identical to the corresponding sections in the Model report. The Accuracy and Distances sections of the Model report are not included.

  • Dataset statistics
  • Correlations
  • Univariate distributions
  • Bivariate distributions
  • Coherence / Auto-correlations (linked tables only)

Data insights on rebalancing, temperature, and top P

If you generated your synthetic dataset with one of the flexible generation options (Rebalancing, Temperature, or Top P), you can observe the impact of those settings in the Data report. For example, the image below shows how the Univariate distribution changes for a column that was rebalanced.

The image shows the Univariate distribution in the country column of the players table from the Baseball dataset.

In this case, the country column was rebalanced to contain 25% France, which causes the spike on the left side of the binned chart and the reduction of the USA values on the right-hand side.

Note: The synthetic dataset distribution appears in green, and the original distribution appears in black.
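If you export the original and synthetic data, you can also check the same kind of shift outside the Data report. The following is a minimal sketch in plain Python, not part of MOSTLY AI. The helper names `category_shares` and `share_shift` and the toy country values are hypothetical and are not taken from the Baseball dataset. The sketch compares the share of each category in a column before and after rebalancing.

```python
from collections import Counter


def category_shares(values):
    """Return each category's share of the column as a fraction between 0 and 1."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def share_shift(original, synthetic):
    """Return synthetic share minus original share for every category in either column."""
    orig = category_shares(original)
    synth = category_shares(synthetic)
    categories = sorted(set(orig) | set(synth))
    return {c: synth.get(c, 0.0) - orig.get(c, 0.0) for c in categories}


# Hypothetical toy columns (assumption): France is rare in the original data
# and makes up 25% of the rebalanced synthetic data.
original_country = ["USA"] * 80 + ["France"] * 5 + ["Canada"] * 15
synthetic_country = ["USA"] * 62 + ["France"] * 25 + ["Canada"] * 13

shift = share_shift(original_country, synthetic_country)
for country, delta in shift.items():
    # A positive value means the category is more frequent in the synthetic data.
    print(f"{country}: {delta:+.2f}")
```

A positive shift for the rebalanced category together with negative shifts for the other categories corresponds to the spike and the reduction that the Univariate distribution chart shows.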

Model report - top section - Dataset statistics, Accuracy, Distances