Evaluate synthetic dataset quality

MOSTLY AI calculates quality metrics for each synthetic dataset. The metrics appear on the page of a synthetic dataset after generation completes. Detailed charts and metrics are also available in the Data report for each model.

You can also find the Model report for each table in the generator used for the synthetic dataset. All generator quality metrics are described in Evaluate generator quality.

The main difference in the Data report is that it shows whether and how the generation features (Temperature, Top P, Imputation, Rebalancing, and Seeded generation) changed the distribution of the columns they were applied to.

Data insights

In the Data insights section, you can find information about how the synthetic dataset was generated. This includes the original total rows, the generated rows, and whether the generation is representative of the original data or used features that can alter the distribution in the synthetic data.

You can also open the Model and Data reports for each model used in the generation.

MOSTLY AI - Synthetic datasets - Data insights

Original total rows

Original total rows indicates how many rows the table had in the original dataset.

Generated rows

Generated rows indicates how many rows were generated in the synthetic dataset.

Generation

Generation indicates if the generated synthetic data is Representative or if it was Modified.

  • Representative. The synthetic data is generated to be representative of the original data.
  • Modified. A generation feature was used that might change the distribution in the synthetic data compared to the original data. Such features are:
    • Temperature and Top P. Both control how creatively or conservatively data is generated. Non-default values can impact distributions.
    • Rebalancing. When used, it impacts distributions in your synthetic dataset.
    • Imputation. It replaces null or missing values in a column with meaningful values, which can impact the column's distribution.
    • Generate with seed. If you generate with a seed dataset that includes changed distributions in one or more columns, this impacts the synthetic dataset distributions.

Hover over Modified to see the list of features that were used in the generation.

MOSTLY AI - Synthetic datasets - Data insights - Generation modified
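The Representative versus Modified distinction described above can be sketched in a few lines of Python. This is only an illustration of the rule, not MOSTLY AI code: the `GenerationSettings` class, its field names, and the default values of 1.0 for Temperature and Top P are assumptions made for the example. The sketch also simplifies seeded generation by treating any seed as potentially modifying, whereas a seed only changes distributions if the seed data itself has changed distributions.

```python
from dataclasses import dataclass
from typing import List

# Assumed defaults for this sketch; check the actual defaults in your setup.
DEFAULT_TEMPERATURE = 1.0
DEFAULT_TOP_P = 1.0


@dataclass
class GenerationSettings:
    """Hypothetical container for generation settings (not a MOSTLY AI API)."""
    temperature: float = DEFAULT_TEMPERATURE
    top_p: float = DEFAULT_TOP_P
    rebalancing: bool = False
    imputation: bool = False
    seeded: bool = False


def modifying_features(settings: GenerationSettings) -> List[str]:
    """Return the names of the features that may alter distributions."""
    features = []
    if settings.temperature != DEFAULT_TEMPERATURE:
        features.append("Temperature")
    if settings.top_p != DEFAULT_TOP_P:
        features.append("Top P")
    if settings.rebalancing:
        features.append("Rebalancing")
    if settings.imputation:
        features.append("Imputation")
    if settings.seeded:
        # Simplification: a seed only matters if its distributions differ.
        features.append("Generate with seed")
    return features


def generation_label(settings: GenerationSettings) -> str:
    """Label a generation as Representative or Modified."""
    return "Modified" if modifying_features(settings) else "Representative"


print(generation_label(GenerationSettings()))
print(modifying_features(GenerationSettings(top_p=0.9, rebalancing=True)))
```

The list returned by `modifying_features` plays the same role as the list of features shown when you hover over a Modified generation.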

Data report

After a synthetic dataset is generated, the Model and Data reports are available under Data insights for each model in the generator that was used.

MOSTLY AI - Synthetic datasets - open Data report

The Data report for synthetic datasets contains the following sections, which are identical to the corresponding sections in the Model report. The Accuracy and Distances sections are not included.

  • Dataset statistics
  • Correlations
  • Univariate distributions
  • Bivariate distributions
  • Coherence / Auto-correlations (linked tables only)

Data insights on modified generation

If you generated your synthetic dataset with one of the features that modifies generation, you can observe the impact of the settings in the Data report. For example, the image below shows how the Univariate distribution changes for a column that was rebalanced.

The image shows the Univariate distribution in the country column of the players table from the Baseball dataset.

In this case, the country column was rebalanced to contain 25% France, which causes the spike on the left side of the binned chart and the reduction of the USA values on the right-hand side.

Note: The synthetic dataset distribution appears in green, and the original distribution appears in black.

Model report - top section - Dataset statistics, Accuracy, Distances
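If you have both the original and the synthetic table available outside the platform, you can check the same kind of shift yourself. The sketch below computes the share of each category in a column and the difference between the synthetic and original shares. It uses only the Python standard library. The function names are hypothetical and the country values are made-up illustration data, not the actual Baseball dataset.

```python
from collections import Counter
from typing import Dict, Iterable


def category_shares(values: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of rows that fall into each category."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: count / total for category, count in counts.items()}


def share_differences(original: Iterable[str],
                      synthetic: Iterable[str]) -> Dict[str, float]:
    """Return synthetic share minus original share for every category."""
    orig = category_shares(original)
    syn = category_shares(synthetic)
    categories = sorted(set(orig) | set(syn))
    return {c: syn.get(c, 0.0) - orig.get(c, 0.0) for c in categories}


# Made-up example: France is rebalanced to 25% of the synthetic rows.
original_country = ["USA"] * 90 + ["Canada"] * 8 + ["France"] * 2
synthetic_country = ["USA"] * 68 + ["Canada"] * 7 + ["France"] * 25

for country, diff in share_differences(original_country, synthetic_country).items():
    print(f"{country}: {diff:+.2f}")
```

A large positive difference for the rebalanced category together with negative differences for the other categories is the pattern you would expect after rebalancing.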