Concepts
What is synthetic data?

Synthetic data is artificial data that retains many of the characteristics of real-world data. A key property of synthetic data, however, is that it does not correspond to real-world entities (such as people, organizations, or institutions).

What is AI-generated synthetic data?

MOSTLY AI provides capabilities to generate tabular synthetic data.

Tabular synthetic data

Original data

Original data is the real data that organizations collect. It can include information about data subjects, events and time-series data, or reference information.

Generative AI model

When you provide your original data in MOSTLY AI, a separate AI model is trained for each of your data tables. Each model learns the patterns, correlations, distributions, and dependencies of its table.

Synthetic data generation

With the trained AI model, MOSTLY AI performs random draws to generate each row of the synthetic data table. As a result, no generated row corresponds to any specific row in the original table. At the same time, the synthetic table represents the original data with high accuracy: it retains the patterns, correlations, distributions, and dependencies of the original table data.
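The train-then-draw idea can be sketched in a few lines. This is a deliberately simplified toy, not MOSTLY AI's actual model: it "trains" by learning the joint frequency of a two-column table and "generates" by random draws from that learned distribution, so the synthetic rows follow the original patterns without mapping one-to-one to original rows.

```python
import random
from collections import Counter

# Toy original table of (plan, region) pairs. In practice this would be
# a real data table; the frequency table below stands in for the
# generative AI model that is trained per table.
original = [("free", "EU"), ("free", "US"), ("pro", "EU"),
            ("pro", "EU"), ("free", "US"), ("enterprise", "US")]

# "Training": learn the joint distribution of the table's columns.
counts = Counter(original)
rows, weights = zip(*counts.items())

# "Generation": random draws from the learned distribution. Each
# synthetic row is drawn independently, so no row corresponds to a
# specific original row, yet the overall distribution is preserved.
random.seed(0)
synthetic = random.choices(rows, weights=weights, k=1000)

plan_share = Counter(plan for plan, _ in synthetic)
print(plan_share)  # plan mix stays close to the original 3:2:1 ratio
```

Real generative models capture far richer structure (correlations across many columns, sequences, rare values), but the generation step is the same in spirit: sampling from a learned distribution rather than copying rows.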

Why AI-generated synthetic data?

Modern organizations collect and store substantial amounts of data that include personal information, which they need to protect to ensure the privacy of their customers and business partners. Regulations (such as GDPR) aim to protect private data and make it particularly hard to make such data available for broader analysis, testing, or sharing.

One technique that has long been used to unlock data containing private information for further use is to anonymize personal details. Anonymization, however, comes with more than a few challenges.

With AI-generated synthetic data, you can:

  • protect the privacy of your data subjects
  • avoid the pitfalls of the error-prone process of anonymizing data
  • generate synthetic data in an automated way by training AI models that learn the characteristics of your original data
  • achieve high accuracy so that you can use your synthetic data as a "drop-in replacement" for your original data
  • make your synthetic data smarter with
    • rebalancing to achieve fairer distributions, reduce bias, or improve model accuracy
    • imputation to replace missing values with meaningful ones generated by AI models
    • temperature and Top P to control how diverse the generated synthetic data should be (for example, boost the creation of outliers and edge cases)
    • fairness to ensure that you generate synthetic data with fair distributions for any subgroup or attribute you specify