What is synthetic data?

Synthetic data is artificial data that retains many of the characteristics of real-world data. Its defining property, however, is that it does not correspond to real-world entities (such as people, organizations, or institutions).

What is AI-generated synthetic data?

MOSTLY AI provides capabilities to generate tabular synthetic data.

Tabular synthetic data

Original data

Original data is the real data that organizations collect. It can include information about data subjects, event and time-series data, or reference information.

Generative AI model

When you provide your original data to MOSTLY AI, a separate AI model is trained for each of your data tables. Each AI model learns the patterns, correlations, distributions, and dependencies of its data table.

Synthetic data generation

With the trained AI model, MOSTLY AI generates each row of the synthetic data table through random draws. As a result, a generated row does not correspond to the row at the same position in the original table. At the same time, the synthetic data table as a whole is a highly accurate representation of the original data: it retains the patterns, correlations, distributions, and dependencies of the original table.

To keep this conceptual introduction brief, the example covers only single-table synthetic data. You can also learn more about multi-table synthetic data.
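
The single-table flow described above (learn from the original table, then draw rows at random) can be illustrated with a deliberately tiny Python sketch. This is not how MOSTLY AI is implemented: the "model" here is just a pair of frequency tables, the table contents are made up, and the names `train`, `draw`, and `generate` are hypothetical. A toy like this also has none of the privacy mechanisms a real generator needs. It only shows why generated rows do not line up with original rows while the overall distributions are still preserved.

```python
import random
from collections import Counter, defaultdict

# Hypothetical "original" table: one (plan, churned) row per customer.
original = [
    ("basic", "yes"), ("basic", "no"), ("basic", "no"), ("basic", "no"),
    ("premium", "no"), ("premium", "no"), ("premium", "yes"), ("premium", "no"),
    ("basic", "yes"), ("premium", "no"),
]

def train(rows):
    """Learn P(plan) and P(churned | plan) as simple frequency counts."""
    plan_counts = Counter(plan for plan, _ in rows)
    churn_counts = defaultdict(Counter)
    for plan, churned in rows:
        churn_counts[plan][churned] += 1
    return plan_counts, churn_counts

def draw(counter, rng):
    """Draw one value at random, weighted by its learned frequency."""
    values = list(counter)
    weights = [counter[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]

def generate(model, n_rows, seed=0):
    """Generate n_rows synthetic rows by random draws from the model."""
    plan_counts, churn_counts = model
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        plan = draw(plan_counts, rng)            # draw the first column
        churned = draw(churn_counts[plan], rng)  # draw the second, given the first
        rows.append((plan, churned))
    return rows

model = train(original)
synthetic = generate(model, n_rows=10_000)
```

Comparing `Counter(original)` with `Counter(synthetic)` shows similar relative frequencies for each (plan, churned) combination, even though no synthetic row was copied from a specific position in the original table.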

Why AI-generated synthetic data?

Modern organizations collect and store substantial amounts of data, including personal information that they need to protect to ensure the privacy of their customers and business partners. Regulations (such as GDPR) aim to protect private data, which makes it particularly hard to make such data available for broader analysis, testing, or sharing.

One technique that has long been used to unlock data containing private information is to anonymize personal details. Anonymization, however, comes with more than a few challenges.

With AI-generated synthetic data, you can:

  • protect the privacy of your data subjects
  • avoid the pitfalls of the error-prone process of anonymizing data
  • generate synthetic data in an automated way by training AI models that learn the characteristics of your original data
  • achieve high accuracy so that you can use your synthetic data as a "drop-in replacement" for your original data
  • make your synthetic data smarter with:
    • data rebalancing to achieve fairer distributions, reduce bias, or improve model accuracy
    • data imputation to replace missing values with meaningful ones generated by AI models
    • generation mood to control how diverse the created synthetic data should be (for example, boost the creation of outliers and edge cases)
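
To make the idea of data rebalancing more concrete, here is a minimal Python sketch. The source does not describe how MOSTLY AI implements rebalancing, so everything below is an assumption for illustration: the table contents, the frequency-count "model", and the function names `learn_conditionals` and `generate_rebalanced` are all hypothetical. The sketch keeps the learned relationship between columns but overrides the share of one column at generation time.

```python
import random
from collections import Counter, defaultdict

# Hypothetical original table of (label, channel) rows; "fraud" is rare (10%).
original = (
    [("legit", "card")] * 60 + [("legit", "transfer")] * 30
    + [("fraud", "card")] * 2 + [("fraud", "transfer")] * 8
)

def learn_conditionals(rows):
    """Learn P(channel | label) from the original rows as frequency counts."""
    cond = defaultdict(Counter)
    for label, channel in rows:
        cond[label][channel] += 1
    return cond

def generate_rebalanced(cond, target_shares, n_rows, seed=0):
    """Draw labels from user-chosen shares, then channels from learned P(channel | label)."""
    rng = random.Random(seed)
    labels = list(target_shares)
    label_weights = [target_shares[label] for label in labels]
    rows = []
    for _ in range(n_rows):
        label = rng.choices(labels, weights=label_weights, k=1)[0]
        channels = list(cond[label])
        channel_weights = [cond[label][c] for c in channels]
        channel = rng.choices(channels, weights=channel_weights, k=1)[0]
        rows.append((label, channel))
    return rows

conditionals = learn_conditionals(original)
rebalanced = generate_rebalanced(
    conditionals, {"legit": 0.5, "fraud": 0.5}, n_rows=10_000
)
```

In the rebalanced table, roughly half of the rows are labeled "fraud" instead of a tenth, while the proportion of transfers among fraud rows stays close to what was learned from the original data.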