
What is synthetic data?

Synthetic data is defined by how it is generated: by AI trained on real-world data samples. The algorithm first learns the patterns, correlations and statistical properties of the sample data. Once trained, the generator can create statistically identical, synthetic data. The synthetic data looks, feels and means the same as the original sample the algorithm was trained on. The synthetic dataset is a perfect proxy for the original, since it contains the same insights and correlations. Synthetic data can be used safely instead of the original data, for example as training data for building machine learning models and as test data in software testing. Data scientists and data managers are increasingly expected to work with synthetic versions of customer data and to use synthetic data generators for creating synthetic data assets.

What is AI-generated synthetic data?

AI-generated synthetic data is set to revolutionize how we share, use and build datasets. The first synthetic datasets generated by AI were images. Today, synthetic images are an important part of training computer vision algorithms. The next frontier of the synthetic data revolution is tabular, or structured, synthetic data. Companies, governments and researchers that use traditional data anonymization techniques, like data masking, where sensitive parts of the data are simply masked or encrypted, have to live with the so-called privacy-utility trade-off: the more you anonymize, the less useful the data becomes. AI-generated synthetic data offers an alternative that preserves privacy without sacrificing data utility.

What is synthetic data generation and how does it work?

Synthetic data generation is powered by deep generative algorithms. These algorithms use data samples as training data and learn their correlations, statistical properties and data structures. Once trained, the algorithm can generate data that is statistically and structurally identical to the original training data, yet every datapoint is synthetic. Synthetic data subjects look real, but they are AI-generated and completely artificial. When generating synthetic data, it's extremely important to prevent the algorithm from overfitting to the original data. In other words, the AI could potentially learn too well and accidentally reproduce original datapoints. That's why quality checks on synthetic data are necessary. There are open-source synthetic data generators like SDV, which are typically high maintenance and difficult to control for quality. Commercial solutions offer more robust, quality-controlled options.
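The train-then-generate workflow described above can be sketched in a few lines. This is a toy illustration only: it swaps the deep generative model for a simple bivariate normal distribution, and the (age, income) sample, the function names `fit` and `generate` and all numbers are assumptions made up for the example, not part of any real product.

```python
import math
import random
import statistics


def fit(sample):
    """Learn the means, spreads and correlation of a two-column numeric sample."""
    xs = [row[0] for row in sample]
    ys = [row[1] for row in sample]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    std_x, std_y = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in sample) / len(sample)
    return {"mean": (mean_x, mean_y), "std": (std_x, std_y), "corr": cov / (std_x * std_y)}


def generate(model, n, rng):
    """Draw n brand-new rows from the fitted bivariate normal distribution."""
    mean_x, mean_y = model["mean"]
    std_x, std_y = model["std"]
    corr = model["corr"]
    residual = math.sqrt(max(0.0, 1.0 - corr ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + std_x * z1
        y = mean_y + std_y * (corr * z1 + residual * z2)
        rows.append((x, y))
    return rows


rng = random.Random(0)

# Hypothetical "original" sample: (age, income) pairs where income rises with age.
original = []
for _ in range(500):
    age = rng.uniform(20, 70)
    original.append((age, 1000 * age + rng.gauss(0, 5000)))

model = fit(original)                  # training step
synthetic = generate(model, 500, rng)  # generation step
refit = fit(synthetic)

print(f"original correlation:  {model['corr']:.3f}")
print(f"synthetic correlation: {refit['corr']:.3f}")
```

The synthetic rows are drawn from the learned distribution rather than copied from the sample, which is the property a real generator also aims for. A deep generative model would capture far richer structure than the two means, two spreads and one correlation used here.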

Different types of synthetic data and how they compare

It's important to distinguish between AI-generated synthetic data and mock data. Before the AI revolution, the term synthetic data was used to describe all kinds of generated data, like random or mock data. Even today, a lot of people mistake AI-generated synthetic data for simple mock data, even though the two data types couldn't be more different. 

AI-generated synthetic data is sample-based. To generate it, you need a large enough sample dataset; MOSTLY AI's synthetic data generator requires at least 100 subjects. Mock data generators don't require data samples, and the resulting data is completely fake, with no statistical information from real data contained within. While the AI engine of the synthetic data generator can learn and recreate business rules, mock data generators can't.
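A small experiment makes the difference tangible. The sketch below is an illustration under invented assumptions: the "real" sample, the value ranges and the helper `pearson` are all hypothetical. It shows that a mock generator, which fills each column independently, loses the relationship between columns that a sample-based generator is meant to learn.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)

# Hypothetical "real" sample: income rises with age, plus noise.
ages = [rng.uniform(20, 70) for _ in range(1000)]
incomes = [1000 * age + rng.gauss(0, 5000) for age in ages]

# Mock data: every column is drawn on its own from a plausible range,
# so nothing ties a mock person's income to their age.
mock_ages = [rng.uniform(20, 70) for _ in range(1000)]
mock_incomes = [rng.uniform(20_000, 70_000) for _ in range(1000)]

real_corr = pearson(ages, incomes)
mock_corr = pearson(mock_ages, mock_incomes)
print(f"real correlation: {real_corr:.2f}")
print(f"mock correlation: {mock_corr:.2f}")
```

Each mock column looks plausible on its own, but the link between age and income is gone, so any analysis or model that depends on that link would behave differently on mock data than on the real sample.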

Another important difference is between structured and unstructured synthetic data. Synthetic images and video are examples of unstructured synthetic data. Structured synthetic data is tabular data, where datapoints and their relationships are both important properties.

Why do data scientists and data managers need synthetic data generation tools?

Synthetic data generation tools offer simple and effective ways of creating meaningful copies of sensitive and valuable data assets, like patient journeys in healthcare or transaction data in banking. These synthetic customer datasets can be shared and collaborated on safely without the burden of bureaucracy, dangers to privacy and loss of data utility. Building AI and machine learning models, and providing explainability and governance for them, calls for synthetic datasets that are shareable, moldable, disposable, flexibly sized and augmented.

Synthetic data for AI and machine learning is better than real data

“Synthetic data generation accelerates the analytics development cycle, lessens regulatory concerns and lowers the cost of data acquisition.” – Gartner
Synthetic data is the perfect fuel AI and machine learning development projects need. Because of how the synthesis process works, the data can be augmented to fit certain characteristics. Synthetic data is created by generative AI algorithms, which can be instructed to create bigger, smaller, fairer or richer versions of the original data. In a way, synthetic data is like modelling clay for data scientists and data managers. For example, upsampling minority groups in a dataset can improve the performance of machine learning models. A huge production dataset can be subsetted into more manageable sizes for software testing or supersetted for stress testing. Human bias embedded in the original data can be removed by introducing fairness constraints to the generation process. The possibilities of synthetic data are truly endless.
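To show what upsampling and subsetting mean in practice, here is a minimal sketch. It uses plain random oversampling, which repeats existing minority rows, as a stand-in for a generative model, which would create new minority rows instead. The transaction records, the `label` field and the function `upsample_minority` are hypothetical and exist only for this example.

```python
import random
from collections import Counter


def upsample_minority(rows, label_key, rng):
    """Randomly repeat rows of smaller classes until every class matches the largest one."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


rng = random.Random(7)

# Hypothetical, heavily imbalanced dataset: 950 legitimate and 50 fraudulent transactions.
transactions = (
    [{"amount": rng.uniform(5, 200), "label": "legit"} for _ in range(950)]
    + [{"amount": rng.uniform(500, 5000), "label": "fraud"} for _ in range(50)]
)

balanced = upsample_minority(transactions, "label", rng)  # bigger, rebalanced version
subset = rng.sample(transactions, 100)                    # smaller version, e.g. for testing

balanced_counts = Counter(row["label"] for row in balanced)
print(balanced_counts)
print(len(subset))
```

Repeating rows is the simplest way to rebalance a dataset, but it adds no new information. The appeal of a synthetic data generator in this setting is that the additional minority records are newly generated rather than duplicated.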

Synthetic data use cases for real life solutions

“Gartner predicts that 20% of all test data for consumer-facing use cases will be synthetically generated by 2025.” – Gartner
Good quality synthetic data is an accurate representation of the original data. As a result, it can be used as a drop-in replacement for sensitive production data in non-production environments, such as AI training, analytics, software testing and development. Synthetic data is an industry-agnostic solution, used across industries from finance and healthcare to insurance and telecommunications. Price and risk prediction models, customer analytics, explainable AI, developer sandboxes, testing, demoing and building personalized products are the most popular synthetic data use cases, with many more coming.
Synthetic data use cases
AI training
Synthetic data for AI training is better than real data. The synthesis process can also augment the data. By upsampling rare events and patterns, AI algorithms can learn more effectively.
AI governance
Synthetic data for fair and explainable AI systems should be an integral part of every machine learning development project. The synthesis process can remove biases embedded in the original data.
Synthetic test data
As opposed to rule-based test data, synthetic test data is easy to generate. It is highly realistic and flexibly sized. Synthetic test data is a game-changer in software development and testing.

Real life examples of synthetic data projects

AI-generated synthetic data is already used by companies to accelerate their data innovations in a privacy-compliant and agile way. Here are the best synthetic data case studies with real-life challenges, solutions and quantifiable results:
Telefónica, a telecommunications company, uses synthetic customer data for analytics,
Erste Bank used synthetic test data to develop a successful mobile banking app,
A large insurance company used synthetic geolocation data to improve home insurance pricing,
A synthetic data sandbox helps speed up data-intensive POCs with third-party vendors,
A large telco reduces employee churn by analyzing global synthetic employee data,
A large bank improves the performance of its fraud and anomaly detection model with upsampled synthetic fraud data.

How does synthetic data compare to other data anonymization tools?

Legacy data anonymization technologies not only endanger privacy, but also destroy the utility of the data. Synthetic data is the best technology to use when datapoints don’t need to be linked back to originals. We see a lot of companies using pseudonymization as anonymization. But from a legal perspective, pseudonymized data is still personal data, and it needs to be treated and protected as just that. A pseudonymized dataset still contains so-called indirect identifiers, which can be combined to re-identify individuals. Other tools, like generalization, perform well on the privacy front, but fail to preserve data utility.
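The utility cost of generalization can be shown with a short experiment. The sketch below is illustrative only: the age column, the bin widths and the helper names are invented, and the "smallest group" figure is a simple stand-in for a privacy measure (larger groups mean each person hides in a bigger crowd), not a full privacy assessment.

```python
import random


def generalize(value, bin_width):
    """Replace a value with the midpoint of the bin it falls into."""
    return (value // bin_width) * bin_width + bin_width / 2


def smallest_group(values):
    """Size of the smallest group of records that share the same released value."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return min(counts.values())


def mean_abs_error(original, released):
    """Average distance between the true values and the released values."""
    return sum(abs(o - r) for o, r in zip(original, released)) / len(original)


rng = random.Random(1)
ages = [rng.randint(18, 90) for _ in range(2000)]  # hypothetical sensitive column

results = []
for bin_width in (1, 5, 20):
    released = [generalize(age, bin_width) for age in ages]
    results.append((bin_width, smallest_group(released), mean_abs_error(ages, released)))

for bin_width, group_size, error in results:
    print(f"bin width {bin_width:>2}: smallest group {group_size:>4}, mean error {error:.2f} years")
```

Wider bins put every person into a larger group, but the released ages drift further from the true ones. This is the privacy-utility trade-off in miniature: with generalization, each gain on the privacy side is paid for with accuracy.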

How good is the quality of your synthetic data?

A key question for any synthetic data generator is how accurate its output is. Data synthesis is therefore usually accompanied by an automated quality assurance process. The QA checks whether the synthetic data can be trusted to faithfully represent the real world. Each batch of synthetic data generated by MOSTLY AI comes with an automated privacy and accuracy report. We also developed an open-source synthetic data benchmarking tool, the Virtual Data Lab. Feel free to use it to gain insights into the quality of the synthetic data our software generates for you!
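The idea of checking accuracy and privacy side by side can be sketched as follows. This is a simplified illustration and not the MOSTLY AI report or the Virtual Data Lab API: the functions `accuracy_report` and `exact_copy_rate`, the (age, income) rows and both "generators" are assumptions invented for the example.

```python
import random
import statistics


def accuracy_report(original, synthetic, column):
    """Compare simple summary statistics of one numeric column."""
    orig_values = [row[column] for row in original]
    synth_values = [row[column] for row in synthetic]
    return {
        "mean_gap": abs(statistics.fmean(orig_values) - statistics.fmean(synth_values)),
        "stdev_gap": abs(statistics.pstdev(orig_values) - statistics.pstdev(synth_values)),
    }


def exact_copy_rate(original, synthetic):
    """Share of synthetic rows that appear verbatim in the original data."""
    seen = set(original)
    return sum(1 for row in synthetic if row in seen) / len(synthetic)


rng = random.Random(3)


def draw_rows(n):
    """Hypothetical (age, income) rows standing in for real or synthetic records."""
    return [(rng.gauss(45, 12), rng.gauss(40_000, 9_000)) for _ in range(n)]


original = draw_rows(1000)
healthy = draw_rows(1000)               # fresh draws: similar statistics, no copies
memorized = rng.sample(original, 1000)  # an overfitted generator replaying its training data

healthy_accuracy = accuracy_report(original, healthy, column=0)
memorized_accuracy = accuracy_report(original, memorized, column=0)
healthy_leak = exact_copy_rate(original, healthy)
memorized_leak = exact_copy_rate(original, memorized)

print("healthy:  ", healthy_accuracy, "copy rate", healthy_leak)
print("memorized:", memorized_accuracy, "copy rate", memorized_leak)
```

The memorized output scores perfectly on accuracy because it simply is the original data, which is why an accuracy check on its own is not enough. Only the privacy check reveals that every row is a copy.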