
Historically, synthetic data has been predominantly used to anonymize data and protect user privacy. This approach has been particularly valuable for organizations that handle vast amounts of sensitive data, such as financial institutions, telecommunications companies, healthcare providers, and government agencies. Synthetic data offers a solution to privacy concerns by generating artificial data points that maintain the same patterns and relationships as the original data but do not contain any personally identifiable information (PII).

There are several reasons why synthetic data is an effective tool for privacy use cases:

  1. Privacy by design: Synthetic data is generated in a way that ensures privacy is built into the process from the beginning. By creating data that closely resembles real-world data but without any PII, synthetic data allows organizations to share information without the risk of exposing sensitive information or violating privacy regulations.
  2. Compliance with data protection regulations: Synthetic data helps organizations adhere to data protection laws, such as the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA). Since synthetic data does not contain PII, organizations can share and analyze data without compromising user privacy or breaching regulations.
  3. Collaboration and data sharing: Synthetic data enables organizations to collaborate and share data more easily and securely. By using synthetic data, researchers and analysts can work together on projects without exposing sensitive information or violating privacy rules.

However, recent advancements in technology and machine learning have illuminated the vast potential of synthetic data, extending far beyond the privacy use case. A recent paper by Boris van Breugel and Mihaela van der Schaar describes how AI-generated synthetic data is moving beyond the data privacy use case. In this blog post, we will explore the potential of synthetic data beyond data privacy applications and the direction in which MOSTLY AI's synthetic data platform has been developing, including new features beyond privacy, such as data augmentation and data balancing, domain adaptation, simulations, and bias and fairness.

Data augmentation and data balancing

Synthetic data can be used to augment existing datasets, particularly when there is simply not enough data or when the data representation is imbalanced. Back in 2020, we already showed that by generating more synthetic data than was originally available, it is possible to improve the performance of a downstream task.

Since then, we have seen more and more interest in using synthetic data to boost the performance of machine learning models. There are two distinct approaches to achieve this: amplifying existing data by creating more synthetic data and working only with the synthetic data (as we did in our research), or mixing real and synthetic data.

But synthetic data can also help with highly imbalanced datasets. In the realm of machine learning, imbalanced datasets can lead to biased models that perform poorly on underrepresented data points. Synthetic data generation can create additional data points for underrepresented categories, effectively balancing the dataset and improving the performance of the resulting models. We recently published a blog post on data augmentation with details about how our platform can be used to augment existing datasets.
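To make the rebalancing idea concrete, here is a minimal sketch of the workflow. It uses SMOTE from the imbalanced-learn library purely as a stand-in for a synthetic data generator (MOSTLY AI's generator works differently); the toy dataset, model, and metric are illustrative assumptions.

```python
# Minimal sketch: rebalancing an imbalanced training set with additional
# generated minority-class samples, then comparing downstream performance.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE  # stand-in for a synthetic data generator

# Toy dataset with a 95/5 class imbalance
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: train on the imbalanced data as-is
baseline = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Augmented: generate extra minority-class samples, then retrain
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)
augmented = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)

print("baseline :", balanced_accuracy_score(y_test, baseline.predict(X_test)))
print("augmented:", balanced_accuracy_score(y_test, augmented.predict(X_test)))
```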

Domain adaptation

In many cases, machine learning models are trained on data from one domain but need to be applied to a different domain where no or not enough training data exists, or where it would be costly to obtain that data. Synthetic data can bridge this gap by simulating the target domain's data, allowing models to adapt and perform better in the new environment. One of the advantages of this approach is that the standard downstream models don’t need to be changed and can be compared easily.

This has applications in various industries. Currently, we see the most applications of this use case in the unstructured data space, for example when generating training material for autonomous vehicles, where synthetic data can simulate different driving conditions and scenarios. Similarly, in medical imaging, synthetic data can be generated to mimic different patient populations or medical conditions, allowing healthcare professionals to test and validate machine learning algorithms without the need for vast amounts of real-world data, which can be challenging and expensive to obtain.

However, the same approach and benefits hold true for structured, tabular data as well and it’s an area where we see great potential for structured synthetic data in the future.

Data simulations

But what happens if there is no real-world data at all to work with? Synthetic data can help in this scenario too. Synthetic data can be used to create realistic simulations for various purposes, such as testing, training, and decision-making. Companies can develop synthetic business scenarios and simulate customer behavior.

One example is the development of new marketing strategies for product launches. Companies can generate synthetic customer profiles that represent a diverse range of demographics, preferences, and purchasing habits. By simulating the behavior of these synthetic customers in response to different marketing campaigns, businesses can gain insights into the potential effectiveness of various strategies and make data-driven decisions to optimize their campaigns. This approach allows companies to test and refine their marketing efforts without the need for expensive and time-consuming real-world data collection.

In essence, simulated synthetic data holds the potential of being the realistic data that every organization wishes to have: data that is relatively low-effort to create, cost-efficient, and highly customizable. This flexibility allows organizations to innovate, adapt, and improve their products and services more effectively and efficiently.

Bias and fairness

Bias in datasets can lead to unfair and discriminatory outcomes in machine learning models. These biases often stem from historical data that reflects societal inequalities and prejudices, which can inadvertently be learned and perpetuated by the algorithms. For example, a facial recognition system trained on a dataset predominantly consisting of light-skinned individuals may have difficulty accurately identifying people with darker skin tones, leading to misclassifications and perpetuating racial bias. Similarly, a hiring algorithm trained on a dataset with a higher proportion of male applicants may inadvertently favor male candidates over equally qualified female candidates, perpetuating gender discrimination in the workplace.

Therefore, addressing bias in datasets is crucial for developing equitable and fair machine learning systems that provide equal opportunities and benefits for all individuals, regardless of their background or characteristics.

Synthetic data can help address these issues by generating data that better represents diverse populations, leading to more equitable and fair models. In short: fair synthetic data can be generated from unfair real data. We demonstrated this three years ago in our five-part fairness blog post series, which you can revisit to learn why bias in AI is a problem and how bias correction is one of the main potentials of synthetic data. There we also show the complexity and challenges of the topic, first and foremost how to define what is fair. We see an increasing interest in the market for leveraging synthetic data to address bias and fairness.

There is no question about it: the potential of synthetic data extends far beyond privacy and anonymization. As we showed, synthetic data offers a range of powerful applications that can transform industries, enhance decision-making, and ultimately change the way we work with data. By harnessing the power of synthetic data, we can unlock new possibilities, create more equitable models, and drive innovation in the data-driven world.

Funding for digital healthcare has doubled since the pandemic. Yet, a large gap between digital leaders and life sciences remains. The reason? According to McKinsey's survey published in 2023, a lack of high-quality, integrated healthcare data platforms is the main challenge cited by medtech and pharma leaders as the reason behind the lagging digital performance. As much as 45 percent of these companies' tech investments go to applied artificial intelligence (AI), industrialized machine learning (ML), and cloud and edge computing - none of which can be realized without meaningful data access. 

Healthcare data challenges
Source: Top ten observations from 2022 in life sciences digital and analytics, McKinsey

Why are healthcare data platforms mission-critical? 

Healthcare data is hard to access

Since healthcare data is one of the most sensitive data types, with special protections under HIPAA and GDPR, it's no surprise that getting access is incredibly hard. MOSTLY AI's team has been working closely with healthcare companies and institutions, providing them with the technology needed to create privacy-safe healthcare data platforms. We know firsthand how challenging it is to conduct research and use data-hungry technologies that are already standard in other industries like banking and telecommunications.

Efforts to unlock health data for research and AI applications are already underway at the federal level in Germany. The German data landscape is an especially challenging one: the country is made up of 16 federal states, each with its own laws and regulations as well as siloed health data repositories.

InGef, the Institute for Applied Health Research in Berlin, and other renowned institutions such as Fraunhofer, the Berliner Charité, and the Federal Institute for Drugs and Medical Devices set out to solve this problem. The multi-year program is sponsored by the Federal Ministry of Health; MOSTLY AI won a public tender in 2022 to support the program with its synthetic data generation capabilities. The goal is to develop a healthcare data platform that improves the secure use of health data for research purposes with as-good-as-real, shareable synthetic versions of health records.

According to Andreas Ponikiewicz, VP of Global Sales at MOSTLY AI, 

"Enabling top research requires high quality data that is often either locked away, siloed, or sparse. Healthcare data is considered as very sensitive, therefore the highest safety standards need to be fulfilled. With generative AI based synthetic healthcare data, that contains all the statistical patterns, but is completely artificial, the data can be made available without privacy risk. This makes the data shareable to achieve better collaboration, research outcomes, diagnoses and treatments, and overall service efficiencies in the healthcare sector, which ultimately benefits society overall”.

Protecting health data should be, without a doubt, a priority. According to recent reports, healthcare overtook finance as the most breached industry in 2022, with 22% of data breaches occurring within healthcare companies, up 38% year over year. The so-called de-identification of data required by HIPAA is the go-to solution for many, even though simply removing PII or PHI from datasets never guarantees privacy.

Old-school data anonymization techniques, like data masking, not only endanger privacy but also destroy data utility, which is a major issue for medical research. Finding better ways to anonymize data is crucial for securing data access across the healthcare industry. A new generation of privacy-enhancing technologies is already commercially available and ready to overcome data access limitations. The European Commission's Joint Research Center recommends AI-generated synthetic data for healthcare and policy-making, eliminating data access issues across borders and institutions. 

Healthcare data is incomplete and imbalanced

Research and patient care suffer due to incomplete, inaccurate, and inconsistent data. Filling in the gaps is also an important step in making the data readable for humans. Human readability is especially mission-critical in health research and healthcare, where understanding and reasoning around data is part of the research process. Machine learning models and AI algorithms need to see the data in its entirety too. Any data points masked or removed could contain important intelligence that models can learn from. As a result, heavily masked or aggregated datasets won't be able to teach your models as well as complete datasets with maximum utility.

To overcome these limitations, more and more research teams turn to synthetic data as an alternative. Successfully implemented healthcare data platforms take full advantage of the potential of synthetic data beyond privacy. Synthetic data generated from real seed data offers privacy as well as data augmentation opportunities perfect for improving the performance of machine learning algorithms. The European Commission's Joint Research Center investigated the synthetic patient data option for healthcare applications closely and found that the resulting data "not only can be shared freely, but also can help rebalance under-represented classes in research studies via oversampling, making it the perfect input into machine learning and AI models."

Data rebalancing in MOSTLY AI's synthetic data platform

Healthcare data is biased

Data bias can take many shapes and forms, from imbalances to missing or erroneous data on certain groups of people. Synthetic data generation is a promising tool that can help correct biases embedded in datasets during data preparation. The first step is to find the bias. The more eyes you have on the data, the better the chances of identifying hidden biases. Clearly, this is impossible with sensitive healthcare datasets. Synthetic versions of data can increase the level of access and transparency of important data assets.

Health data is the most sensitive data type

A few years ago, a brand new drug developed for treating Spinal Muscular Atrophy broke all previous records as the most expensive therapy in the world. If you were to query supposedly anonymous health insurance databases around that time, you could easily identify the children who received this drug, simply because its price was such an outlier. But even in the absence of such an extreme value, re-identifying people by linking separate data points together is not difficult.

Surely, locking health data up and securing the environment where the data is stored is the way to go. Except that most data leaks actually originate with a company's own employees. Usually, there is no malicious intent either. A zero-trust approach helps only to an extent and fails to protect from mistakes and accidents that are bound to happen. Locking data up also means less data collaboration, smaller sample sizes, less intelligence, higher costs, worse predictions, and ultimately, more suffering.

For example, the improvement of predictive accuracy of machine learning models regarding health outcomes can not only save costs for healthcare providers but also decrease suffering. The more granular the data, the better the capabilities in predicting how certain cohorts of patients will react to treatments and what the likelihood of a good outcome will be.

In one instance, MOSTLY AI's synthetic data generator was used for the synthetic rebalancing of patient records and predicting which patients would benefit from a therapy that can come with serious side effects. John Sullivan, MOSTLY AI's Head of Customer Experience, has seen how synthetic data generation can transform health predictions from up close. 

"We worked on a dataset for a large North American healthcare company and achieved an increase of accuracy in true positives predicted by the down-stream ML model in the range of 7-13% against a target of 5-10%. The improved model performance means potentially hundreds of patients benefit from early identification, and more importantly, early treatment of their illness. It's huge and extremely motivating." 

Considering that 85% of machine learning models don't even make it into production, synthetic training data is a massive win for data science teams and patients alike. 

Synthetic healthcare data types

Different data types are present along the patient journey, requiring a wide range of tools to extract intelligence from these data sources at scale. Healthcare data platforms should cover the entire patient journey, providing data consumers with an environment ready for 'in-silico' experiments.

Healthcare data types
Source: Databricks - Unlocking the power of health data with a modern data lakehouse

In healthcare, image data in particular receives a lot of attention. From improved breast cancer detection to AI-powered surgical guidance systems, medical science is undergoing a profound transformation. However, the AI revolution doesn't stop at image analyses and computer vision. 

Tabular healthcare data is another frontier for new artificial intelligence applications. Synthetic data generated by AI trained on real datasets is one of the most versatile tools with plenty of use cases and a robust track record, allowing researchers to collaborate even across borders and institutions.  

Examples of tabular healthcare data include electronic health records (EHR) of patients and populations, electronic medical records generated by patient journeys (EMR), lab results, and data from monitoring devices. These highly sensitive structured data types can all be synthesized for ethical, patient-protecting usage without utility loss. Synthetic health data can then be stored, shared, analyzed, and used for building AI and machine learning applications without additional consent, speeding up crucial steps in drug development and treatment optimization processes.

Rare disease research suffers from a lack of data the most. Almost always, the only way to produce significant results is to share datasets across national borders and between research teams. This process is sometimes near-impossible and excruciatingly slow at best. Researchers need to work on the same data in order to draw the same conclusions and validate findings. Synthetic versions of datasets can be shared and merged without compliance issues or privacy risks, allowing rare disease research to advance much more quickly and at a significantly lower cost. Onboarding researchers and scientists to healthcare data platforms populated with synthetic data is easy, since the synthetic version of a dataset does not legally qualify as personal data.

Use cases and benefits for AI-generated synthetic data platforms in healthcare

Let's summarize the most advanced use cases of AI-generated synthetic data in healthcare data platforms and their benefits. These are based on our experience and direct observations of the healthcare industry from up close.

Machine learning model development with synthetic data

The reason why most machine learning projects fail is a lack of high-quality, large-quantity, realistic data. Synthetic data can be safely used in place of real patient data to train and validate machine learning models. Synthetic data generators can shorten time-to-market by unlocking valuable data assets and taking care of data preparation steps such as granular data exploration, data augmentation, and data imputation. As data, rather than code, increasingly drives model performance, healthcare data platforms are becoming the most important part of MLOps infrastructures and machine learning product development.

Data synthesis for data privacy and compliance

Protected health information, or PHI, is heavily regulated both by HIPAA and GDPR. PHI can only be accessed by authorized individuals for specific purposes, which makes all secondary use cases and further research practically impossible. Organizations disregarding these rules face heavy fines and eroding patient trust. Synthetic data can be used to protect patient privacy by preserving sensitive information while still allowing for data analysis.

Testing and simulation with synthetic data generators

Synthetic data helps researchers to forecast the effects of greater sample size and longer follow-up duration on already existing data, thus informing the design of the research methodology. MOSTLY AI's synthetic data generator, in particular, is well suited to carry out on-the-fly explorations, allowing researchers to query what-if scenarios and reason around data effectively.

Data repurposing for research and development 

Often, using data for secondary purposes is challenging or downright prohibited by regulators. Synthetic data can overcome these limitations and be used as a drop-in replacement to support research and development efforts. Researchers can also use synthetic data to create a so-called 'synthetic control arm.' According to Meshari F. Alwashmi, a digital health scientist:

"Instead of recruiting patients to sign up for trials to not receive the treatment (being the control group), they can turn to an existing database of patient data. This approach has been effective for interpreting the treatment effects of an investigational product in trials lacking a control group. This approach is particularly relevant to digital health because technology evolves quickly and requires rapid iterative testing and development. Furthermore, data collected from pilot studies can be utilized to create synthetic datasets to forecast the impact of the intervention over time or with a larger sample of patients."

Population health analysis for policy-making 

AI-generated synthetic data can be used to model and analyze population health, including disease outbreaks, demographics, and risk factors. According to the European Commission's Joint Research Center, synthetic data is "becoming the key enabler of AI in business and policy applications in Europe." Especially since the pandemic, the urgency to provide safe and accessible healthcare data platforms on the population level is on the increase and the idea of data democratization is no longer a far-away utopia, but a strong driver in policy-making.   

Data collaborations for innovation

Creating healthcare data platforms by proactively synthesizing and publishing health data is a winning strategy we see at large organizations. Humana, one of the largest health insurance providers in North America, launched a synthetic data exchange platform to facilitate third-party product development. By joining the sandbox, developers can access synthetic, highly granular datasets and create products with laser-sharp personalizations, solving real-life problems. In a similar project, Merkur Insurance in Austria uses MOSTLY AI's synthetic data platform to develop new services and personalized customer experiences, for example by developing machine learning models using privacy-safe and GDPR-compliant training data.

Ethical and explainable AI

Synthetic data generation provides additional benefits to AI and machine learning development. Ethical AI is one area where synthetic data research is advancing rapidly. Fair models and predictions need fair data inputs, and the process of data synthesis allows research teams to explore definitions of fairness and their effects on predictions with fast iterations. Furthermore, the introduction of the first AI regulations is only a matter of time. With regulations comes the need for explainability. Explainable AI needs synthetic data - a window into the souls and inner workings of algorithms, without which trust can never be truly established.

If you would like to explore what a synthetic healthcare data platform can do for your company or research, contact us and we'll be happy to share our experience and know-how.

According to Gartner, "data and analytics leaders who share data externally generate three times more measurable economic benefit than those who do not." Yet, organizations struggle to collaborate on data even within their own walls. No matter the architecture, somehow, everyone ends up with rigid silos and uncooperative departments. Why? Because data collaboration is a lot of work.

The data mesh approach to collaboration

Treating data as a product and assigning ownership to people closest to the origins of the particular data stream makes perfect sense. The data mesh architecture attempts to reassign data ownership from a central focal point to decentralized data owners with domain knowledge embedded into teams across the entire organization. But the data mesh is yet to solve the cultural issues. What we see time and time again at large organizations is people becoming overly protective of the data they were entrusted to govern. Understandably so. The zero trust approach is easy to adopt in the world of data, where erring on the side of caution is justified. Data breaches are multimillion-dollar events, damaging reputations on all levels, from organizational to personal. Without trusted tools to automatically embed governance policies into data product development, data owners will always remain reluctant to share and collaborate, no matter the gains interconnecting data products offer.  

The synthetic data mesh for data collaboration

Data ecosystems are already being built with synthetic data, accelerating AI adoption in the most data-critical industries, such as finance. Talking about accelerating data science in finance, Jochen Papenbrock, Head of Financial Technology at NVIDIA, said:

"Synthetic data is a key component for evaluating AI models, and it's also a key component of collaboration in the ecosystem. My personal belief is that as we see a strong growth of AI adoption, and we'll see a strong growth in the adoption of synthetic data at the same speed."

So making synthetic data generation tools readily available for data owners should be considered a critical component of the data mesh. Proactively synthesizing and serving data products across domains is the next step on your journey of weaving the data mesh and scaling data collaborations. Readily available synthetic data repositories create new, unexpected value for data consumers and the business.

Synthetic data architecture

Examples of synthetic data products

Accelerating AI innovation is already happening at companies setting the standards for data collaborations. Humana, one of the largest North American health insurance providers, launched a synthetic data exchange to accelerate data-driven collaborations with third-party vendors and developers. Healthcare data platforms populated with realistic, granular and privacy safe synthetic patient data are mission-critical for accelerating research and product development.

Sometimes data silos are legal requirements, and certain data assets cannot be joined for compliance reasons. Synthetic data versions of these datasets serve as drop-in replacements and can interconnect the domain knowledge contained in otherwise separated data silos. In these cases, synthetic data products are the only option for data collaboration.

In other cases, we've seen organizations with a global presence use synthetic data generation for massive HR analytics projects, connecting employee datasets from all over the world in a way that is compliant with the strictest regulations, including GDPR.

The wide adoption of AI-enabled data democratization represents a breakthrough in how data consumers access data and create value. The intelligence data contains should no longer be locked away in carefully guarded vaults but should flow freely between teams and organizations.

The benefits of data collaborations powered by synthetic data

Shareable synthetic data helps data owners who want to collaborate and share data inside and outside of their organizations by reducing time-to-data and governance costs, enabling innovation, democratizing data, and increasing data literacy - unlike legacy data anonymization, which reduces data utility. The reduction in time-to-data in itself is significant.

"According to our estimates, creating synthetic data products results in a 90%+ reduction in time-to-consumption in downstream use cases. Less new ideas are left on the cutting room floor, and more data is making an impact in the business.” says John Sullivan, Head of Customer Experience at MOSTLY AI.

MOSTLY AI's synthetic data platform was created with synthetic data products in mind - synthetic data can be shared directly from the platform together with the automatically generated quality assurance report. 

Sharing synthetic data directly from MOSTLY AI's synthetic data platform

Data mesh vs. data fabric with synthetic data in mind

Mario Scriminaci, MOSTLY AI's Chief Product Officer, thinks that the concepts of the data mesh and the data fabric are often perceived as antithetical.

“The difference between the two architectures is that the data mesh pushes for de-centralization, while the data fabric tries to aggregate all of the knowledge about metadata. In reality, they are not mutually exclusive. The concepts of the data mesh and the data fabric can be applied simultaneously in big organizations, where the complexity of data architecture calls for a harmonized view of data products. With the data consumption and data discovery initiatives, synthetic data generation will help centralize the knowledge of data and datasets (aka. the data fabric) and, at the same time, will also help customize datasets to domain-specific needs (aka. data mesh).”

In a data mesh architecture, data ownership and privacy are crucial considerations. Synthetic data generation techniques allow organizations to create realistic data that maintains privacy. It enables data collaboration between teams across organizations to produce and share synthetic data products with high utility.

Data mesh architectures promote the idea of domain-oriented, self-serve data teams. Synthetic data allows teams to experiment, develop, and test data pipelines and applications independently, fostering agility and making data democratization an everyday reality.

Synthetic data products also eliminate the need to replicate or move vast volumes of real data across systems, making it easier to scale and optimize data processing pipelines and enabling data collaboration at scale.

Smart data imputation with AI-generated synthetic data is superior to all other methods out there. Synthetic data generated with MOSTLY AI is highly representative, highly realistic, granular-level data that can be considered 'as good as real'. While maintaining complete protection of each data subject's privacy, it can be openly processed, used, and shared among your peers. MOSTLY AI's synthetic data serves various use cases, and the initiatives our customers have achieved so far demonstrate the value our synthetic data platform has to offer.

To date, most of our customers' use cases centered around generating synthetic data for data sharing while maintaining privacy standards. However, we believe that there is more on offer for data consumers around the world. Poor-quality data is a problem for data scientists across all industries.

Real-world datasets have missing information for various reasons. This is one of the most common issues data professionals have to deal with. The latest version of MOSTLY AI's synthetic data generator introduces features that let users interact with their original dataset - the so-called 'data augmentation' features. Among them is our 'Smart Imputation' technique, which can accurately recreate the original distribution while filling the gaps left by missing values. This is gold for analytical purposes and data exploration!

What is smart data imputation?

Data imputation is the process of replacing missing values in a dataset with non-missing values. This is of particular interest if the analysis or the machine learning algorithm cannot handle missing values on its own and would otherwise need to discard partially incomplete records.

Many real-world datasets have missing values. On the one hand, some missing values carry important information in themselves, depending on the business: for instance, a missing value in the 'Death Date' column means that the customer is still alive, or a missing value in the 'Income' column means that the customer is unemployed or underage. On the other hand, missing values are often caused by an organization's inability to capture this information.

Thus, organizations look for methods to impute missing values in the latter case, because these gaps in the data can cause a range of problems for analysis and machine learning.

As a result, data scientists may employ rather simplistic methods to impute missing values, which are likely to distort the overall value distribution of the dataset. These strategies include frequent-category imputation, mean/median imputation, and arbitrary value imputation. It is also worth noting that well-known machine learning libraries like scikit-learn offer several univariate and multivariate imputation algorithms, including, respectively, "SimpleImputer" and "IterativeImputer."

Finally, the scikit-learn 'KNNImputer' class, which offers imputation for filling in missing data using the k-Nearest Neighbors approach, is a popular technique that has gained considerable attention recently.

MOSTLY AI's Smart Imputation technique seeks to produce precise and accurate synthetic data, so that the final product is immediately of "better" quality than the original dataset. It is important to note that MOSTLY AI is not yet another tool that merely replaces a dataset's missing values. Instead, we give our customers the ability to create entirely new datasets free of any missing values. If this piques your curiosity, continue reading and verify the findings for yourself.

Data imputation with synthetic values

Evaluating smart imputation

To evaluate MOSTLY AI's Smart Imputation feature, we devised the following approach to compare the original target distribution with the synthetic one.

Starting with the well-known US-Census dataset, we use the 'age' column as our target column. The dataset has approximately 50k records and includes 2 numerical and 9 categorical variables. The average age in the target column is 38.6 years, with a range of 17 to 90 years and a standard deviation of 13.7 years.

US-Census dataset - age column distribution

Our research begins by semi-randomly introducing missing values into the US-Census dataset's "age" column. The goal is to compare the original distribution with the smartly imputed distribution and see whether we can correctly recover the original one.

To introduce missing values into the original dataset, we applied a masking logic that artificially biases the non-missing values towards younger age segments: the age attribute was randomly set to missing with a higher probability for older records.

It's important to note that by doing this, the algorithm won't be able to find any patterns or rules on where the missing values are located.

As a result, the 'age' column now contains values that appear to be missing semi-randomly. The remaining non-missing values in the column are therefore skewed towards younger people:

Missing values in the age column of the US-Census dataset
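As an illustration, the following sketch introduces missing values into an 'age' column with a probability that grows with age, which biases the remaining non-missing values towards younger records. The file name, column name, and masking probabilities are illustrative assumptions, not the exact logic used in the experiment.

```python
import numpy as np
import pandas as pd

# US-Census (Adult) data with an 'age' column; the path is an assumption
df = pd.read_csv("adult.csv")

rng = np.random.default_rng(42)
# Masking probability rises from ~5% for the youngest to ~60% for the oldest
p_missing = np.interp(df["age"], [df["age"].min(), df["age"].max()], [0.05, 0.60])
df.loc[rng.random(len(df)) < p_missing, "age"] = np.nan

print(f"share of missing ages: {df['age'].isna().mean():.1%}")
print(f"mean of remaining ages: {df['age'].mean():.1f}")
```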

As a next step, we synthesized and smartly imputed the US-Census dataset with the semi-random missing values on the "age" column using the MOSTLY AI synthetic data platform.

We carried out two synthetic data generation runs. The first generates synthetic data without enabling imputation; as expected, the synthetic dataset matches the distribution of the data used to train the model (Synthetic with NAs - light green). The second generates synthetic data with MOSTLY AI's Smart Imputation feature enabled for the 'age' column. As we can see, the smartly imputed synthetic data perfectly recovers the original distribution!

Evaluation of synthetic data imputation

After adding the missing values to the original dataset, we started with an average age of 37.2 and used the Smart Imputation technique to reconstruct the 'age' column. The reconstructed column, with an average age of 39, accurately recovers the initial distribution of the US-Census data, which had an average age of 38.6.

These results are great for analytical purposes. Data scientists now have access to a dataset that allows them to operate without being hindered by missing values. Now let's see how the synthetic data generation method compares to other data imputation methods.

Data imputation methods: a comparison

Below, we describe six of the main imputation techniques for numerical variables and compare their results with our Smart Imputation algorithm. For each technique, we present summary statistics of the 'age' distribution as well as a visual comparison against MOSTLY AI's results.

Arbitrary value data imputation

Arbitrary value imputation is a type of data imputation technique used in machine learning to fill in missing values in datasets. It involves replacing missing values with a specified arbitrary value, such as 0, 99, 999, or a negative value. The goal is to flag the missing values rather than estimate them using statistical averages or other methods.

This strategy is quite simple to execute, but it has a number of disadvantages. For starters, if the arbitrary number used is not indicative of the underlying data, it injects bias into the dataset. For example, filling missing ages with 999 creates an artificial spike far outside the true range, so the imputed values no longer reflect the underlying distribution of the data.

Using an arbitrary value can limit dataset variability, making it more difficult for machine learning algorithms to find meaningful patterns in the data. As a result, forecast accuracy and model performance may suffer.
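A minimal sketch of arbitrary value imputation with pandas; the column name and the constant 999 are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 31, np.nan, 47, np.nan, 58]})

# Flag missing ages with an arbitrary constant instead of estimating them
df["age_imputed"] = df["age"].fillna(999)
print(df)
```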

As you can see, the variable was given new peaks, drastically altering the initial distribution.

Arbitrary value data imputation vs. synthetic data imputation results

Start/End of Distribution data imputation

Start/End of Distribution imputation is another data imputation technique used to fill in missing values in datasets. It involves replacing missing values with values at the beginning or end of the distribution of the non-missing values in the dataset.

If the missing values are numeric, for example, the procedure involves replacing them with the minimum or maximum of the dataset's non-missing values. If the missing values are categorical, the procedure involves filling in the gaps with the most frequently occurring category (i.e., the mode).

Like the previous technique, it is simple to implement, and ML models can still capture the significance of the missing values. The main drawback is that we might end up with a distorted dataset, as the mean and variance of the distribution might change significantly.
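A minimal sketch of start/end of distribution imputation, filling missing values with the minimum or maximum of the observed values; the column name and values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 31, np.nan, 47, np.nan, 58]})

# Start of distribution: fill with the minimum observed value
df["age_start"] = df["age"].fillna(df["age"].min())
# End of distribution: fill with the maximum observed value
df["age_end"] = df["age"].fillna(df["age"].max())
print(df)
```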

Similar to the previous technique, the variable was given new peaks, drastically altering the initial distribution.

Start/End Distribution data imputation vs synthetic data imputation
Start/End Distribution data imputation results

Mean/Median/Mode Imputation

Mean/Median/Mode imputation is probably the most popular data imputation method, at least among beginners. The Mean/Median/Mode data imputation method tries to impute missing numbers using statistical averages.

Mean data imputation involves filling the missing values with the mean of the non-missing values in the dataset. Median imputation involves filling the missing values with the median of the non-missing values in the dataset. Mode imputation involves filling the missing values with the mode (i.e., the most frequently occurring value) of the non-missing values in the dataset.

These techniques are straightforward to implement and useful when dealing with missing values in small datasets or datasets with a simple structure. However, if the mean, median, or mode is not indicative of the underlying data, they can introduce bias into the dataset.
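A minimal sketch of mean, median, and mode imputation with pandas; the column and values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 31, np.nan, 47, np.nan, 58]})

df["age_mean"] = df["age"].fillna(df["age"].mean())
df["age_median"] = df["age"].fillna(df["age"].median())
df["age_mode"] = df["age"].fillna(df["age"].mode()[0])
print(df)
```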

The results start to look better than those of the previous techniques; however, as can be seen, the imputed distributions are still distorted.

Mean/Median/Mode data imputation vs synthetic data imputation
Mean/Median/Mode data imputation results

Scikit-learn - SimpleImputer data imputation

Scikit-learn is a well-known Python machine learning library. The SimpleImputer class in its sklearn.impute module provides a simple and efficient way to impute missing values in datasets.

The SimpleImputer class can be used to fill in missing data using several strategies, such as the mean, median, most frequent value, or a constant value, applied along each column. SimpleImputer is a univariate imputation algorithm that comes out of the box with the scikit-learn library.
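A minimal sketch using scikit-learn's SimpleImputer to fill missing ages with the column median; the toy data is an illustrative assumption.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, 31, np.nan, 47, np.nan, 58]})

imputer = SimpleImputer(strategy="median")
df["age_imputed"] = imputer.fit_transform(df[["age"]]).ravel()
print(df)
```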

The results below are similar to the results of the previous technique:

Scikit-learn SimpleImputer data imputation vs. synthetic data imputation results

Scikit-learn - IterativeImputer data imputation

Another class in scikit-learn's sklearn.impute module that may be used to impute missing values in datasets is IterativeImputer. As opposed to SimpleImputer, IterativeImputer uses a model-based imputation strategy, imputing missing values by modeling the relationships between variables.

IterativeImputer estimates missing values using a machine learning model. The class supports a variety of models, including linear regression, Bayesian ridge regression, k-nearest neighbours regression, decision trees, and random forests.
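A minimal sketch using scikit-learn's IterativeImputer, which predicts missing ages from the other numeric columns; the toy data and column names are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (exposes IterativeImputer)
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "age": [25, 31, np.nan, 47, np.nan, 58],
    "hours_per_week": [40, 38, 45, 40, 20, 50],
    "education_num": [10, 12, 9, 14, 11, 13],
})

imputer = IterativeImputer(random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```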

Moving on to the more sophisticated techniques, you can see that the imputed distribution is getting closer to the original 'age' distribution.

Scikit-learn IterativeImputer data imputation vs. synthetic data imputation results

Scikit-learn - KNNImputer data imputation

Let's look at something a little more complex. K-Nearest Neighbors, or KNN, is a straightforward method that bases predictions on a specified number of nearest neighbours. It determines the distances between each instance in the dataset and the instance you want to classify; here, classification refers to imputation.

It is simple to implement and optimize, and in comparison to the other methods employed so far, it is also a little bit 'smarter'. Unfortunately, it is sensitive to outliers, and it can be used only on numerical variables, so only the numerical columns of the US-Census dataset were used to produce the results below.
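A minimal sketch using scikit-learn's KNNImputer on numerical columns only; the toy data and the number of neighbours are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age": [25, 31, np.nan, 47, np.nan, 58],
    "hours_per_week": [40, 38, 45, 40, 20, 50],
    "education_num": [10, 12, 9, 14, 11, 13],
})

imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```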

The summary statistics look very close to the original distribution, even if the visual representation is still not perfect. The imputed distribution is getting closer to the original one, and visually KNNImputer produces the best results so far.

Scikit-learn KNNImputer data imputation vs. synthetic data imputation results

Comparison of data imputation methods: conclusion

Six different techniques were used to impute the US-Census 'age' column. Starting with the simplest ones, we have seen that the distributions are distorted and the utility of the new dataset drops significantly. Moving to the more advanced methodologies, the imputed distributions start to resemble the original one but are still not perfect.

We have plotted all the imputed distributions against the original, as well as the distribution generated by MOSTLY AI's Smart Imputation feature. We can clearly conclude that AI-powered synthetic data imputation captures the original distribution better.
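For reference, here is a minimal sketch of how such an overlay plot can be produced; 'original_age' and the 'imputed' dictionary (method name mapped to an imputed pandas Series) are assumed inputs, not outputs of our platform.

```python
import matplotlib.pyplot as plt

def plot_imputation_comparison(original_age, imputed):
    """Overlay the original age distribution with each imputed version."""
    fig, ax = plt.subplots(figsize=(8, 4))
    original_age.plot.kde(ax=ax, label="original", linewidth=2)
    for name, series in imputed.items():
        series.plot.kde(ax=ax, label=name, alpha=0.7)
    ax.set_xlabel("age")
    ax.legend()
    fig.tight_layout()
    plt.show()
```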

We at MOSTLY AI are excited about the potential that Smart Imputation and the rest of our data augmentation and data diversity features offer our customers. We would like to see more organizations across industries use synthetic data to reduce the time-consuming task of dealing with missing data - time that data professionals can instead use to produce valuable insights for their organizations.

We are eager to explore these paths further with our customers to support their ML/AI endeavours at a fraction of the time and expense, as the explorations in this blog post have shown is possible. If you are currently struggling with missing values in your data, check out MOSTLY AI's synthetic data generator and try Smart Imputation on your own.

Learn how to generate high-quality synthetic datasets from real data samples without coding or credit cards in your pocket. Here is everything you need to know to get started. 

In this blog post, you will learn what you need to generate synthetic data, when a synthetic data generation tool makes sense, how to prepare your sample data, and how to run your first synthesization step by step.

What do you need to generate synthetic data?

If you want to know how to generate synthetic data, the good news is that you need absolutely no coding knowledge to synthesize datasets on MOSTLY AI's synthetic data platform. Even better, you can access the world's best quality synthetic data generation for free, generating up to 100K rows daily. Not even a credit card is required, only a suitable sample dataset.

First, you need to register a free synthetic data generation account using your email address. Second, you need a suitable data sample. If you want to generate synthetic data using an AI-powered tool like MOSTLY AI, you need to know how to prepare the sample dataset that the AI algorithm will learn from. We'll tell you all about what makes a dataset ready for synthesization in this blog post.

When do you need a synthetic data generation tool?

Generating synthetic data based on real data makes sense in a number of different use cases and data protection is only one of them. Thanks to how the process of synthetic data generation works, you can use an AI-powered synthetic data generator to create bigger, smaller or more balanced, yet realistic versions of your original data. It’s not rocket science. It’s better than that - it’s data science automated.

When choosing a synthetic data generation tool, you should take two very important benchmarks into consideration: accuracy and privacy. Some synthetic data generators are better than others, but all synthetic data should be quality assured, and MOSTLY AI's platform generates an automatic privacy and accuracy report for each synthetic dataset. What's more, MOSTLY AI's synthetic data is of better quality than open-source synthetic data.

If you know what you are doing, it's really easy to generate realistic and privacy-safe synthetic alternatives to your structured datasets. MOSTLY AI's synthetic data platform offers a user interface that is easy to navigate and requires absolutely no coding. All you need is sample data and a good understanding of the building blocks of synthetic data generation. Here is what you need to know to generate synthetic data.

Generate synthetic data straight from your browser

MOSTLY AI's synthetic data platform allows you to get hands-on and experiment with synthetic data generation quickly and easily. Register your free forever account and synthesize up to 100K rows of production-like data daily!

What is a data sample?

Generative tabular data is based on real data samples. In order to create AI-generated synthetic data, you need to provide a data sample of your original data to the synthetic data generator to learn its statistical properties, like correlations, distributions and hidden patterns.

Ideally, your sample dataset should contain at least 5,000 data subjects (= rows of data). If you don't have that much data, that doesn't mean you shouldn't try - go ahead and see what happens. But don't be too disappointed if the resulting data quality is not satisfactory. Automatic privacy protection mechanisms are in place to protect your data subjects, so you won't end up with something potentially dangerous in any case.

What are data subjects?

The data subject is the person or entity whose identity you want to protect. Before considering synthetic data generation, always ask yourself whose privacy you want to protect. Do you want to protect the anonymity of the customers in your webshop? Or the employees in your company? Think about whose data is included in the data samples you will be using for synthetic data generation. These are your data subjects.

The first step of privacy protection is to clearly define the protected entity. Before starting the synthetic data generation process, make sure you know who or what the protected entities of a given synthesization are.

What is a subject table?

The data subjects are defined in the subject table. The subject table has one very crucial requirement: one row is one data subject. All the information which belongs to a single subject - e.g. a customer, or an employee - needs to be contained in the row that belongs to the specific data subject. In the most basic data synthesization process, there is only one table, the subject table.

This is called a single-table synthesization and is commonly used to quickly and efficiently anonymize datasets describing certain populations or entities. In contrast with old data anonymization techniques like data masking, aggregation or randomization, the utility of your data will not be affected by the synthesization.

How and when to synthesize a single subject table?

Synthesizing a single subject table is the easiest and fastest way to generate highly realistic and privacy safe synthetic datasets. If you are new to synthetic data generation, a single table should be the first thing you try.

If you want to synthesize a single subject table, your original or sample table needs to meet a few criteria. Most importantly, each row must describe exactly one data subject, and the rows must be independent of each other.

Information entered into the subject table should not be time-dependent. Time-series data should be handled in two or more tables, called linked tables, which we will talk about later.

A single table for synthetic data generation
A single table for synthetic data generation

 What is not a subject table? 

So, how do you spot whether the table you would like to synthesize is not a subject table? If the same data subject appears in the table twice, in different rows, it's fair to say that the table cannot be used as a subject table as it is. In the example below, you can see a company's records of sick leaves. Since more than one row belongs to the same person, this table would not work as a subject table.

Example of a table that contains personal information but is not a subject table

There are some other examples when a table cannot be considered a subject table. For example, when the rows contain overlapping groups, the table cannot be used as a subject table, because the requirement of independent rows is not met.

Example of a badly structured subject table
Rows of a subject table cannot contain overlapping groups 

Another example of a dataset not suitable for single-table synthesization is one that contains behavioral or time-series data, where the different rows come with time dependencies. Tables containing data about events need to be synthesized in a two-table setup.

If your dataset is not suitable as a subject table "out of the box" you will need to perform some pre-processing of the data to make it suitable for data synthesization.
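As a minimal sketch of such pre-processing, the event-level sick-leave table from the example above could be collapsed into one row per employee with pandas; all table names, column names, and values are illustrative assumptions.

```python
import pandas as pd

sick_leaves = pd.DataFrame({
    "employee_id": [101, 101, 102, 103, 103, 103],
    "department":  ["IT", "IT", "HR", "Sales", "Sales", "Sales"],
    "leave_start": ["2023-01-10", "2023-03-02", "2023-02-14",
                    "2023-01-05", "2023-04-20", "2023-06-01"],
    "days":        [3, 1, 5, 2, 4, 1],
})

# One row per data subject: keep static attributes, aggregate the events
subject_table = (
    sick_leaves.groupby("employee_id")
    .agg(department=("department", "first"),
         leave_count=("leave_start", "count"),
         total_days=("days", "sum"))
    .reset_index()
)
print(subject_table)
```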

It’s time to launch your first synthetic data generation job!

How to generate synthetic data step by step

As a reminder, no coding knowledge and no credit card are needed - just a free account registered with your email address and a suitable subject table.

Step 1 - Upload your subject table

Once you are inside MOSTLY AI’s synthetic data platform, you can upload your subject table. Click on Jobs, then on Launch a new job. Your subject table needs to be in CSV or Parquet format. We recommend using Parquet files. 

Feel free to upload your own dataset - it will be automatically deleted once the synthetic data generation has taken place. MOSTLY AI’s synthetic data platform runs in a secure cloud environment and your data is kept safe by the strictest data protection laws and policies on the globe, since we are a company with roots in the EU.

Synthetic data generation - first step - upload data
Upload CSV or Parquet files or use data connectors

Step 2 - Check column types

Once you upload your subject table, it's time to check your table's columns under the Table details tab. MOSTLY AI's synthetic data platform automatically identifies the main supported column types, but you can override what was detected if needed. There are other column types you can use too, for example text or location coordinates.

Synthetic data generation - detecting data types for columns
Column types are automatically detected by MOSTLY AI, but you can override column types manually if needed

Step 3 - Train and generate

Under the Settings tab, you have the option to change how the synthesization is done. You can specify how many data subjects you want the synthetic data generator to learn from and how many you want to generate. Changing these makes sense for different use cases.

For example, if you want to generate synthetic data for software testing, you might choose to downsample your original dataset into smaller, more manageable chunks. You can do this by entering a smaller number of generated subjects under Output settings than what is in your original subject tables. 

Synthetic data generation downsampling output
Create smaller, yet statistically representative versions of your datasets using MOSTLY AI's synthetic data generation platform

Pro tip from our data scientists: enter a smaller number of training subjects than what your original dataset has to launch a trial synthesization. Let’s say you have 1M rows of data. Use only 10K of the entire data set for a test run. This way you can check for any issues quickly. Once you complete a successful test run, you can use all of your data for training. If you leave the Number of training subjects field empty, the synthetic data generator will use all of the subjects of your original dataset for the synthesization. 

Generating more data samples than what was in the original dataset can be useful too. Using synthetic data for machine learning model training can significantly improve model performance. You can simply boost your model with more synthetic samples than what you have in production or upsample minority records with synthetic examples.

You can also optimize for a quicker synthesization by changing the Training goal from Accuracy to Speed.

Optimize synthetic data generation in accordance with your use case

Once the process of synthesization is complete, you can download your very own synthetic data! Your trained model is saved for future synthetic data generation jobs, so you can always go back and generate more synthetic records based on the same original data. You can also choose to generate more data or to download the Quality Assurance report. 

Download your synthetic data as csv or parquet

Step 4 - Check the quality of your synthetic data

Each synthetic data set generated on MOSTLY AI’s synthetic data platform comes with an interactive quality assurance report. If you are new to synthetic data generation or less interested in the data science behind generative AI, simply check if the synthetic data set passed your accuracy expectations. If you would like to dive deeper into the specific distributions and correlations, take a closer look at the interactive dashboards of the QA report.

Synthetic data quality assurance report with detailed, interactive charts on privacy, distributions, correlations and accuracy

How and when to synthesize data in two tables?

Synthesizing data in two separate tables is necessary when your dataset contains temporal information. In simpler terms, to synthesize events, you need to separate your data subjects - the people or entities to whom the events or behavior belong - from the events themselves. For example, credit card transactions or patient journeys are events that need to be handled differently from descriptive subject tables. This so-called time-series or behavioral data needs to go into linked tables.

What are linked tables? 

Now we are getting to the exciting part of synthetic data generation: the ability to unlock the most valuable behavioral datasets, like transaction data, CRM data or patient journeys. Linked tables containing rich behavioral data are where AI-powered synthetic data generators really shine, thanks to their ability to pick up on patterns in massive datasets that would otherwise be invisible to the naked eye of data scientists and BI experts.

These are also among the most sensitive data types, full of extremely valuable (think customer behavior), yet off-limits, personally identifiable, juicy details. Behavioral data is hard to anonymize without destroying its utility. Synthetic behavioral data generation is a great tool for overcoming this so-called privacy-utility trade-off.

How to create linked tables?

The structure of your sample needs to follow the subject table - linked table framework. We already discussed subject tables; the trick here is to make sure that all information about one data subject is contained in one row only. Move static columns to the subject table and model the rest as a linked table.

MOSTLY AI’s algorithm learns statistical patterns distributed in rows, so if you have information that belongs to a single individual across multiple rows, you’ll be creating phantom data subjects. The resulting synthetic data might include phantom behavioral patterns not present in the original data.

The perfect setup for synthetic data generation

Your ideal synthetic data generation setup is one where the subject table’s IDs refer to the events contained in the linked table. The linked table contains several rows that refer to the same subject ID - these are the events that belong to the same individual.

Ideal two-table setup for synthesization
An ideal setup for a two-table synthesization of time-series data

Keeping your subject table and linked table aligned is the most important part of successful synthetic data generation. Include the ID columns in both tables as primary and foreign keys to establish the referential relationship.
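Here is a minimal sketch of what this two-table structure looks like in practice, using pandas. The table and column names (subjects, transactions, id, subjects_id) are purely illustrative, not names required by the platform:

```python
import pandas as pd

# Subject table: one row per individual, static attributes only
subjects = pd.DataFrame({
    "id": [1, 2, 3],
    "gender": ["F", "M", "F"],
    "birth_year": [1982, 1975, 1990],
})

# Linked table: many rows per subject, each row is one event
transactions = pd.DataFrame({
    "subjects_id": [1, 1, 2, 3, 3, 3],  # foreign key -> subjects.id
    "timestamp": pd.to_datetime([
        "2023-01-05", "2023-02-11", "2023-01-20",
        "2023-03-02", "2023-03-04", "2023-03-09",
    ]),
    "amount": [12.5, 80.0, 33.2, 5.9, 47.1, 19.99],
})

# Referential integrity check: every event must point to an existing subject
assert transactions["subjects_id"].isin(subjects["id"]).all()
```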

How to connect subject and linked tables for synthesization?

MOSTLY AI’s synthetic data generation platform offers an easy-to-use, no-code interface where tables can be linked and synthesized. Simply upload your subject table and linked tables.

Uploading subject and linked tables for synthetic data generation

The platform automatically detects primary keys (the id column) and foreign keys (the <subject_table_name>_id column) once the subject table and the linked tables are specified. You can also select these manually. Once you have defined the relationship between your tables, you are ready to launch your synthesization for two tables.

Relationships between tables are automatically detected

Synthetic data types and what you should know about them

The most common data types - numerical, categorical and datetime - are recognized by MOSTLY AI and handled accordingly. Here is what you should know when generating synthetic data from different types of input data.

Synthetic numeric data

Numeric data contains only numbers and is automatically treated as a numeric column. Synthetic numeric data keeps all the variable statistics, such as mean, variance and quantiles. N/A values are handled separately, and their proportion is retained in the synthetic data: MOSTLY AI automatically detects missing values and reproduces their patterns, for example, if the likelihood of N/A changes depending on other variables. N/A values need to be encoded as empty strings.
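If you prepare your CSV with pandas, missing numeric values end up as empty fields by default; a quick sketch with made-up column names for illustration:

```python
import numpy as np
import pandas as pd

subjects = pd.DataFrame({
    "id": [1, 2, 3],
    "income": [52000.0, np.nan, 61000.0],  # one missing value
})

# to_csv writes NaN as an empty string by default; na_rep makes that explicit
subjects.to_csv("subjects.csv", index=False, na_rep="")
# -> the second row's income field is simply left empty: "2,"
```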

Extreme values in numeric data have a high risk of disclosing sensitive information, for example, by exposing the CEO in a payroll database as the highest earner. MOSTLY AI’s built-in privacy mechanism replaces the smallest and largest outliers with the smallest and largest non-outliers to protect the subjects’ privacy.

If the synthetic data generation relies on only a few individuals for minimum and maximum values, the synthetic data can differ in these. One more reason to give the CEO’s salary to as many people in your company as possible is to protect his or her privacy - remember this argument next time equal payment comes up. 🙂 Kidding aside, removing these extreme outliers is a necessary step to protect from membership inference attacks. MOSTLY AI’s synthetic data platform does this automatically, so you don’t have to worry about outliers.
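To illustrate the idea (this is not MOSTLY AI’s exact mechanism, just a rough sketch of the concept), here is what capping extreme values at the nearest non-outliers looks like with percentile-based winsorization in pandas:

```python
import pandas as pd

salaries = pd.Series([38_000, 42_000, 45_000, 51_000, 54_000, 1_200_000])  # the CEO

# Treat anything outside the 5th-95th percentile as an outlier and
# pull it back to the boundary value (winsorization)
low, high = salaries.quantile([0.05, 0.95])
protected = salaries.clip(lower=low, upper=high)

print(protected.max())  # the extreme 1.2M value is gone
```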

Synthetic datetime data type

Columns in datetime format are automatically treated as datetime columns. Just like with synthetic numeric data, extreme datetime values are protected and the distribution of N/A values is preserved. In linked tables, using the ITT (inter-transaction time) encoding improves the accuracy of your synthetic data on the time between events, for example, when synthesizing ecommerce data with order, dispatch and arrival timestamps.
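Inter-transaction time is simply the gap between consecutive events of the same subject. A pandas sketch of what that quantity looks like (column names are illustrative; the actual ITT encoding happens inside the platform):

```python
import pandas as pd

orders = pd.DataFrame({
    "account_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2023-05-01 09:00", "2023-05-01 12:30", "2023-05-03 18:15",
        "2023-05-02 10:00", "2023-05-02 10:45",
    ]),
})

# Inter-transaction time: the time elapsed since the previous event per account
orders = orders.sort_values(["account_id", "timestamp"])
orders["itt"] = orders.groupby("account_id")["timestamp"].diff()
print(orders)
```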

Synthetic categorical data type

Categorical data comes with a fixed number of possible values. For example, marital status, qualifications or gender in a database describing a population of people. Synthetic data retains the probability distribution of the categories, containing only those categories present in the original data. Rare categories are protected independently for each categorical column.

Synthetic location data type

MOSTLY AI’s synthetic data generator can synthesize geolocation coordinates with high accuracy. You need to make sure that latitude and longitude coordinates are in a single field, separated by a comma, like this: 37.311, 173.8998

Synthetic geolocation data vs real data
Comparison of original and synthetic location data
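If your coordinates currently live in two separate columns, a few lines of pandas can merge them into the expected single, comma-separated field (the column names below are hypothetical):

```python
import pandas as pd

places = pd.DataFrame({
    "latitude": [37.311, 48.2082],
    "longitude": [173.8998, 16.3738],
})

# Combine into a single "lat, lon" field, as expected by the generator
places["coordinates"] = (
    places["latitude"].astype(str) + ", " + places["longitude"].astype(str)
)
places = places.drop(columns=["latitude", "longitude"])
print(places)
# coordinates: "37.311, 173.8998", "48.2082, 16.3738"
```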

Synthetic text data type

MOSTLY AI’s synthetic data generator can synthesize up to 1000 character long unstructured texts. The resulting synthetic text is representative of the terms, tokens, their co-occurrence and sentiment of the original. Synthetic text is especially useful for machine learning use cases, such as sentiment analysis and named-entity recognition. You can use it to generate synthetic financial transactions, user feedback or even medical assessments. MOSTLY AI is language agnostic, so you won’t experience biases in synthetic text. 

Synthetic text data vs original text
Original vs synthetic text

You can improve the quality of text columns that contain specific patterns, like email addresses, phone numbers, transaction IDs or social security numbers, by changing them to the character sequence data type.

Configure synthetic data generation model training

You can optimize the synthetic data generation process for accuracy or speed, depending on what you need most. Since the main statistical patterns are learned by the synthetic data generator in the first epochs, you can stop training early by selecting the Speed option if you don’t need to include minority classes and detailed relationships in your synthetic data.

When you optimize for accuracy, the training continues until no further model improvement can be achieved. Optimizing for accuracy is a good idea when you are generating synthetic data for data analytics use cases or outlier detection. If you want to generate synthetic data for software testing, you can optimize for speed, since high statistical accuracy is not an important feature of synthetic test data.

synthetic data generation optimized for speed
When you optimize synthetic data generation for speed, training is stopped once improvements decrease

Synthetic data use cases from no-brainer to genius

The most common synthetic data use cases range from simple ones, like data sharing, to complex ones, like explainable AI, which is part data sharing and part data simulation.

Another entry level synthetic data generation project can be generating realistic synthetic test data, for example for stress testing and for the delight of your off-shore QA teams. As we all know, production data should never, ever see the inside of test environments (right?). However, mock data generators cannot mimic the complexity of production data. Synthetic test data is the perfect solution combining realism and safety in one.

Synthetic data is also one of the most mature privacy-enhancing technologies. If you want to share your data with third parties safely, it’s a good idea to run it through a synthetic data generator first. And the genius part? Synthetic data is set to become a fundamental part of explainable and fair AI, ready to fix human biases embedded in datasets and to provide a data window into the souls of algorithms. 

Expert synthetic data help every step of the way

No matter which use case you decide to tackle first, we are here for you from the first steps to the last and beyond! If you would like to dive deeper into synthetic data generation, feel free to browse and search MOSTLY AI’s Documentation. But most importantly, practice makes perfect, so register your free forever account and launch your first synthetic data generation job now.

 ✅ Data prep checklist for synthetic data generation

1. SPLIT SINGLE SEQUENTIAL DATASETS INTO SUBJECT AND LINKED TABLES

If your raw data includes events and is contained in a single table, you need to split it into a subject table and a linked table: move the sequential data points (the events) into a separate table. Make sure the new table is linked by a foreign key to the primary key in the subject table, so that each individual or entity in the subject table is referenced by the relevant ID in the linked table.

How sequential data is structured also matters. If your events are contained in columns, make sure you model them as rows: each row should describe a separate event.

How to split data into subjects and events
Split data into separate tables - subject and events
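If your events are spread across columns (one column per event type), reshaping them into one-row-per-event is a single melt in pandas. The column names below are hypothetical, a sketch of the reshaping step only:

```python
import pandas as pd

# Wide format: one row per customer, events stored in columns
wide = pd.DataFrame({
    "customer_id": [1, 2],
    "signup_date": ["2023-01-02", "2023-01-05"],
    "first_purchase": ["2023-01-10", "2023-02-01"],
    "churn_date": [None, "2023-04-20"],
})

# Long format: one row per event, ready to be used as a linked table
events = wide.melt(
    id_vars="customer_id", var_name="event_type", value_name="event_date"
).dropna(subset=["event_date"])

events["event_date"] = pd.to_datetime(events["event_date"])
events = events.sort_values(["customer_id", "event_date"])
print(events)
```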

Some examples of typical datasets synthesized for a wide variety of use cases include different types of time-series data, like patient journeys, where a list of medical events is linked to individual patients. Synthetic data in banking is often created from transaction datasets, where the subject table contains accounts and the linked table contains the transactions that belong to those accounts. These are all time-dependent, sequential datasets where chronological order is an important part of the data’s intelligence.

When preparing your datasets for synthesization, always consider the following list of requirements:

Subject table | Linked table
Each row belongs to a different individual | Several rows belong to the same individual
The subject ID (primary key in SQL) must be unique | Each row needs to be linked to one of the unique IDs in the subject table (foreign key in SQL)
Rows should be treated independently | Several rows can be interrelated
Includes only static information | Includes only dynamic information; sequences must be time-ordered if available
The focus is on columns | The focus is on rows and columns

2. MOVE ALL STATIC DATA TO THE SUBJECT TABLE

Check your linked table containing the events. If the linked table contains static information that describes the subject, move that column to the subject table. A good example is a list of page visits, where each page visit is an event that belongs to a certain user. The IP address of a user is the same across different events: it’s static and describes the user, not the event. In this case, the IP_address column needs to be moved to the subject table.

Modelling subject tables for synthetic data generation
The IP address column should be moved to the subject table
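A sketch of that move with pandas, assuming a single page-visits table with a static ip_address column (all names are illustrative):

```python
import pandas as pd

visits = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "ip_address": ["10.0.0.1", "10.0.0.1", "10.0.0.7", "10.0.0.7", "10.0.0.7"],
    "page": ["/home", "/pricing", "/home", "/blog", "/docs"],
    "visited_at": pd.to_datetime([
        "2023-06-01 08:01", "2023-06-01 08:03",
        "2023-06-02 14:20", "2023-06-02 14:25", "2023-06-02 14:40",
    ]),
})

# Static, subject-level attributes go to the subject table (one row per user)
subjects = visits.groupby("user_id", as_index=False)["ip_address"].first()

# The linked table keeps only the dynamic, event-level columns
linked = visits.drop(columns=["ip_address"])
```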

3. CHECK AND CONFIGURE DATA TYPES

The most common data types, numerical, categorical and datetime, are automatically detected by MOSTLY AI’s synthetic data platform. Check that the data types were detected correctly and change the encoding where you have to. If the automatically detected data types don’t match your expectations, double-check the input data: chances are a formatting error is leading detection astray, and you might want to fix it before synthesization. Here are the cases when you should check and manually adjust data types:

Data intelligence is locked up. Machine learning and AI in insurance are hard to scale, and legal obligations make the job of data scientists and actuaries extremely difficult. How can you still innovate under these circumstances? The insurance industry is being disrupted as we speak by agile, fast-moving insurtech start-ups and by new, AI-enabled services from forward-thinking insurance companies.

According to McKinsey, AI will “increase productivity in insurance processes and reduce operational expenses by up to 40% by 2030”. At the same time, old-school insurance companies struggle to make use of the vast troves of data they are sitting on and sooner or later, ambitious insurtechs will turn their B2B products into business to consumer offerings, ready to take sizeable market shares. Although traditional insurance companies’ drive to become data-driven and AI-enabled is strong, organizational challenges are hindering efforts.

Laggards have a lot to lose. To stay competitive, insurance companies need to redesign their data and analytics processes and treat the development of data-centric AI in insurance with urgency.

The bird's-eye view on data in the insurance industry

Data has always been the bread and butter of insurers and data-driven decisions in the industry predate even the appearance of computers. Business-critical metrics have been the guiding force in everything insurance companies do from pricing to risk assessment.

But even today, most of these metrics are hand-crafted with traditional, rule-based tools that lack dynamism, speed and contextual intelligence. The scene is ripe for an AI revolution. 

The insurance industry relies heavily on push sales techniques. Next best products and personalized offers need to be data-driven. For cross-selling and upselling activities to succeed, an uninterrupted data flow across the organization is paramount, even though Life and Non-Life business lines are often completely separated. Missed data opportunities are everywhere and require a commitment to change the status quo.

Interestingly, data sharing is both forbidden and required by law, depending on the line of business and the properties of the data assets in question. Regulations are plentiful and vary across the globe, making it difficult to follow ambitious global strategies and turning compliance into a costly and tedious business. Privacy-enhancing technologies, or PETs for short, can help, and a modern data stack cannot do without them. Again, insurance companies should carefully consider how they build PETs into their data pipelines for maximum effect.

The fragmented, siloed nature of insurance companies' data architecture can benefit hugely from using PETs, like synthetic data generators, enabling cloud adoption, data democratization and the creation of a consolidated data intelligence across the organization.

The insurance data landscape
[Diagram] Structured data (core system, CRM, general ledger/accounting, policy admin system, operational data, regulatory software such as IFRS), semi-structured data (sensor data from telematics, smart meters, wearables, smart home IoT, etc.) and unstructured data (documents, photos of cars or x-rays, satellite images), together with external data sources (credit score data from providers such as Experian or Bloomberg, third-party data providers, open governmental statistics, open APIs), flow through a synthetic data generator into cloud or on-premise data storage. From there, the data feeds four groups of use cases: structured-data use cases (automated underwriting, referral reduction, distribution quality management, pricing optimization, fraud detection, next best offers, product recommendation, claims triage, persona creation, personalization), intelligent automation (RPA, claims automation, explainability, customer service support), data science and analytics (simple EDA, process mining, business process re-engineering, BI dashboards) and data sharing (reinsurance, other business lines, sales network).

Insurtech companies willing to change traditional ways and adopt AI in insurance in earnest have been stealing the show left, right and center. As far back as 2017, the US insurtech company Lemonade announced that it had paid out the fastest claim in the history of the insurance industry - in three seconds.

If insurance companies are to stay competitive, they need to turn their thinking around and start redesigning their business with sophisticated algorithms and data in mind. Instead of programming the data, the data should program the AI and machine learning algorithms. That is the only way to make truly data-driven decisions; everything else is smoke and mirrors. Some are wary of artificial intelligence and would prefer to stick to the old ways. That is simply no longer possible.

The shift in how things get done in the insurance industry is well underway already. Knowing how AI in insurance systems works and what its potential is with best practices, standards and use cases is what will make this transition safe and painless, promising larger profits, better services and frictionless products throughout the insurance market.

What drives you? The three ways AI can have an impact

AI and machine learning can typically help achieve impact in three ways.

1. Increase profits 

Examples include campaign optimization, targeting, next best offer/action, and best rider based on contract and payment history.  According to the AI in insurance 2031 report by Allied Market Research, “the global AI in the insurance industry generated $2.74 billion in 2021, and is anticipated to generate $45.74 billion by 2031.”

2. Decrease costs

Examples of cost reduction include reduced claims cost, which is made up of two elements: the claim amount and the claim survey cost. According to KPMG, “investment in AI in insurance is expected to save auto, property, life and health insurers almost US$1.3 billion while also reducing the time to settle claims and improving customer loyalty.”

3. Increase customer satisfaction

AI-supported customer service and quality assurance AI systems can optimize claim processing and customer experience. Even small tweaks, optimized by AI algorithms can have massive effects. According to an IBM report, “claimants are more satisfied if they receive 80% of the requested compensation after 3 days, than receiving 100% after 3 weeks.”

What stops you? Five data challenges when building AI/ML models in traditional insurance companies

We spoke to insurance data practitioners about their day-to-day challenges regarding data consumption. There is plenty of room for improvement and for new tools making data more accessible, safer and compliant throughout the data product life-cycle. AI in insurance suffers from a host of problems, not entirely unrelated to each other. Here is a list of their five most pressing concerns.

1. Not enough data

Contrary to popular belief, insurance companies aren’t drowning in a sea of customer data. Since insurance companies have far fewer interactions with customers than banks or telcos, there is less data available for making important decisions, with health insurance being the notable exception. Typically, the only time customers interact with their insurance provider is when the contract is signed, and if everything goes well, the company never hears from them again.

External data sources, like credit scoring data, won’t give you a competitive edge either - after all, it’s what all your competitors are looking at too. Also, the less data there is, the more important architecture, data quality and the ability to augment existing data assets become. Hence, investment in data engineering tools and capabilities should be at the top of the agenda for insurance companies. AI in insurance cannot become a reality without meaningful, high-touch data assets.

2. Data assets are fragmented

To make things even more complicated, data sits in different systems. In some countries, insurance companies are prevented by law from merging different datasets or even managing them within the same system. For example, property-related and life-related insurance often has to be kept separate. Data integration is therefore extremely challenging.

Cloud solutions could solve some of these issues. However, due to the sensitive nature of customer data, moving to the cloud is often impossible, and a lot of traditional insurers still keep all their data assets on premises. As a result, a well-designed data architecture is mission-critical for ensuring data consumption. Today, there are often no data integrations in place, and a consolidated view across all the different systems is hard to create.

Access authorizations and curated datasets are also in place, but if you want to access any data that is not part of the usual business intelligence activities, the process is very slow and cumbersome, keeping data scientists away from what they do best: coming up with new insights.

3. Cybersecurity is a growing problem

Very often, cybersecurity is thought of as the task of protecting perimeters from outside attacks. But did you know that 59% of privacy incidents originate with an organization’s own employees? No amount of security training can guarantee that mistakes won’t be made, so the next frontier for cybersecurity efforts should be the data itself.

How can data assets themselves be made less hazardous? Can old-school data masking techniques withstand a privacy attack? New types of attacks are popping up that target data with AI-powered tools, trying to re-identify individuals based on their behavioral patterns.

These AI-based re-identification attacks are yet another reason to ditch data masking and opt for new-age privacy-enhancing technologies instead. Minimizing the amount of production data in use should be another goal added to the already long list facing cybersecurity professionals.

4. Insurance data is regulated to the extreme

Since insurance is a strategically important business for governments, insurance companies are subject to extremely high levels of regulation. While in some instances they are required by law to share their data with competitors, in others they are prevented by law from using certain parts of datasets, such as gender.

It’s a complicated picture with tons of hidden challenges data scientists and actuaries need to be aware of if they want to be compliant. Data stewardship is becoming an increasingly important role in managing data assets and data consumption within and outside the walls of organizations.

5. Data is imbalanced

The class distribution of the data is often imbalanced: rare events such as fraud or churn are represented in only a fraction of the data. This makes it hard for AI and machine learning models to pick up on their patterns and learn to detect them effectively. This issue can be solved in more ways than one, and data rebalancing is something data scientists often do.

Data augmentation is especially important for fraud and churn prediction models, where only a limited number of examples are available. Upsampling minority classes with AI-generated synthetic examples can be a great solution, because the synthesization process results in a more sophisticated, realistic data structure than more rudimentary upsampling methods.
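To make the imbalance concrete, here is a sketch that measures the class ratio and applies the most rudimentary fix, random oversampling with replacement. AI-generated synthetic upsampling replaces this duplication step with newly generated, realistic minority records; the fraud_df and is_fraud names below are hypothetical:

```python
import pandas as pd

# A toy transactions dataframe with a binary is_fraud label (hypothetical)
fraud_df = pd.DataFrame({
    "amount": [12.0, 30.5, 7.9, 999.0, 22.1, 15.4, 1250.0, 18.3],
    "is_fraud": [0, 0, 0, 1, 0, 0, 1, 0],
})

print(fraud_df["is_fraud"].value_counts(normalize=True))  # the class imbalance

# Rudimentary baseline: duplicate minority rows until classes are balanced
minority = fraud_df[fraud_df["is_fraud"] == 1]
majority = fraud_df[fraud_df["is_fraud"] == 0]
upsampled = pd.concat([
    majority,
    minority.sample(len(majority), replace=True, random_state=42),
])
print(upsampled["is_fraud"].value_counts())
```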

The low-hanging AI/ML fruit all insurance companies should implement and how synthetic data can make models perform better

Underwriting automation

Automating underwriting processes is one of the first low-hanging fruits insurance companies reach for when looking for AI and machine learning use cases in insurance. Typically, AI and machine learning systems assist underwriters by providing actionable insights derived from risk predictions performed on various data assets, from third-party data to publicly available datasets. The goal is to increase straight-through processing (STP) rates as much as possible.

Automated underwriting is replacing manual underwriting in ever-increasing numbers across the insurance industry, and whoever wins the race to maximum automation takes the cake. There are plenty of out-of-the-box underwriting solutions promising frictionless robotic process automation, ready to be deployed.

However, no model is ever finished, and the historical data used to train these models can go out of date, leading underwriters astray in their decision making. The data is the soul of the algorithm, and continuous monitoring, updates and validation are needed to prevent model drift and ensure optimal functioning.

The value of 6 AI in insurance use cases
AI in insurance is creating different value through different use cases

Pricing predictions

Actuaries have long been the single and mysterious source of pricing decisions. Prices were the results of massively complicated calculations only a handful of people really understood. Almost like a black box with walls made out of the most extreme performances of human intellect.

The dusty and mysterious world of actuaries is not going away any time soon due to the legal obligations insurance companies need to satisfy. However, AI and machine learning are capable of significantly improving the process with their ability to process much more data than any single team of actuaries ever could. Machine learning models make especially potent pricing models. The more data they have, the better they do.

Fraud, anomaly and account take over prediction

Rule-based fraud detection systems get very complicated very quickly. As a result, fraud detection is heavily expert-driven, costly and hard to maintain. Investigating a single potential fraud case can cost thousands of dollars. AI and machine learning systems can increase the accuracy of fraud detection and reduce the number of false positives, thereby reducing costs and allowing experts to investigate cases more closely.

With constantly evolving fraud methods, it’s especially important to be able to spot unusual patterns and anomalies quickly. Since AI needs plenty of examples to learn from, rare patterns, like fraud need to be upsampled in the training data.

This is how synthetic data generators can improve the performance of fraud and anomaly detection algorithms and increase their accuracy. Upsampled synthetic data records are better than real data for training fraud detection algorithms and are able to improve their performance by as much as 10%, depending on the type of machine learning algorithm used.

Next best offer prediction

Using CRM data, next best offer models support insurance agents in their upselling and cross selling activities. These product recommendations can make or break a customer journey. The accuracy and timing of these personalized recommendations can be significantly increased by predictive machine learning models. Again, data quality is a mission-critical ingredient - the more detailed the data, the richer the intelligence prediction models can derive from it.

To make next-best-action AI/ML models work, customer profiles and attributes are needed in abundance, with granularity and in compliance with data privacy laws. These recommendation systems are powerful profit-generating tools that are fairly easy to implement, provided the quality of the training data is sufficient.

Churn reduction

Even a few percentage points of churn reduction could be worth hundreds of thousands of dollars in revenue. It’s not surprising that prediction models, similar to those used for predicting next best offers or actions, are frequently thrown at the churn problem. However, identifying customers about to lapse on their bills or churn altogether is more challenging than making personalized product recommendations, due to the lack of data.

Low churn rates lead to imbalanced data, where the number of churning customers is too low for machine learning algorithms to pick up on effectively. The censoring of churn events is another issue binary classifiers struggle with, which is why survival regression models are better suited to predicting churn: instead of a yes/no label, the algorithm predicts how long a customer is likely to remain a customer.
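As a rough sketch of the survival approach, the snippet below fits a Cox proportional hazards model with the lifelines library (our choice for illustration, not a tool mentioned in this post). tenure_months is how long each customer has been observed, and churned marks whether the churn event was actually seen or the observation is censored:

```python
import pandas as pd
from lifelines import CoxPHFitter

customers = pd.DataFrame({
    "tenure_months": [3, 14, 24, 7, 36, 18, 5, 29],
    "churned":       [1,  0,  0, 1,  0,  1, 1,  0],  # 0 = still a customer (censored)
    "monthly_premium": [55, 42, 80, 60, 95, 48, 52, 70],
    "num_claims":      [0, 1, 0, 2, 1, 0, 1, 0],
})

cph = CoxPHFitter()
cph.fit(customers, duration_col="tenure_months", event_col="churned")
cph.print_summary()  # hazard ratios per feature

# Predicted median time-to-churn per customer, given their covariates
print(cph.predict_median(customers[["monthly_premium", "num_claims"]]))
```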

A series of churn prevention activities can be introduced for groups nearing their predicted time to event. Again, data is the bottleneck here. Loyalty programs can be great tools not only for improving customer retention in traditional ways, but also for mining useful data points for churn prediction models. 

Process mining

Insurance is riddled with massive, complicated and costly processes. Organizations no longer know why things get done the way they get done and the inertia is hard to resist. The goal of process mining is to streamline processes, reduce cost and time, while increasing customer satisfaction and compliance.

From risk assessments, underwriting automation, claims processing to back office operations, all processes can be improved with the help of machine learning. Due to the multi-faceted nature of process mining, it’s important to align efforts with business goals.

Get hands-on with synthetic training data

The world's most powerful, accurate and privacy-safe synthetic data generator is at your fingertips. Generate your first synthetic dataset now for free straight from your browser. No credit card or coding required.

The four most inspiring AI use cases in insurance with examples

Startups are full of ideas and are not afraid to turn them into reality. Of course, it’s much easier to do something revolutionary from the ground up than to tweak an old structure into something shiny and new. Still, there are tips and tricks ready to be picked up from young, ambitious insurtech companies. “Move fast and break things” should not be one of them though, at least not in the insurance industry.

1. Risk assessment with synthetic geospatial imagery

How about going virtual with remote risk assessments? Today’s computer vision technology certainly makes it possible, with visual object detection doing the trick. These models can assess the risks associated with a property just by looking at it - they recognize a pool, a rooftop or a courtyard, and they have a good idea of the size of a property and its location.

To train these computer vision programs and to increase their speed and accuracy, synthetic images are used. AI-powered, touchless damage inspections are also on offer as ready-to-buy third-party AI solutions for car insurers.

2. Detect fraud and offer personalized care to members

Anthem Inc. teamed up with Google Cloud to create a synthetic data platform that allows them to detect fraud and offer personalized care to members. This way, medical histories and healthcare claims can be accessed without privacy issues and used for validating and training AI designed to spot financial or health-related anomalies - signs of fraud or undetected health issues that could benefit from early intervention.

Open data platforms, driven by synthetic data technology, are becoming more and more common. Humana, a large North American health insurance provider, published synthetic medical records for research and development while keeping the privacy of their customers intact.

3. AI-supported customer service

Natural language processing is one of the areas of artificial intelligence that has experienced a massive boom in recent years. Transcripts of calls are a treasure trove of intelligence, allowing insurance companies to detect unhappy customers with sentiment analysis, prevent churn with pre-emptive actions and reduce costs in the long run.

By monitoring calls for long pauses, companies can identify customer service reps who might be in need of further training to improve customer experience. Customer service reps can also receive AI-generated help in the form of automatically created summaries of customer histories with the most important, likely issues flagged for attention.

Using transcripts containing sensitive information to train AI systems can create privacy problems; AI-generated synthetic text can replace the original transcripts for training purposes. Conversational AI also needs plenty of meaningful training data, otherwise you’ll end up with chatbots destroying your reputation faster than you can build it back up.

The six most important synthetic data use cases in insurance

When we talk to insurance companies about AI, the question of synthetic data’s value comes up quickly. Making sure that companies get the highest value out of synthetic data is crucial. However, not all use cases come with the same value. When it comes to analytics, for example, the highest-value use cases also come with the highest need for synthetic data assets.

Synthetic data value chart in analytics

AI-generated synthetic data is an essential part of the analytics landscape, but its usefulness doesn’t end there. Here are the most important synthetic data use cases in insurance, ready to be leveraged.

Data augmentation for AI and machine learning development

The power of generative AI is evident when we see AI-generated art and synthetic images. The same capabilities are available for tabular data, adding creativity and flexibility to data generation. Unlike other data transformation tools, AI-powered ones are capable of handling datasets as a whole, with relationships between the data points and the intelligence of the data kept intact.

Examples of these generative data augmentation capabilities for tabular data include rebalancing, imputation and the use of different generation moods, from conservative to creative. To put it simply, MOSTLY AI’s synthetic data generator can be used to design data, not only to generate it. Machine learning models trained on synthetic data can benefit from entirely new realities, synthesized in accordance with predefined goals, such as fairness or diversity.

Data privacy

Synthetic data generators were first built to overcome regulatory limitations imposed on data sharing. By now, synthetic data generators are a full-fledged privacy-enhancing technology with robust, commercially available solutions and somewhat less reliable open source alternatives. Synthetic data is a powerful tool to automate privacy and increase data access across organizations, however, not all synthetic data is created equal. Choose the most robust solution offering automated privacy checks and a support team experienced with insurance data.

Explainable AI

AI is not something you can build and forget about. Constant performance evaluations and retraining are needed to prevent model drift and validate decisions. Insurance companies are already creating synthetic datasets to retrain models and improve their performance. With AI regulations soon coming into effect, providing explainability for AI models in production will become a necessary exercise too.

Modern AI and machine learning models work with millions of model parameters that are impossible to understand unless observed in action. Transparent algorithmic decisions need shareable, open data assets that allow the systematic exploration of model behavior.

Synthetic data is a vital tool for AI explainability and local interpretability. Providing a window into the souls of algorithms, data-based explanations and experiments will become standard parts of the AI lifecycle. AI in insurance is likely to get frequent regulatory oversight, which makes synthetic training data a mission-critical asset.

Data sharing

Since insurance companies tend to operate in a larger ecosystem made up of an intricate sales network, individual agents, travel agencies, service providers and reinsurers, effective, easy and compliant data sharing is a mission-critical tool to have at hand.

At MOSTLY AI, we have seen large insurance companies fly data scientists to other countries just to access data in a compliant way. Life and non-life business lines are strictly separated, making cross-selling and intelligence sharing impossible. Cloud adoption is the way to go; however, it’s still a distant prospect for a lot of traditional insurance companies whose hands are tied by regulatory compliance. Insurance companies should consider using synthetic data for data sharing.

Since AI-generated synthetic data retains the intelligence contained within the original data it was modelled on, it provides a perfect proxy for data-driven collaboration and other data sharing activities. Synthetic data can be safely and compliantly uploaded to cloud servers, giving a much-needed boost to teams looking to access high-quality fuel for their AI and machine learning projects.

Humana, one of the largest North American health insurance providers, created a Synthetic Data Exchange to facilitate the development of new products and solutions by independent developers.

IoT driven AI

Smart home devices are the most obvious example of how sensor data can be leveraged for predictive analytics to predict, detect and even prevent insurance events. Home insurance products paired with smart home electronics can offer a powerful combination with competitive pricing and tons of possibilities for cost reduction.

Telematics can and most likely will revolutionize the way cars are insured, alerting drivers and companies to potentially dangerous driving behaviors and situations. Health data from smartwatches can pave the way for early interventions and prevent health issues in time. These are win-win situations for providers and policyholders, as long as data privacy can be guaranteed.

Software testing

Insurance companies develop and maintain dozens, if not hundreds, of apps serving customers, offering loyalty programs, onboarding agents and keeping complex marketing funnels flowing with contracts. Bad test data can lead to serious flaws; however, production data is off-limits to test engineers, both in-house and off-shore. Realistic, production-like synthetic test data can fill the gap and provide a quick, high-quality alternative to manually generated mock data and to unsafe production data.

How to make AI happen in insurance companies in five steps

1. Build lighthouse projects

In order to convince decision makers and get buy-in from multiple stakeholders, it’s a good idea to build lighthouse projects, designed to showcase what machine learning and AI in insurance is capable of.

2. Synthesize data proactively

Take data democratization seriously. Instead of siloing data projects and data pools and thinking in departments, synthesize core data assets proactively and let data scientists and machine learning engineers pull synthetic datasets on demand from a pre-prepared platform. For example, synthetic customer data is a treasure trove of business intelligence, full of insights that could be relevant to all teams and use cases.

3. Create a dedicated interdepartmental data science role

In large organizations, it’s especially difficult to fight the inertia of massively complicated systems and departmental information silos. Creating a role for AI enablement is a crucial step in the process. Assigning responsibility for collecting, creating and serving data assets, and for coordinating, facilitating and driving machine learning projects across teams from zero to production, is a mission-critical piece of the organizational puzzle.

4. Create the data you need

Market disruptors are not afraid to think creatively about their data needs, and traditional insurance companies need to step up their game to keep pace. Creating loyalty programs to learn more about your customers is a logical first step, but alternative data sources can come in many forms.

Telematics for increasing driver safety, or health data from smart devices used for prevention and for designing early intervention processes, could all come into play. Improving the quality of your existing data assets and enriching them with public data is also possible by using synthetic data generation to upsample minority groups or to synthesize sensitive data assets without losing data utility. Think of data as malleable, flexible modelling clay: the material for building AI and machine learning systems across and even outside the organization.

5. Hire the right people for the right job

The hype around AI development is very real, resulting in a job market saturated with all levels of expertise. Nowadays, anyone who has seen a linear regression model up close might claim to be an experienced AI and machine learning engineer. To hire the best, most capable talent, ready to venture into advanced territories, you need robust processes in place, with tech interviews designed with future tasks in mind. Check out this excellent AI engineer hiring guide for hands-on tips and guidance on hiring best practices. AI in insurance needs not only great engineers, but also a high level of domain knowledge.


An overview of synthetic data generation methods

Not all synthetic data is created equal, and synthetic data generation methods today are very different from what they were five years ago. Let’s take a look at different methods of synthetic data generation, from the most rudimentary forms to the state of the art, to see how far the technology has advanced! In this post we will distinguish between three major methods: the stochastic process, rule-based data generation and AI-generated synthetic data.

Comparison of synthetic data types

Which synthetic data generation method should you choose? Evaluation metrics 101

The choice of method depends on the use case and should be evaluated - if possible - both by an expert on data synthesis and by a domain expert who is familiar with the data and its downstream usage. In addition to use-case-specific criteria, several general aspects can be used to evaluate and compare the different synthetic data generation methods available.

The stochastic process: when form matters more than content

If the structure of the desired synthetic data is known, and the data distribution is irrelevant - when random noise is all you need - the stochastic process is a perfect synthetic data generation method.

An example would be where the synthetic dataset should take the form of a CSV file with a specific number of columns and rows. A random number generator can be used to fill the fields following a defined distribution.
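A minimal sketch of such a stochastic generator in Python, filling a CSV with values drawn from defined distributions (the column names and distributions are arbitrary, chosen only for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n_rows = 10_000

random_data = pd.DataFrame({
    "customer_id": np.arange(1, n_rows + 1),
    "age": rng.integers(18, 90, size=n_rows),                      # uniform integers
    "balance": rng.normal(loc=5_000, scale=1_500, size=n_rows).round(2),
    "segment": rng.choice(["A", "B", "C"], size=n_rows, p=[0.5, 0.3, 0.2]),
})

random_data.to_csv("stress_test_data.csv", index=False)
```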

The applicability of such a process is limited to cases where the content of the synthetic data is irrelevant and random noise is good enough in place of real data. Examples of such applications would be stress testing systems, where a huge amount of random data is generated on the fly to evaluate how systems behave under heavy use.

Rule-based synthetic data generation: the human-powered machine

The obvious downside of synthetic data generation methods using stochastic processes is their limited range of use cases, since the resulting data is random and contains no real information. Rule-based synthetic data generation methods improve on this by generating data that follows specific rules defined by humans.

The complexity of those rules can vary from very simple, taking only the desired data type of a column into account (i.e. whether a column contains numeric, categorical or text data), to more sophisticated rules that define relationships between various columns and events. The amount of human labor and expertise needed, as well as the information contained in the generated data, therefore depend completely on the defined rules.
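A toy rule-based generator might look like the sketch below: every column and every cross-column relationship has to be written out by hand, which is exactly where the human labor goes. All rules here are invented for illustration:

```python
import random

SEGMENT_RULES = {
    "student": {"age_range": (18, 26), "income_range": (0, 15_000)},
    "professional": {"age_range": (27, 64), "income_range": (30_000, 120_000)},
    "retired": {"age_range": (65, 90), "income_range": (10_000, 40_000)},
}

def generate_row() -> dict:
    """Generate one record following hand-written rules."""
    segment = random.choice(list(SEGMENT_RULES))
    rules = SEGMENT_RULES[segment]
    age = random.randint(*rules["age_range"])
    income = random.randint(*rules["income_range"])
    # A cross-column rule: the premium product is only offered above an income threshold
    product = "premium" if income > 80_000 else "basic"
    return {"segment": segment, "age": age, "income": income, "product": product}

rows = [generate_row() for _ in range(1_000)]
```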

Thus, rule-based synthetic data generation methods come with three additional challenges:

Coping with these challenges can be very difficult, and in many cases they prove to be deal-breakers. Scalability and drift, in particular, prevent rule-based systems from being used in applications that require flexibility and support for changing data requirements, effectively limiting their applicability to use cases where the scope and the data requirements are exactly known and will not change. But if these challenges are successfully met, a rule-based system can be a good enough choice for testing applications, ranging from the generation of tabular data to multimedia content.

However, in any case, no additional information can be extracted from rule-based synthetic data beyond what was already known beforehand and manually encoded into the rules. Thus, these datasets offer no value for analytics, decision support or training machine learning models.

Several web-based tools exist where one can manually define the structure and simple rules to generate tabular data. These kinds of synthetic data generation methods can then be used for testing purposes in software development or integration tests, ranging from the most typical scenarios to specific edge cases.

AI-generated synthetic data: learning by example

Generative AI has revolutionized many things, synthetic data generation methods being one of the prime examples. Synthetic data generation methods using generative algorithms replace code with data. The rules of rule-based synthetic data generation are inherently contained in data samples, upon which AI-powered synthetic data generators are trained. Generative AI models are a class of statistical models that learn the distribution of training data and can be used to generate new data following that distribution.

By applying generative models from machine learning, it is possible to train a model (e.g. an artificial neural network) on real data so that it learns the structure and the information contained in it and is able to generate new synthetic data.
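The learn-then-sample principle can be illustrated with a deliberately simple generative model. The sketch below fits a Gaussian mixture to two numeric columns and samples new, artificial rows from it; real AI-powered generators use far more capable models (deep neural networks) and handle mixed data types, but the fit/sample workflow is the same:

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

# "Real" training data: two correlated numeric columns (toy example)
rng = np.random.default_rng(0)
age = rng.normal(45, 12, size=2_000)
income = age * 900 + rng.normal(0, 8_000, size=2_000)
real = pd.DataFrame({"age": age, "income": income})

# Learn the joint distribution ...
model = GaussianMixture(n_components=5, random_state=0).fit(real)

# ... and sample brand-new synthetic rows from it
samples, _ = model.sample(n_samples=1_000)
synthetic = pd.DataFrame(samples, columns=real.columns)

print(real.corr(), synthetic.corr(), sep="\n")  # correlations are preserved
```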

Synthetic data generators can be open source, like MIT's Synthetic Data Vault, or proprietary, like MOSTLY AI's synthetic data platform. When comparing synthetic data quality, MOSTLY AI's robust commercial solution outperformed SDV in research conducted by the Joint Research Centre of the European Commission.

The human guidance needed by such a system can be minimal. In the best case, no human interaction is needed and the machine learning model is trained automatically.

The complexity of the data that can be learned by such a model is primarily limited by the data available and the model capacity (i.e. the model architecture and hyperparameters). If the data requirements change, no significant adjustments are needed; a new model simply needs to be trained on the current data.

Due to the power of machine learning models mimicking the training data, three new challenges unique to this synthetic data generation method have to be addressed:

Once these challenges have been met, the applications of AI-powered synthetic data generation methods are almost limitless, and even go beyond what is possible with real data. Two unique opportunities arise with the use of generative models: one is the use of synthetic data in place of original data that cannot be accessed for legal and privacy reasons, and the second is the use of synthetic data within a company to reduce the development time of machine learning models.

One such example, where synthetic data is playing a key role in unlocking original data protected for privacy reasons, is finance. Here, synthetic data is used, for example, to improve fraud detection, since it contains the statistical properties necessary for improving fraud detection systems without exposing the privacy of individuals.

Sharing data across departments and even country borders becomes a seamless process when using a high quality, accurate and privacy compliant deep generative tool, like our very own MOSTLY AI synthetic data platform. As shown in our benchmarking study, MOSTLY AI is the world’s most accurate deep generative tool, which makes the most of all deep generative model advantages, such as the highest levels of statistical utility.

Another example is the use of synthetic data by data science, machine learning and business intelligence units. In most working environments, data access is strictly regulated resulting in time-consuming processes. Working with synthetic data instead of the original makes it possible to build models much faster and to reduce model-to-market time.

We have shown in a recent study that models trained on synthetic data achieve comparable results, and in some cases even outperform models trained on original data. AI model development is increasingly relying on synthetic training data due to the possibilities of data augmentation during the synthesization process, turning data into modelling clay for data scientists and machine learning engineers.

Test data generation can also massively benefit from the power of AI. Synthetic test data generators can pick up on business rules encoded in the production data and automatically recreate them. The resulting synthetic test data is highly realistic and covers many more business cases than manual data generation ever could.

For a more in-depth introduction to generative models, take a look at this blog post by OpenAI or Stanford’s free course on the subject.

Rundown

Here is a summary of the synthetic data generation methods compared and their performance on the metrics used to evaluate them.

Comparison of synthetic data types
Comparison of synthetic data generation methods

How to choose the right synthetic data generation method?

After having discussed the capabilities and challenges of the various synthetic data generation methods, how do you decide which one best matches the requirements and use cases at hand? Two simple questions should guide your decision: does the synthetic data have to be realistic and representative of real data, and is in-house development an option?

Solutions making use of stochastic processes and rule-based systems are highly dependent on their use case and almost always require the development of new proprietary software. Libraries like cayenne or PyRATA can support such efforts, but they require expertise, resources and the will to maintain them.

If the synthesized data has to be realistic, stochastic processes are out of the question and rule-based systems only make sense if it is clear what the data should look like and that description can be written in code.

If in-house development is not an option and the synthetic data has to be as realistic and representative as possible, the use of ML-enabled systems as a service is the best course of action.

How is MOSTLY AI making the most out of generative models?

With MOSTLY AI's synthetic data platform, we address the unique challenges that come with generative models. Our synthetic data generation platform comes with built-in privacy safeguards, preventing overfitting and eliminating the risk of re-identification. Synthetic data is exempt from data privacy regulations, freeing up your previously untouched data assets for sharing and utilization. What’s more, generated synthetic datasets come with automated quality assurance reports, which make assessing their quality quick and painless.

We at MOSTLY AI are proud to serve customers worldwide. Our clients use synthetic data for finance, insurance and telecommunications use cases. Curious? Head over to the browser version of our product and generate synthetic data yourself for free forever, up to 100K rows a day! Hopefully, this post has been useful to you and provided you with a better understanding of how synthetic data generation has evolved from simple stochastic processes to sophisticated deep generative models. Feel free to reach out and contact us, either with feedback, questions, or any other concerns.

The agile and DevOps transformation of software testing has been accelerating since the pandemic and there is no slowing down. Applications need to be tested faster and earlier in the software development lifecycle, while customer experience is a rising priority. However, good quality, production-like test data is still hard to come by. Up to 50% of the average tester‘s time is spent waiting for test data, looking for it, or creating it by hand. Test data challenges plague companies of all sizes, from smaller organizations to enterprises.

What is true in most fields also applies in software testing: AI will revolutionize testing. AI-powered testing tools will improve quality, velocity, productivity and security. In a 2021 report, 75% of QA experts said that they plan to use AI to generate test environments and test data. Saving time and money is already possible with readily available tools like synthetic test data generators. According to Gartner, 20% of all test data will be synthetically generated by 2025. And the tools are already here. But let's start at the beginning.

What is test data?

The definition of test data depends on the type of test. Application testing is made up of lots of different parts, many of which require test data. Unit tests are the first to take place in the software development process; test data for unit tests consists of simple, typically small samples. However, realism might already be an important test data quality at this stage. Performance testing or load testing requires large batches of test data. Whichever stage we talk about, one thing is for sure: production data is not test data. Production data should never be in test environments. Data masking, randomization, and other common techniques do not anonymize data adequately. Mock data and AI-generated synthetic data are privacy-safe options. The type of test should determine which test data generation method is used.

What is a synthetic test data generator? 

Synthetic test data is an essential part of the software testing process. Mobile banking apps, insurance software, retail and service providers all need meaningful, production-like test data for high-quality QA. There is confusion around the term synthetic data, with many still thinking of synthetic data as mock or fake data. While mock data generators are still useful in unit tests, their usefulness is limited elsewhere. Similar to mock data generators, AI-powered synthetic test data generators are available online, in the cloud or on premise, depending on the use case. However, the quality of the resulting synthetic data varies widely. Use a synthetic test data generator that is truly AI-powered, retains the data structures and referential integrity of the sample database, and has additional, built-in privacy checks.

Accelerate your testing with synthetic test data

Get hands-on with MOSTLY AI's AI-powered platform and generate your first synthetic data set!

What is synthetic test data? The definition of AI-generated synthetic test data (TL;DR: it's NOT mock data)

Synthetic test data is generated by AI that is trained on real data. It is structurally representative data with referential integrity and support for relational structures. AI-generated synthetic data is not mock data or fake data; it is as much a representation of your customers’ behavior as production data. It’s not generated manually, but by a powerful AI engine that is capable of learning all the qualities of the dataset it is trained on, providing 100% test coverage. A good quality synthetic data generator can automate test data generation with high efficiency and without privacy concerns. Customer data should always be used in its synthetic form to protect privacy and to retain the business rules embedded in the data. For example, mobile banking apps should be tested with synthetic transaction data based on real customer transactions.

Test data types, challenges and their synthetic test data solutions

Synthetic data generation can be useful in all kinds of tests and provide a wide variety of test data. Here is an overview of different test data types, their applications, main challenges of data generation and how synthetic data generation can help create test data with the desired qualities.

Valid test data - the combination of all possible inputs.
Application: integration, interface, system and regression testing.
Challenge: it’s challenging to cover all scenarios with manual data generation, and maintaining test data is also extremely hard.
Solution: generate synthetic data based on production data.

Invalid (erroneous) test data - data that cannot and should not be processed by the software.
Application: unit, integration, interface, system and security testing.
Challenge: it is not always easy to identify error conditions to test because you don't know them a priori. Access to production errors is necessary but will also not yield previously unknown error scenarios.
Solution: create more diverse test cases with synthetic data based on production data.

Huge test data - large-volume test data for load and stress testing.
Application: performance testing, stress testing.
Challenge: lack of sufficiently large and varied batches of data. Simply multiplying production data does not simulate all the components of the architecture correctly. Recreating real user scenarios with the right timing and general temporal distribution with manual scripts is hard.
Solution: upsample production data via synthesization.

Boundary test data - data that is at the upper or lower limits of expectations.
Application: reliability testing.
Challenge: lack of sufficiently extreme data. It’s impossible to know the difference between unlikely and impossible for values not defined within lower and upper limits, such as prices or transaction amounts.
Solution: generate synthetic data in creative mode or use contextual generation.
Test data types and their synthetic data solutions

How to generate synthetic test data using AI

Generate synthetic data for testing using a purpose-built, AI-powered synthetic data platform. Some teams opt to build their own synthetic data generators in-house, only to realize that the complexity of the job is way bigger than what they signed up for. MOSTLY AI’s synthetic test data generator offers a free forever option for those teams looking to introduce synthetic data into their test data strategy.

This online test data generator is extremely simple to use:

  1. Connect your source database
  2. Define the tables where you want to protect privacy
  3. Start the synthesization
  4. Save the synthetic data to your target database

The result is structurally representative data with referential integrity and support for relational structures. Knowing how to generate synthetic data starts with some basic data preparation. Fear not: it's easy and straightforward, and once you understand the principles it will be a breeze. If you prefer to automate the process end to end, the sketch below shows what the same four steps could look like in a script.
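
The following is a minimal sketch of scripting those four steps against a hypothetical REST API. The base URL, endpoint paths, payload fields and authentication header are all illustrative assumptions, not MOSTLY AI's actual API; treat it as a template for whatever interface your synthetic data platform exposes.

```python
# Hypothetical automation of the four-step workflow: connect a source
# database, pick the tables, start the synthesization, deliver the result.
# All endpoints and field names below are assumptions for illustration.
import requests

BASE_URL = "https://synthetic-data-platform.example.com/api"  # hypothetical
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}            # hypothetical auth

# 1. Connect your source database
source = requests.post(f"{BASE_URL}/connectors", headers=HEADERS, json={
    "type": "postgresql",
    "host": "prod-db.internal",
    "database": "banking",
}).json()

# 2. Define the tables where you want to protect privacy
job_config = {
    "source_connector_id": source["id"],
    "tables": ["customers", "accounts", "transactions"],
}

# 3. Start the synthesization
job = requests.post(f"{BASE_URL}/jobs", headers=HEADERS, json=job_config).json()

# 4. Save the synthetic data to your target database
requests.post(f"{BASE_URL}/jobs/{job['id']}/deliver", headers=HEADERS,
              json={"target_connector": "test-db.internal"})
```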

Do you need a synthetic test data generator?

If you are a company building a modern data stack to work with data that contains PII (personally identifiable information), you need a high-quality synthetic data generator. Why? Because AI-generated synthetic test data is a different beast: generating a few standalone tables won't cut it. To keep referential integrity, MOSTLY AI can connect to the most popular databases and synthesize directly from your database. If you are operating in the cloud, testing with synthetic data makes even more sense for security reasons.

Synthetic test data advantages

Synthetic data is smarter

Thanks to the powerful learning capabilities of AI, synthetic data offers better test coverage, resulting in fewer bugs and higher reliability. You'll be able to test with realistic customer stories and improve customer experience with unprecedented accuracy. High-quality synthetic test data is mission-critical for the development of cutting-edge digital products.

Synthetic data is faster

Accelerated data provisioning is a must-have for agile software development. Instead of tediously building a dataset manually, you can let AI do the heavy lifting for you in a fraction of the time.

Synthetic data is safer

Built-in privacy mechanisms prevent privacy leaks and protect your customers in the most vulnerable phases of development. Radioactive production data should never be in test environments in the first place, no matter how secure you think they are. Legacy anonymization techniques fail to provide privacy, so staying away from data masking and other simplistic techniques is a must.

Synthetic data is flexible

Synthesization is a process that can change the size of the data to match your needs. Upscale for performance testing or subset for a smaller, but referentially correct dataset.
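
As a quick illustration of the subsetting half of that flexibility, here is a minimal sketch in pandas of how a smaller but referentially correct dataset can be carved out of a larger synthetic one. The file, table and column names are illustrative assumptions; a real upscale, by contrast, requires the generative model itself rather than simple row duplication.

```python
# Referentially consistent subsetting: sample the parent table, then keep
# only the child rows whose foreign keys point at the sampled parents.
# File and column names are illustrative assumptions.
import pandas as pd

customers = pd.read_csv("synthetic_customers.csv")        # parent table
transactions = pd.read_csv("synthetic_transactions.csv")  # child table with customer_id

# Keep 10% of customers, then only their transactions, so every
# foreign key in the subset still resolves.
customers_small = customers.sample(frac=0.10, random_state=42)
transactions_small = transactions[
    transactions["customer_id"].isin(customers_small["customer_id"])
]

customers_small.to_csv("subset_customers.csv", index=False)
transactions_small.to_csv("subset_transactions.csv", index=False)
```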

Synthetic data is low-touch

Data provisioning can be easily automated by using MOSTLY AI’s Data Catalog function. You can save your settings and reuse them, allowing your team to generate data on-demand. 

What’s wrong with how test data has been generated so far?

A lot of things. Test data management is riddled with costly bad habits. Quality, speed and productivity suffer unnecessarily if your team does any or all of the following:

1.) Using production data in testing  

Just take a copy of production and pray that the insecure test environment won't leak any of it. It's more common than you'd think, but that doesn't make it OK. It's only a matter of time before something goes wrong and your company finds itself punished by customers and regulators for mishandling data. What's more, actual data doesn't cover all possible test cases, and it's difficult to test new features with data that doesn't yet exist.

2.) Using legacy anonymization techniques

Contrary to popular belief, adding noise to the data, masking it or scrambling it doesn't make it anonymous. These legacy anonymization techniques have been shown time and again to endanger privacy and destroy data utility at the same time. Anonymizing time-series, behavioral datasets, like transaction data, is notoriously difficult. Pro tip: don't even try; synthesize your data instead. In smaller companies, developers often have unrestricted access to production data, which is extremely dangerous. According to Gartner, 59% of privacy incidents originate with an organization's own employees. It may not be malicious, but the result is just as bad.

3.) Generating fake data

Another very common approach is to use fake data generators like Mockito or Mockaroo. While there are some test cases, like a brand-new feature, where fake data is the only option, it comes with serious limitations. Building datasets out of thin air costs companies a lot of time and money. Using scripts built in-house or mock data generation tools takes a lot of manual work, and the result is far from sophisticated. It's cumbersome to recreate all the business rules of production data by hand, while AI-powered synthetic data generators learn and retain them automatically. What's more, individual data points might be semantically correct, but there is no real "information" in fake data; it's just random data after all. The biggest problem with generating fake data is the maintenance cost. You can start testing a new application with fake data, but keeping it up to date will be a challenge: real data changes and evolves, while your mock test data quickly becomes legacy.

4.) Using fake customers or users to generate test data

If you have an army of testers, you could make them behave like end users and create production-like data through those interactions. It takes time and a lot of effort, but it could work if you are willing to throw enough resources at it. Similarly, your employees could become these testers; however, test coverage will be limited and outside your control. If you need a quick solution for a small app, it could be worth a try, but protecting your employees' privacy is still important.

5.) Canary releases for performance and regression tests

Some app developers push a release to a small subset of their users first and consider performance and regression testing done. While canary testing can save you time and money in the short run, in the long term your user base might not appreciate the bugs they encounter in your live app. What's more, there is no guarantee that all issues will be detected.


It's time to develop healthy test data habits! AI-generated synthetic test data is based on production data. As a result, the data structure is 100% correct and it's really easy to generate on demand. What's more, you can create even more varied data than production data offers, covering unseen test cases. If you choose a mature synthetic data platform like MOSTLY AI's, built-in privacy mechanisms will guarantee safety. The only downside to keep in mind is that for new features you'll still have to create mock data, since AI-generated synthetic data needs to be based on already existing data.

Test data management in different types of testing

Data types for different phases of software testing

FUNCTIONAL TESTING

Unit testing

Unit tests exercise the smallest units of code, often a single function. Mock data generators are often used for unit tests. However, AI-generated synthetic data can also work if you take a small subset of the original production datasets.

Integration testing

The next step in development is integrating the smallest units, with the goal of exposing defects in their interactions. Integration testing typically takes place in environments populated with meaningful, production-like test data. AI-generated synthetic data is the best choice, since the relationships of the original data are kept without any privacy-sensitive information.

Interface testing

The application's UI needs to be tested through all possible customer interactions and customer data variations. Synthetic customers can provide the necessary realism for UI testing and expose issues that dummy data couldn't.

System testing

System testing examines the entire application with different sets of inputs, and connected systems are also tested in this phase. As a result, realistic data can be mission-critical to success. Since data leaks are most likely to occur in this phase, synthetic data is highly recommended for system testing.

Regression testing

Adding a new component could break old ones. Automated regression tests are the way forward for those who want to stay on top of issues. Maximum test data coverage is desirable, and AI-generated synthetic data offers just that.

User acceptance testing

In this phase, end users test the software in alpha and beta tests. Contract and regulatory testing also fall under this category; here, suppliers, vendors or regulators test the application. Demo data is frequently used at this stage. Populating the app with hyper-realistic synthetic data can make the product come to life, increasing the likelihood of acceptance.

NON-FUNCTIONAL TESTING

Documentation testing

The documentation detailing how to use the product needs to match how the app works in reality. The documentation of data-intensive applications often contains dataset examples, for which synthetic data is a great choice.

Installation testing

The last phase of testing before the end user takes over: installation testing verifies that the installation process itself works as expected.

Performance testing

Performance testing examines how a product behaves in terms of speed and reliability. The data used in performance testing must be very close to the original. That's why a lot of test engineers use production data treated with legacy anonymization techniques. However, these old-school technologies, like data masking and generalization, destroy insights and provide only weak privacy.

Security testing

Security testing's goal is to find risks and vulnerabilities. The test data used in security testing needs to cover authentication data, such as usernames and passwords, as well as databases and file structures. Using high-quality AI-generated synthetic data is especially important in security testing.

Test data checklist

  1. Adopt an engineering mindset to test data across your team.
  2. Automate test data generation as much as you can (see the sketch after this checklist).
  3. Develop a test data as a service mindset and provide on-demand access to privacy safe synthetic data sandboxes.
  4. Use meaningful, AI-generated smart synthetic test data whenever you can.
  5. Get management buy-in for modern privacy enhancing technologies (PETs) like synthetic data.
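
As a small example of what automating test data provisioning can look like, here is a minimal pytest sketch that loads a synthetic dataset once per test session and runs a downstream check against it. The file name, column and threshold are illustrative assumptions, not part of any specific tool.

```python
# Minimal sketch: wiring synthetic test data into an automated test suite.
# The CSV file, column name and threshold are illustrative assumptions.
import pandas as pd
import pytest

@pytest.fixture(scope="session")
def synthetic_transactions():
    # Load the synthetic dataset once per test session; in practice this
    # could pull the latest generated data from wherever your pipeline
    # publishes it.
    return pd.read_csv("synthetic_transactions.csv")

def test_transaction_amounts_within_expected_range(synthetic_transactions):
    # Example downstream check that runs against synthetic, not production, data.
    assert (synthetic_transactions["amount"].abs() < 1_000_000).all()
```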

Health insurance companies have long been on the front lines of data-driven decision making. From advanced analytics to AI and machine learning use cases in insurance applications, data is everywhere. Increasing the accessibility of these most valuable datasets is a business-critical mission that is ripe for a synthetic data revolution. The Humana synthetic data sandbox is a prime example of how the development of data-centric products, such as those using AI and machine learning, can be accelerated.

Humana, the third largest health insurance provider in the U.S., published a synthetic data exchange platform. The aim is to unleash new insights and bring advanced products to the market. The data exchange offers access to synthetic patient data, with a total of 1,500,000 synthetic records that are representative of Humana's member population.

The data challenge in health insurance

Data sharing in a highly regulated and sensitive environment is a hard, slow and often painful process. Legal and regulatory pressures make it difficult to collaborate efficiently with external vendors. What's more, health insurance providers want to be good stewards of the sensitive data their patients entrust to them. The Humana synthetic data exchange allows product developers to test and learn faster and to deliver better value to its members, all while keeping members' personal healthcare information perfectly safe.

Healthcare data platforms are not only benefiting health insurance companies, but are also used to accelerate research and policy making across the world.

Our favorite synthetic data innovation hub

To overcome these challenges, Humana set up a synthetic data sandbox. These granular, high-quality synthetic datasets preserve the relationships between different variables of interest as well as the important context in which patient care takes place. Developers and data scientists can see where the care journey has taken a synthetic individual and how they interacted with different sites of care, keeping the specificity of the data without anyone being identifiable.

A synthetic data dashboard gives instant access to the data. Sample datasets can be downloaded for schema and data quality exploration, and a comprehensive data dictionary serves as documentation. The synthetic datasets provide data on demographics and coverage details, medical and pharmacy claims, dates, diagnoses and sites of care, with correlations and relationships maintained throughout. This is exactly how a synthetic data sandbox should be.

The advantages of a synthetic data sandbox

By offering easy and safe access to high-quality synthetic data, Humana gives developers the most important ingredient for successful product development. This granular, as-good-as-real source of knowledge is invaluable for identifying cohorts and improving customer experience. Proof of value is also much easier to come by: a solution's benefits and accuracy, especially for machine learning applications, will only show themselves if the data is hyper-realistic.

New tools developed with realistic synthetic data assets allow the insurance provider to assess where and how to use them along the care journey. The Humana synthetic data sandbox allows product developers to work in a production-like data environment without the security risks or lengthy access processes.

The Humana synthetic data sandbox provides realistic, granular data samples

Synthetic healthcare data is on the rise in Europe too

On the other side of the Atlantic, the European Union has published a research paper in which synthetic versions of 2 million cancer patients' records were generated. According to their assessment, the resulting synthetic data's accuracy (98.8%) makes it suitable for collaborative research projects across institutions and countries.

The synthetic patient data was also rebalanced during the synthesization process, making it represent minority groups better. This is crucial for training machine learning models, which might otherwise not be able to pick up on rare cancer types. The EU's Joint Research Centre expects that synthetic data will revolutionize medical AI by eliminating the data hurdle.

The Humana synthetic data blueprint for healthcare data management

AI-generated synthetic data is the tool enabling a wide variety of data-driven use cases in healthcare and health insurance - fighting cancer is only one of them. Humana’s Data Exchange offers real hope for the acceleration of health innovation. Without data, nothing is possible.

Humana is way ahead of others on its synthetic data journey and is already exploring ways to use synthetic data for ethical AI. Undoubtedly, this is one of the most exciting use cases of synthetic data, bringing fairness, privacy and explainability to AI and machine learning models. Check out the Fair synthetic data and ethical AI in healthcare podcast episode, where Laura Mariano, Lead Ethical AI Data Scientist, and Brent Sundheimer, Principal AI Architect at Humana, explain how fair synthetic data helps them create fair predictions!
