💡 Download the complete guide to AI-generated synthetic data!
Go to the ebook

There is something different about Merkur Versicherung AG. It’s the oldest insurance company in Austria, but it doesn’t feel like it.

For starters, there’s the Campus HQ in Graz. An Illuminous race track lines the floor of the open plan space. The vibrant lobby is filled with eclectic artwork and unconventional furniture. And there’s a beautiful coffee dock beside the fully functioning gym in the lobby.

Then, there’s the people. One team in particular stands out amongst the crowd: the Merkur Innovation Lab. A group of self professed “geeks, data wizards, future makers” with some “insurance guys” thrown in for good measure. Insurance innovation is born right here. Daniela Pak-Graf, the managing director of Merkur Innovation Lab — the innovation arm of Merkur Insurance, told us in the Data Democratization Podcast:

“Merkur Innovation Lab is the small daughter, the small startup of a very old company. Our CEO had the idea, we have so much data, and we're using the data only for calculating insurance products, calculating our costs, and in the era of big data of Google, of Amazon, Netflix, there have to be more possibilities for health insurance data too. He said, "Yes,  a new project, a new business, what can we do with our data?" Since 2020, we are doing a lot.”

Oh, and then there’s synthetic health data.

The Merkur Innovation Lab has fast become a blueprint for other organizations looking to develop insurance innovations by adopting synthetic data. In the following, we’ll introduce three insurance innovations powered by synthetic data adoption.

Insurance innovation no. 1: data democratization


Like many other data-driven teams, the Merkur Innovation Lab team faced the challenge of ensuring data privacy while still benefiting from valuable insights. The team experimented with data anonymization and aggregation but realized that it fell short of providing complete protection. The search for a more comprehensive solution led them to the world of synthetic data.


According to Daniela Pak-Graf, the solution to the problem is synthetic data:

"We found our way around it, and we are innovating with the most sensitive data there is, health data. Thanks to MOSTLY."

Merkur didn’t waste time in leveraging the power of synthetic data to quickly unlock the insights contained within their sensitive customer data. The team has created a beautifully integrated and automated data pipeline that enables systematic synthetic data generation on a daily basis, fueling insurance innovations across the organization. Here’s how they crafted their synthetic data pipeline:


The end-to-end automated workflow has cut Merkur’s time-to-data from 1-month, to 1-day. The resulting synthetic granular health data is read into a dynamic dashboard to showcase a tailored ‘Monetary Analysis’ of Merkur’s customer population. And the data is available for consumption by anyone at any time. True data democratization and insurance innovation on the tap.

Insurance innovation no. 2: external data sharing


As we know, traditional data sharing approaches, particularly in sensitive industries like health and finance, often faced complexity due to regulatory constraints and privacy concerns. Synthetic data offered a quick and secure solution to facilitate data collaboration, without which scaling insurance innovations would be impossible.


According to Daniela:

“...one of the biggest opportunities is working with third parties. When speaking to other companies, not only insurance companies, but companies working with health data or customer data, there's always the problem, "How can we work together?" There are quite complex algorithms. I don't know, homomorphic encryption. No one understands homomorphic encryption, and it's not something which can be done quickly. Using synthetic data, it's a quick fix if you have a dedicated team who can work with synthetic data.”


One exciting collaboration enabled by synthetic data is Merkur Innovation Lab’s work with Stryker Labs. Stryker Labs is a startup focused on providing training management tools for professional athletes. The collaboration aims to extend the benefits of proactive healthcare and injury prevention to all enthusiasts and hobby athletes by merging diverse datasets from the adjacent worlds of sport and health. Daniela explained the concept:

“The idea is to use their expertise and our knowledge about injuries, the results, the medication, how long with which injury you have to stay in hospital, what's the prescribed rehabilitation, and so on. The idea is to use their business idea, our business idea, and develop a new one where the prevention of injuries is not only for professional sports, but also for you, me, the occasional runner, the occasional tennis player, the occasional, I don't know.”

This exciting venture has the potential to improve the well-being of a broader and more diverse population, beyond the privileged few who make it into the professional sporting ranks.

Insurance innovation no. 3: empowering women in healthcare

Another promising aspect of synthetic data lies in its potential to address gender bias and promote fairness in healthcare. By including a more diverse dataset, synthetic data can pave the way for personalized, fairer health services for women. In the future, Merkur Innovation Lab plans to leverage synthetic data to develop predictive models and medication tailored for women; it marks a step towards achieving better healthcare equality. According to Daniela:

“...it could be a solution to doing machine learning, developing machine learning algorithms with less bias. I don't know, minorities, gender equality. We are now trying to do a few POCs. How to use synthetic data for more ethical algorithms and less biased algorithms.”

The insurance industry and innovation

Insurance companies have always been amongst the most data-savvy innovators. Looking ahead, we predict that the insurance sector will continue to lead the way in adopting sophisticated AI and analytics. The list of AI use cases in insurance continues to grow and with it, the need for fast and privacy safe data access. Synthetic data in insurance unlocks the vast amount of intelligence locked up in customer data in a safe and privacy-compliant way. Synthetic healthcare data platforms are becoming a focal point for companies looking to accelerate insurance innovations.

The Merkur Innovation Lab team of “geeks, data wizards, future makers” are only getting started on their synthetic data journey. However, they can already add “synthetic data trailblazers” to that list. They join a short (but growing) list of innovators in the Insurance space, like our friends, Humana, who are creating winning data-centric products with their synthetic data sandbox.

Funding for digital healthcare has doubled since the pandemic. Yet, a large gap between digital leaders and life sciences remains. The reason? According to McKinsey's survey published in 2023, a lack of high-quality, integrated healthcare data platform is the main challenge cited by medtech and pharma leaders as the reason behind the lagging digital performance. As much as 45 percent of these companies' tech investments go to applied artificial intelligence (AI), industrialized machine learning (ML), and cloud and edge computing - none of which can be realized without meaningful data access. 

Healthcare data challenges
Source: Top ten observations from 2022 in life sciences digital and analytics, McKinsey

Download the healthcare data ebook with case studies

All you need to know about how synthetic healthcare data is used in research and development with case studies about improving prediction accuracy and building synthetic health data platforms for data sharing. 

Why are healthcare data platforms mission-critical? 

Healthcare data is hard to access

Since healthcare data is one of the most sensitive data types with special protections from HIPAA and GDPR, it's no surprise that getting access is incredibly hard. MOSTLY AI's team has been working with healthcare companies and institutions closely, providing them with the technology needed to create privacy safe healthcare data platforms. We know firsthand how challenging it is to conduct research and use data-hungry technologies already standard in other industries like banking and telecommunications. 

Efforts to unlock health data for research and AI applications are already under way on a federal level in Germany. The German data landscape is an especially challenging one. The country is made up of 16 different federal states, each with its own laws and regulations as well as siloed health data repositories.

InGef, the Institute for Applied Health Research in Berlin and other renowned institutes such as Fraunhofer, Berliner Charité and the Federal Institute for Drugs and Medical Devices set out to solve this problem. This multi-year program is sponsored by the Federal Ministry of Health, a public tender MOSTLY AI won in 2022 to support the program with its synthetic data generation capabilities. The goal here is to develop a healthcare data platform to improve the secure use of health data for research purposes with as-good-as-real, shareable synthetic data versions of health records.

According to Andreas Ponikiewicz, VP of Global Sales at MOSTLY AI, 

"Enabling top research requires high quality data that is often either locked away, siloed, or sparse. Healthcare data is considered as very sensitive, therefore the highest safety standards need to be fulfilled. With generative AI based synthetic healthcare data, that contains all the statistical patterns, but is completely artificial, the data can be made available without privacy risk. This makes the data shareable to achieve better collaboration, research outcomes, diagnoses and treatments, and overall service efficiencies in the healthcare sector, which ultimately benefits society overall”.

Protecting health data should be, without a doubt, a priority. According to recent reports, healthcare overtook finance as the most breached industry in 2022, with 22% of data breaches occurring within healthcare companies, up 38% year over year. The so-called deidentification of data required by HIPAA is the go-to solution for many, even though simply removing PII or PHI from datasets never guarantees privacy.

Old-school data anonymization techniques, like data masking, not only endanger privacy but also destroy data utility, which is a major issue for medical research. Finding better ways to anonymize data is crucial for securing data access across the healthcare industry. A new generation of privacy-enhancing technologies is already commercially available and ready to overcome data access limitations. The European Commission's Joint Research Center recommends AI-generated synthetic data for healthcare and policy-making, eliminating data access issues across borders and institutions. 

Healthcare data is incomplete and imbalanced

Research and patient care suffer due to incomplete, inaccurate, and inconsistent data. Filling in the gaps is also an important step in making the data readable for humans. Human readability is especially mission-critical in health research and healthcare, where understanding and reasoning around data is part of the research process. Machine learning models and AI algorithms need to see the data in its entirety too. Any data points masked or removed could contain important intelligence that models can learn from. As a result, heavily masked or aggregated datasets won't be able to teach your models as well as complete datasets with maximum utility.

To overcome these limitations, more and more research teams turn to synthetic data as an alternative. Successfully implemented healthcare data platforms fully take advantage of the potential of synthetic data beyond privacy. Synthetic data generated from real seed data offer privacy as well as data augmentation opportunities perfect for improving the performance of machine learning algorithms. The European Commission's Joint Research Center investigated the synthetic patient data option for healthcare applications closely and found that (the):

"Resulting data not only can be shared freely, but also can help rebalance under-represented classes in research studies via oversampling, making it the perfect input into machine learning and AI models"

Rebalancing with MOSTLY AI's synthetic data platform
Data rebalancing in MOSTLY AI’s synthetic data platform

Healthcare data is biased

Data bias can take many shapes and forms from imbalances to missing or erroneous data on certain groups of people. Synthetic data generation is a promising tool that can help correct biases embedded in datasets during data preparation. The first important step is to even find the bias in the first place. The more eyes you have on the data, the better the chances of identifying hidden biases. Clearly, this is impossible with sensitive healthcare datasets. Synthetic versions of data can increase the level of access and transparency of important data assets.

Health data is the most sensitive data type

A few years ago a brand new drug, developed for treating Spinal Muscular Atrophy, broke all previous records as the most expensive therapy in the world. If you were to query supposedly anonymous health insurance databases around that time, you could easily identify children who received this drug, simply because the price was such an outlier. But even in the absence of such an extreme value, reidentifying people by linking separate data points together is not a difficult thing to do.

Surely, locking health data up and securing the environment where the data is stored is the way to go. Except that most data leaks actually originate with a company's own employees. Usually, there is no malicious intent either. A zero-trust approach helps only to an extent and fails to protect from mistakes and accidents that are bound to happen. Locking data up also means fewer data collaborations, smaller sample sizes, less intelligence, higher costs, worse predictions, and ultimately, more suffering.

For example, the improvement of predictive accuracy of machine learning models regarding health outcomes can not only save costs for healthcare providers but also decrease suffering. The more granular the data, the better the capabilities in predicting how certain cohorts of patients will react to treatments and what is the likelihood of a good outcome.

In one instance, MOSTLY AI's synthetic data generator was used for the synthetic rebalancing of patient records and predicting which patients would benefit from a therapy that can come with serious side effects. John Sullivan, MOSTLY AI's Head of Customer Experience, has seen how synthetic data generation can transform health predictions from up close. 

"We worked on a dataset for a large North American healthcare company and achieved an increase of accuracy in true positives predicted by the down-stream ML model in the range of 7-13% against a target of 5-10%. The improved model performance means potentially hundreds of patients benefit from early identification, and more importantly, early treatment of their illness. It's huge and extremely motivating." 

Considering that 85% of machine learning models don't even make it into production, synthetic training data is a massive win for data science teams and patients alike. 

Synthetic healthcare data types

Different data types are present along the patient journey, requiring a wide range of tools to extract intelligence from these data sources at scale. Healthcare data platforms should cover the entire patient journey, providing data consumers with an environment ready for 'in-silico' experiments.

Healthcare data types
Source: Databricks - Unlocking the power of health data with a modern data lakehouse

In healthcare, image data in particular receives a lot of attention. From improved breast cancer detection to AI-powered surgical guidance systems, medical science is undergoing a profound transformation. However, the AI revolution doesn't stop at image analyses and computer vision. 

Tabular healthcare data is another frontier for new artificial intelligence applications. Synthetic data generated by AI trained on real datasets is one of the most versatile tools with plenty of use cases and a robust track record, allowing researchers to collaborate even across borders and institutions.  

Examples of tabular healthcare data include electronic health records (EHR) of patients and populations, electronic medical records generated by patient journeys (EMR), lab results, and data from monitoring devices. These highly sensitive structured data types can all be synthesized for ethical, patient-protecting usage without utility loss. Synthetic health data can then be stored, shared, analyzed, and used for building AI and machine learning applications without additional consent, speeding up crucial steps in drug development and treatment optimization processes.

Rare disease research suffers from a lack of data the most. Almost always, the only way to produce significant results is to share datasets across national borders and between research teams. This process is sometimes near-impossible and excruciatingly slow at best. Researchers need to be working on the same data in order to make the same conclusions and validate findings. Synthetic versions of datasets can be shared and merged without compliance issues or privacy risks, allowing rare disease research to advance much quicker and at a significantly lower cost. On-boarding researchers and scientists to healthcare data platforms populated with synthetic data is easy to do, since the synthetic version of datasets does not legally qualify as personal data.

Use cases and benefits for AI-generated synthetic data platforms in healthcare

Let's summarize the most advanced healthcare data platform use cases and their benefits for AI-generated synthetic data. These are based on our experience and direct observations of the healthcare industry from up-close.

Machine learning model development with synthetic data

The reason why most machine learning projects fail is a lack of high-quality, large-quantity, realistic data. Synthetic data can be safely used in place of real patient data to train and validate machine learning models. Synthetic data generators can shorten time-to-market by unlocking valuable data assets by taking care of data prep steps such as data exploration at a granular level, data augmentation, and data imputation. Since data replaced code, healthcare data platforms became the most important part of MLOps infrastructures and machine learning product development.

Data synthesis for data privacy and compliance

Protected health information, or PHI, is heavily regulated both by HIPAA and GDPR. PHI can only be accessed by authorized individuals for specific purposes, which makes all secondary use cases and further research practically impossible. Organizations disregarding these rules face heavy fines and eroding patient trust. Synthetic data can be used to protect patient privacy by preserving sensitive information while still allowing for data analysis.

Testing and simulation with synthetic data generators

Synthetic data helps researchers to forecast the effects of greater sample size and longer follow-up duration on already existing data, thus informing the design of the research methodology. MOSTLY AI's synthetic data generator, in particular, is well suited to carry out on-the-fly explorations, allowing researchers to query what-if scenarios and reason around data effectively.

Data repurposing for research and development 

Often, using data for secondary purposes is challenging or downright prohibited by regulators. Synthetic data can overcome these limitations and be used as a drop-in placement to support research and development efforts. Researchers can also use synthetic data to create a so-called 'synthetic control arm.' According to Meshari F. Alwashmi, a digital health scientist:

"Instead of recruiting patients to sign up for trials to not receive the treatment (being the control group), they can turn to an existing database of patient data. This approach has been effective for interpreting the treatment effects of an investigational product in trials lacking a control group. This approach is particularly relevant to digital health because technology evolves quickly and requires rapid iterative testing and development. Furthermore, data collected from pilot studies can be utilized to create synthetic datasets to forecast the impact of the intervention over time or with a larger sample of patients."

Population health analysis for policy-making 

AI-generated synthetic data can be used to model and analyze population health, including disease outbreaks, demographics, and risk factors. According to the European Commission's Joint Research Center, synthetic data is "becoming the key enabler of AI in business and policy applications in Europe." Especially since the pandemic, the urgency to provide safe and accessible healthcare data platform on the population level is on the increase and the idea of data democratization is no longer a far-away utopia, but a strong driver in policy-making.   

Data collaborations for innovation

Creating healthcare data platforms by proactively synthesizing and publishing health data is a winning strategy we see at large organizations. Humana, one of the largest health insurance providers in North America, launched a synthetic data exchange platform to facilitate third-party product development. By joining the sandbox, developers can access synthetic, highly granular datasets and create products with laser-sharp personalizations, solving real-life problems. In a similar project, Merkur Insurance in Austria uses MOSTLY AI’s synthetic data platform to develop new services and personalized customer experiences. For example, to develop machine learning models using privacy-safe and GDPR-compliant training data. 

Ethical and explainable AI

Synthetic data generation provides additional benefits to AI and machine learning development. Ethical AI is one area where synthetic data research is advancing rapidly. Fair models and predictions need fair data inputs and the process of data synthesis allows research teams to explore definitions of fairness and their effects on predictions with fast iterations. Furthermore, the introduction of the first AI regulations is only a matter of time. With regulations comes the need for explainability. Explainable AI needs synthetic data - a window into the souls and inner workings of algorithms, without which trust can never be truly established. 

Healthcare data platforms populated with curated synthetic data assets, like Humana's Data Exchange, are the future of research and product development. Results are already convincing and proof of concept is strong. If you would like to explore what a synthetic healthcare data platform can do for your company or research, contact us and we'll be happy to share our experience and know-how.

Data intelligence is locked up. Machine learning and AI in insurance is hard to scale and legal obligations make the job of data scientists and actuaries extremely difficult. How can you still innovate under these circumstances? The insurance industry is being disrupted as we speak by agile, fast moving insurtech start-ups and new, AI-enabled services by forward-thinking insurance companies.

According to McKinsey, AI will “increase productivity in insurance processes and reduce operational expenses by up to 40% by 2030”. At the same time, old-school insurance companies struggle to make use of the vast troves of data they are sitting on and sooner or later, ambitious insurtechs will turn their B2B products into business to consumer offerings, ready to take sizeable market shares. Although traditional insurance companies’ drive to become data-driven and AI-enabled is strong, organizational challenges are hindering efforts.

Laggers have a lot to lose. If they are to stay competitive, insurance companies need to redesign their data and analytics processes and treat the development of data-centric AI in insurance with urgency.

The bird's-eye view on data in the insurance industry

Data has always been the bread and butter of insurers and data-driven decisions in the industry predate even the appearance of computers. Business-critical metrics have been the guiding force in everything insurance companies do from pricing to risk assessment. CLV, or customer lifetime value has long been one of the most important metrics for insurers.

KYC, or know your customer, is a risk assessment metric that uncovers the potential risks associated with certain customers through their data. Even today, most of these metrics are hand-crafted with traditional, rule-based tools that lack dynamism, speed and contextual intelligence. The scene is ripe for AI revolution. 

The insurance industry relies heavily on push sales techniques. Next best products and personalized offers need to be data-driven. For cross selling and upselling activities to succeed, an uninterrupted data flow across the organization is paramount, even though often Life and Non-Life business lines are completely separated. Missed data opportunities are everywhere and require a commitment to change the status quo.

Interestingly, data sharing is both forbidden and required by law, depending on the line of business and properties of data assets in question. Regulations are aplenty and vary across the globe, making it difficult to follow ambitious global strategies and turning compliance into a costly and tedious business. Privacy enhancing technologies or PETs for short can be of help and a modern data stack cannot do without them. Again, insurance companies should carefully consider how they build PETs into their data pipelines for maximum effect. 

The fragmented, siloed nature of insurance companies' data architecture can benefit hugely from using PETs, like synthetic data generators, enabling cloud adoption, data democratization and the creation of a consolidated data intelligence across the organization.

Structured data

Core System
General Ledger (Accounting)
PAS - Policy Admin System
Operational data
Regulatory software (e.g. IFRS)

Semi Structured data

Sensors (telematics, smart meters,
wearables, smart home IoT, etc.)

Unstructured data

Photos (cars, x-rays, etc.)
Satellite images

External Data Sources

Credit score data (Experian, Bloomberg etc.)
Third party data providers
Open source data (governmental statistic)
Open APIs
Synthetic data generator
Cloud/On-prem Data Storage 

Structure data

Automate Underwriting
Referral reduction
Distribution Quality Management
Pricing Optimization
Fraud Detection
Next Best Offers
Product Recommendation
Claims Triage
Personas Creation

Intelligent Automation

Claims Automation
Customer Service Support

Data Science/Analytics

Simple EDA
Process Mining
Business Process Reengineering
BI Dashboards

Data Sharing

Other business lines
Sales network

Insurtech companies willing to change traditional ways and adopting AI in insurance with earnestness have been stealing the show left, right and center. As far back as 2017, a US insurtech company, Lemonade announced to have paid out the fastest claim in the history of the insurance industry - in three seconds.

If insurance companies are to stay competitive, they need to turn their thinking around and start redesigning their business with sophisticated algorithms and data in mind. Instead of programming the data, the data should program AI and machine learning algorithms. That is the only way for truly data-driven decisions, everything else is smoke and mirrors. Some are wary of artificial intelligence and would prefer to stick to the old ways. That is simply no longer possible.

The shift in how things get done in the insurance industry is well underway already. Knowing how AI in insurance systems work and what their potential is with best practices, standards and use cases is what will make this transition safe and painless, promising larger profits, better services and frictionless products throughout the insurance market.

What drives you? The three types of AI use cases in insurance

Typically 80% of AI and machine learning use cases in insurance use tabular data, while unstructured data, such as images and text, account for the remaining 20%.  Data and data-centric AI and machine learning use cases can be categorized by the different business goals they should achieve.

1. Increase profits 

Examples of use cases include campaign optimization, targeting, next best offer/action, best rider based on contract and payment history.  According to the AI in insurance 2031 report by Allied Market Research, “the global AI in the insurance industry generated $2.74 billion in 2021, and is anticipated to generate $45.74 billion by 2031.”

2. Decrease costs

Examples of cost reduction use cases include reduced claims cost which is made up of two elements - claim amount and claim survey cost. According to KPMG, “investment in AI in insurance is expected to save auto, property, life and health insurers almost US$1.3 billion while also reducing the time to settle claims and improving customer loyalty.”

3. Increase customer satisfaction

AI-supported customer service and quality assurance AI systems can optimize claim processing and customer experience. Even small tweaks, optimized by AI algorithms can have massive effects. According to an IBM report, “claimants are more satisfied if they receive 80% of the requested compensation after 3 days, than receiving 100% after 3 weeks.”

What stops you? Five data challenges when building AI/ML models in traditional insurance companies

We spoke to insurance data practitioners about their day-to-day challenges regarding data consumption. There is plenty of room for improvement and for new tools making data more accessible, safer and compliant throughout the data product life-cycle. AI in insurance suffers from a host of problems, not entirely unrelated to each other. Here is a list of their five most pressing concerns.

1. Not enough data

Contrary to popular belief, insurance companies aren’t drowning in a sea of customer data. Since insurance companies have minimal interactions with customers in comparison with banks or telcos, there is less data available for making important decisions, health insurance being the notable exception. Typically the only time customers interact with insurance providers is when the contract is signed and if everything goes well, the company never hears from them again.

External datasources, like credit scoring data won’t give you a competitive edge either - after all it’s what all your competitors are looking at too. Also, the less data there is, the more important architecture, data quality and the ability to augment existing data assets becomes. Hence, investment into data engineering tools and capabilities should be at the top of the agenda for insurance companies. AI in insurance cannot be made a reality without meaningful, high-touch data assets.

2. Data assets are fragmented

To make things even more complicated, data sits in different systems. In some countries, insurance companies are prevented by law from merging different datasets or even managing them within the same system. For example property-related and life-related insurances often have to be separated. Data integration is therefore extremely challenging.

Cloud solutions could solve some of these issues. However, due to the sensitive nature of customer data, moving to the cloud is often impossible, and a lot of traditional insurers still keep all their data assets on premise. As a result, a well-designed data architecture is mission-critical for ensuring data consumption. That is not to day, that there are no data integrations in place already, although a consolidated view across all the different systems is hard to create.

Access authorizations are also in place as well as curated datasets, but if you want to access any data, that is not part of usual business intelligence activities, the process is very slow and cumbersome, keeping data scientists away from what they do best: come up with new insights. 

3. Cybersecurity is a growing problem

Very often cybersecurity is thought of as the task of protecting perimeters from outside attacks. But did you know, that 59% of privacy incidents originate with an organization’s own employees? No amount of security training can guarantee that mistakes won’t be made and so, the next frontier for cybersecurity efforts should tackle the data itself.

How can data assets themselves be turned into less hazardous pieces of themselves? Can old school data masking techniques withstand a privacy attack? New types of attacks are popping up attacking data with AI-powered tools, trying to reidentify individuals based on their behavioral patterns.

These AI-based re-identification attacks are yet another reason to ditch data masking and opt for new-age privacy enhancing technologies instead. Minimizing the amount of production data in use should be another goal added to the already long list of cybersec professionals.

4. Insurance data is regulated to the extreme

Since insurance is a strategically important business for governments, they are subjected to extremely high levels of regulation. While in some instances insurance companies are required by law to share their data with competitors, in others, they are prevented from using certain parts of datasets, such as gender, by law.

It’s a complicated picture with tons of hidden challenges data scientists and actuaries need to be aware of if they want to be compliant. Data stewardship is becoming an increasingly important role in managing data assets and data consumption within and outside the walls of organizations.

5. Data is imbalanced

Class distribution of the data is often imbalanced: rare events such as fraud or churn are represented in only a fraction of the data. This makes it hard for AI and machine learning models to pick up on patterns and learn to detect them effectively. This issue can be solved on more ways than one and data rebalancing is something data scientists often do.

Data augmentation is especially important for fraud and churn prediction models, where only a limited number of examples are available. Upsampling minority classes with AI-generated synthetic data examples can be a great solution, because the synthesization process results in a more sophisticated, realistic data structure, than more rudimentary upsampling methods.

The low-hanging AI/ML fruit all insurance companies should implement and how synthetic data can make models perform better

Underwriting automation

Automating underwriting processes is one of the first low hanging fruits insurance companies reach for when looking for AI and machine learning use cases in insurance. Typically, AI and machine learning systems assist underwriters by providing actionable insights derived from risk predictions performed on various data assets from third party data to publicly available datasets. The goal is to increase Straight Through Process (STP) rates as much as possible.

Automated underwriting processes are replacing manual underwriting in ever increasing numbers across the insurance industry and whoever wins the race to maximum automation, takes the cake. There are plenty of out of the box underwriting solutions promising frictionless Robotic Process Automation, ready to be deployed.

However, no model is ever finished and historical data used to train these models can go out of date, leading underwriters astray in their decision making. The data is the soul of the algorithm and continuous monitoring, updates and validation are needed to prevent model drift and optimal functioning.

The value of 6 AI in insurance use cases
AI in insurance is creating different value through different use cases

Pricing predictions

Actuaries have long been the single and mysterious source of pricing decisions. Prices were the results of massively complicated calculations only a handful of people really understood. Almost like a black box with walls made out of the most extreme performances of human intellect. Specialized tools, legal peculiarities and tradition in place of validation.

The dusty and mysterious world of actuaries is not going away any time soon due to the legal obligations insurance companies need to satisfy. However, AI and machine learning are capable of significantly improving the process with their ability to process magnitude more data than any single team of actuaries ever could. Machine learning models make especially potent pricing models. The more data they have, the better they do.

Fraud, anomaly and account take over prediction

Rule based fraud detection systems get very complicated very quickly. As a result, fraud detection is heavily expert-driven, costly to do and hard to maintain. Investigating a single potential fraud case can cost thousands of dollars. AI and machine learning systems can increase the accuracy of fraud detection and reduce the number of false positives, thereby reducing cost and allowing experts to investigate cases more closely.

With constantly evolving fraud methods, it’s especially important to be able to spot unusual patterns and anomalies quickly. Since AI needs plenty of examples to learn from, rare patterns, like fraud need to be upsampled in the training data.

This is how synthetic data generators can improve the performance of fraud and anomaly detection algorithms and increase their accuracy. Upsampled synthetic data records are better than real data for training fraud detection algorithms and they are able improve their performance by as much as 10%, depending on the type of machine learning algorithm used.

Next best offer prediction

Using CRM data, next best offer models support insurance agents in their upselling and cross selling activities. These product recommendations can make or break a customer journey. The accuracy and timing of these personalized recommendations can be significantly increased by predictive machine learning models. Again, data quality is a mission-critical ingredient - the more detailed the data, the richer the intelligence prediction models can derive from it.

To make next best action AI/ML models work, customer profiles and attributes are needed in abundance, with granularity and in compliance with data privacy laws. These recommendation systems are powerful profit generating tools, that are fairly easy to implement, given that the quality of the training data is sufficient.

Churn reduction

Even a few percentage points reduction in churn could be worth hundreds of thousands of dollars in revenue. It’s not surprising that prediction models, similar to those used for predicting next best offers or actions, are frequently thrown at the churn problem. However, identifying customers about to lapse on their bills or churn altogether is more challenging then making personalized product recommendations due to the lack of data.

Low churn rates lead to imbalanced data where the number of churning customers is too low for machine learning algorithms to effectively pick up on. Churn event censorship is another issue machine learning models find problematic, hence survival regression models are better suited to predict churn than binary classifiers. In this case, the algorithm predicts how long a customer is likely to remain a customer.

A series of churn prevention activities can be introduced for groups nearing their predicted time to event. Again, data is the bottleneck here. Loyalty programs can be great tools not only for improving customer retention in traditional ways, but also for mining useful data points for churn prediction models. 

Process mining

Insurance is riddled with massive, complicated and costly processes. Organizations no longer know why things get done the way they get done and the inertia is hard to resist. The goal of process mining is to streamline processes, reduce cost and time, while increasing customer satisfaction and compliance.

From risk assessments, underwriting automation, claims processing to back office operations, all processes can be improved with the help of machine learning. Due to the multi-faceted nature of process mining, it’s important to align efforts with business goals.

Get hands-on with synthetic training data

The world's most powerful, accurate and privacy-safe synthetic data generator is at your fingertips. Generate your first synthetic dataset now for free straight from your browser. No credit card or coding required.

The four most inspiring AI use cases in insurance with examples

Startups are full of ideas and are not afraid to turn them into reality. Of course, it’s much easier to do something revolutionary from the ground up, then to tweak an old structure into something shiny and new. Still, there are tips and tricks ready to be picked up from young, ambitious insurtech companies. Go fast and break things should not be one of them though, at least not in the insurance industry.

1. Risk assessment with synthetic geospatial imagery

How about going virtual with remote risk assessments? Today’s computer vision technology certainly makes it possible, where visual object detection does the trick. These models can assess the risks associated with a property just by looking at them - they recognize a pool, a rooftop, courtyard. They have a good idea of the size of a property and its location.

In order to train these computer vision programs and to increase their speed and accuracy, synthetic images are used for training them. AI-powered, touchless damage inspections are on offer too for car insurers, ready to buy third party AI solutions.

2. Detect fraud and offer personalized care to members

Anthem Inc teamed up with Google Cloud to create a synthetic data platform that will allow them to detect fraud and offer personalized care to members. This way, medical histories and data, healthcare claims can be accessed without privacy issues and used for validating and training AI designed to spot financial or health-related anomalies that could be signs of fraud or undetected health issues, that could benefit from early interventions.

Open data platforms are becoming more and more common driven by synthetic data technology. Humana, a large North American health insurance provider used MOSTLY AI’s synthetic data generator to synthesize medical records for research and development while keeping the privacy of their customers intact.

3. Price prediction with synthetic geodata

Home insurance pricing can be greatly improved with synthetic geodata. Similarly to how synthetic images can help home insurers predict risk, climate data, crime statistics and similar, publicly available databases can be looked at by combining them with synthetic coordinates, so as to protect the privacy of individuals, yet improve pricing accuracy at a granular level.

We all know how the same neighborhood or ZIP code can be two different worlds just a few streets down the road. However, it is much less known how painfully underinsured inland America is when it comes to flood insurance, even though some areas are prone to flooding and increasingly so in the face of climate change. Nine counties were declared disaster areas after Hurricane Ian destroyed some parts of the US in September 2022.

According to Politico, of the 1.8 million households in those nine counties, only 29 percent have federal flood insurance. The rest may or may not have private flood insurance, which is typically not part of an average home insurance package.

Insurance companies need to step up their game in making sure that not only coastal areas, but other, less obviously climate sensitive areas receive the right insurance coverage. AI could be a huge help here, however, CCPA and HIPAA forbids insurers from using personal data in model training. Synthetic home addresses can be used to train AI and machine learning models in a compliant and privacy safe way.

A North American insurer used synthetic home addresses to inject new domain knowledge into their pricing strategy. Using synthetic geodata, they could add public climate information to their home insurance pricing model, improving its accuracy significantly. The result is more competitive pricing with less risk and better offers for previously underserved customer segments.  

4. AI-supported customer service

Natural Language Processing is one of the areas in artificial intelligence which experienced a massive boom in recent years. The transcripts of calls is a treasure trove of intelligence, allowing insurance companies to detect unhappy customers with sentiment analyses, preventing churn with pre-emptive actions and reducing costs on the long run.

By monitoring calls for long pauses, companies can identify customer service reps who might be in need of further training to improve customer experience. Customer service reps can also receive AI-generated help in the form of automatically created summaries of customer histories with the most important, likely issues flagged for attention.

Using transcripts containing sensitive information for training AI systems could be an issue and to prevent privacy issues, AI-generated synthetic text can replace the original transcripts for training purposes. Conversational AI also needs plenty of meaningful training data, otherwise you’ll end up with chatbots destroying your reputation faster than you can build it back up. 

The six most important synthetic data use cases in insurance

When we talk to insurance companies about AI, the question of synthetic data value comes up quickly. It's no surprise, since insurance companies operate in the thinnest of margins. Making sure that companies take the highest value out of synthetic data is crucial. However, not all use cases come with the same value. For example, when it comes to analytics, the highest value use cases come with the highest need for synthetic data assets.

Synthetic data value chart in analytics

AI-generated synthetic data is not only an essential part of the analytics landscape. Here are the most important synthetic data use cases in insurance, ready to be leveraged.

Data augmentation for AI and machine learning development

The power of generative AI is evident when we see AI-generated art and synthetic images. The same capabilities are available for tabular data, adding creativity and flexibility to data generation. Unlike other data transformation tools, AI-powered ones are capable of handling datasets as a whole, with relationships between the data points and the intelligence of the data kept intact.

Examples of these generative data augmentation capabilities for tabular data include rebalancing, imputation and the use of different generation moods from conservative to creative. To put it simply, MOSTLY AI's synthetic data generator can be used for designing data, not only to generate it. Your machine learning models trained on designer synthetic data can benefit from entire new realities, synthesized in accordance with predefined goals, such as fairness or diversity.

Data privacy

Synthetic data generators were first built to overcome regulatory limitations imposed on data sharing. By now, synthetic data generators are full-fledged privacy-enhancing technology with robust, commercially available solutions and somewhat less reliable open source alternatives. Synthetic data is a powerful tool to automate privacy and increase data access across organizations, however, not all synthetic data is created equal. Choose the most robust solution offering automated privacy checks and a support team experienced with insurance data.

Explainable AI

AI is not something you build and can forget about. Constant performance evaluations and retraining are needed to prevent model drift and validate decisions. Insurance companies are already creating synthetic datasets for retraining models and to improve their performance. With AI-regulations soon coming into effect, providing explainability to AI models in production will become a necessary exercise too.

Modern AI and machine learning models work with millions of model parameters, that are impossible to understand unless observed in action. Transparent algorithmic decisions need shareable, open data assets, allowing the systematic exploration of model behavior.

Synthetic data is a vital tool for AI explainability and local interpretability. Providing a window into the souls of algorithms, data-based explanations and experiments will become standard parts of the AI lifecycle. AI in insurance is likely to get frequent regulatory oversight, which makes synthetic training data a mission-critical asset.

Data sharing

Since insurance companies tend to operate in a larger ecosystem made up of an intricate sales network, individual agents, travel agencies, service providers and reinsurers, effective, easy and compliant data sharing is a mission-critical tool to have at hand.

At MOSTLY AI, we have seen large insurance companies flying data scientists over to other countries in order to access data in a compliant way. Life and non-life business lines are strictly separated, making cross-selling and intelligence sharing impossible to do. Cloud adoption is the way to go, however, it’s still a distant future for a lot of traditional insurance companies, whose hands are tied by regulatory compliance. Insurance companies should consider using synthetic data for data sharing.

Since AI-generated synthetic data retains the intelligence contained within the original data it was modelled on, it provides a perfect proxy for data-driven collaborations and other datasharing activities. Synthetic data can safely and compliantly uploaded to cloud servers, giving a much needed boost to teams looking to access high-quality fuel for their AI and machine learning projects.

Humana, one of the largest North American health insurance providers created a Synthetic Data Exchange to facilitate the development of new products and solutions by independent developers. Providers are also using synthetic data to improve the accuracy of predicitve analytics in healthcare.

IoT driven AI

Smart home devices are the most obvious examples of how sensor data can be leveraged for predictive analytics to predict, detect and even prevent insurance events. Home insurance products paired with smart home electronics can offer a powerful combination with competitive pricing and tons of possibilities for cost reduction.

Telematics can and most likely will revolutionize the way cars are insured, alerting drivers and companies of possibly dangerous driving behaviors and situations. Health data from smart watches can pave the way for early interventions and prevent health issues on time. Win-win situations for providers and policy holders as long as data privacy can be guaranteed.

Software testing

Insurance companies develop and maintain dozens if not hundreds of apps serving customers, offering loyalty programs, onboarding agents and keeping complex marketing funnels flowing with contracts. Bad test data can lead to serious flaws, however, production data is off-limits to test engineers both in-house and off-shore. Realistic, production-like synthetic test data can fill the gap and provide a quick and high-quality alternative to manually generated mock data and to unsafe production data.

How to make AI happen in insurance companies in four steps

1. Build lighthouse projects

In order to convince decision makers and get buy-in from multiple stakeholders, it’s a good idea to build lighthouse projects, designed to showcase what machine learning and AI in insurance is capable of.

2. Synthesize data proactively

Take data democratization seriously. Instead of siloing data projects and data pools and thinking by departments, synthesize core data assets proactively and let data scientists and machine learning engineers pull synthetic datasets on-demand from a pre-prepared platform. For example, synthetic customer data is a treasure trove of business intelligence, full of insights, which could be relevant to all teams and use cases. 

3. Create a dedicated interdepartmental data science role

In large organizations, it’s especially difficult to fight against the inertia of massively complicated systems and departmental information silos. Creating a role for AI enablement is a crucial step in the process. Assigning the responsibility of collecting, creating and serving data assets and coordinating, facilitating and driving machine learning projects across teams from zero to production is a mission-critical piece of the organizational puzzle. 

4. Create the data you need

Market disruptors are not afraid to think creatively about their data needs and traditional insurance companies need to step up their game to keep the pace. Creating loyalty programs for learning more about your customers is a logical first step, but alternative data sources could come in many forms.

Think telematics for increasing drivers’ safety or using health data from smart devices for prevention and for designing early intervention processes could all come into play. Improving the quality of your existing data assets and enriching them with public data are also possible by using synthetic data generation to upsample minority groups or to synthesize sensitive data assets without losing data utility. Think of data as a malleable, flexible modelling clay that is the material for building AI and machine learning systems across and even outside the organization.

5. Hire the right people for the right job

The hype around AI development is very real, resulting in a job market saturated with all levels of expertise. Nowadays, anyone who has seen a linear regression model up close might claim to be an experienced AI and machine learning engineer. To hire the best, most capable talent, ready to venture into advanced territories, you need to have robust processes in place with tech interviews designed with future tasks in mind. Check out this excellent AI engineer hiring guide for hands-on tips and guidance on hiring best practices. AI in insurance needs not only great engineers, but a high-level if domain knowledge.

Health insurance companies have long been in the front lines of data-driven decision making. From advanced analytics to AI and machine learning use cases in insurance applications, data is everywhere. Increasing the accessibility of these most valuable datasets is a business-critical mission, that is ripe for a synthetic data revolution. The Humana synthetic data sandbox is a prime example of how the development of data-centric products, such as those using AI and machine learning can be accelerated.

Humana, the third largest health insurance provider in the U.S. published a synthetic data exchange platform. The aim is to unleash new insights and bring advanced products to the market. The data exchange offers access to synthetic patient data with a total of 1 500 000 synthetic records that is representative of Humana’s member population.

Download the healthcare data ebook with case studies

All you need to know about how synthetic healthcare data is used in research and development with case studies about improving prediction accuracy and building synthetic health data platforms for data sharing. 

The data challenge in health insurance

Data sharing in a highly regulated and sensitive environment is a hard, slow and often painful process. Legal and regulatory pressures make it impossible to collaborate with external vendors efficiently. What’s more, health insurance providers want to be good shepherds of the sensitive data their patients trust them with. The Humana synthetic data exchange allows product developers to run faster tests and learns and to deliver better value to its members. All of this, while keeping their personal healthcare information perfectly safe.

Healthcare data platforms are not only benefiting health insurance companies, but are used to accelerate research and policy making across the world.

Our favorite synthetic data innovation hub

To overcome challenges, Humana set up a synthetic data sandbox. Using these granular, high-quality synthetic datasets, the relationships between different variables of interest as well as the important context in which patient care takes place are preserved. Developers and data scientists can identify where the care journey has taken a synthetic individual, how they interacted with different sites of care and maintain the specificity of the data without being identifiable.

A synthetic data dashboard gives instant access to the data. Sample datasets can be downloaded for schema and data quality exploration. Plus a comprehensive data dictionary serves as documentation. The synthetic datasets provide data on demographics and coverage details, medical and pharmacy claims, dates, diagnosis, sites of care with maintained correlations and relationships throughout. This is exactly how a synthetic data sandbox should be.

The advantages of a synthetic data sandbox

By offering easy and safe access to high quality synthetic data, Humana gives developers the most important ingredient for successful product development. This granular, as-good-as-real source of knowledge is invaluable in identifying cohorts and improving customer experience. Proof of value is also much easier to come by. A solution’s benefits and accuracy, especially of machine learning applications, will only show themselves if the data is hyper-realistic.

New tools developed with realistic synthetic data assets allow the insurance provider to assess where and how to use them along the care journey. The Humana synthetic data sandbox allows product developers to work in a production-like data-environment without the security risks or lengthy access processes.

The Humana synthetic data sandbox provides realistic, granular data samples

Synthetic healthcare data is on the rise in Europe too

On the other side of the Atlantic, the European Union has published a research paper in which they generated synthetic versions of 2 million cancer patient's records. According to their assessment, the resulting synthetic data's accuracy (98.8%) makes it suitable for collaborative research projects across institutions and countries.

The synthetic patient data was also rebalanced during the synthesization process, making it represent minority groups better. This is crucial for training machine learning models, which might not be able to pick up on rare cancer types. The EU's Join Research Center expects that synthetic data will revolutionize medical AI by eliminating the data hurdle.

The Humana synthetic data blueprint for healthcare data management

AI-generated synthetic data is the tool enabling a wide variety of data-driven use cases in healthcare and health insurance - fighting cancer is only one of them. Humana’s Data Exchange offers real hope for the acceleration of health innovation. Without data, nothing is possible.

Humana is way ahead of others in their synthetic data journey and are already exploring ways to use synthetic data for ethical AI. Undoubtedly, this is one of the most exciting use cases of synthetic data, providing fairness, privacy and explainability to AI and machine learning models. Check out the Fair synthetic data and ethical AI in healthcare podcast episode, where Laura Mariano, Lead Ethical AI Data Scientist and Brent Sundheimer, Principal AI Architect at Humana explain how fair synthetic data helps them create fair predictions!

When insurers apply a deep understanding of consumers and policy holders to their operations, three opportunities to generate revenue emerge.

Most industries are competitive, but few are more so than insurance. With the bundling of products, education – informing consumers that premiums are merely one indicator of value – and catchy branding, today’s insurers have come a long way in creating brand awareness and loyalty.

Even so, the insurance market is marked by razor-thin margins, aggressive competitors and fickle consumers. Here are three ways insurers can increase their profits in such a challenging environment.

The answers to all of these questions can be found in the data customers and prospective customers create in their interactions every day; however, this information is often out of reach, expensive to gather, or too difficult to work with because of privacy and legal concerns. Fortunately, with synthetic data, insurers can overcome all of these issues with ease.

Synthetic data: the key to profitability for insurance companies

AI-generated synthetic data is a must-have tool for insurance companies, ready to be leveraged across a wide variety of insurance-specific use cases. Find out how to improve improve pricing and risk predictions, create meaningful and safe test data and be fully compliant with the strictest data privacy laws, like GDPR.