Episode 17

Synthetic data engineering in insurance and banking with Jim Hu, MOSTLY AI's Sales Engineer

Hosted by
Alexandra Ebert and Jeffrey Dobin
Meet the latest guest of the Data Democratization Podcast, Jim Hu. Jim is MOSTLY AI's Sales Engineer, and in this episode, he talks about the most exciting synthetic data engineering work he has been doing. Synthetic data is quickly becoming a must-have engineering tool across industries, with new use cases emerging every day. If you want to learn how to leverage synthetic data in insurance and banking, listen to the episode! Learn how to maximize synthetic data's business impact and find out:
  • how to increase profit margins in insurance using synthetic data
  • what are the best synthetic data use cases in banking
  • how to maximize the lifetime revenue of a loan
  • how to detect suspicious transactions and prevent fraud
  • how to estimate credit risk using machine learning
  • how to unlock transaction data for AI training
  • why synthetic data is superior to legacy data anonymization techniques
  • how to optimize software testing processes
  • how to create personalized banking experiences
If you want to read more synthetic data case studies, download our enterprise guide to synthetic data!

Transcript

Jeffrey Dobin: Welcome to another episode of the Data Democratization podcast. I'm Jeffrey Dobin, privacy lawyer and tech expert over at Duality Technologies, and here with me is my co-host and partner in crime, Alexandra Ebert, the chief trust officer and data privacy advocate for MOSTLY AI, the category-leading synthetic data company. Hello, Alexandra.

Alexandra Ebert: Hi, Jeff. It's great to be back.

Jeffrey: Absolutely. We have another MOSTLY AI synthetic data specialist today, right?

Alexandra: You're right. I had the pleasure of talking to one of our colleagues and doing a deep dive into synthetic data use cases across the banking and insurance industries. Super exciting stuff.

Jeffrey: Awesome. I'm curious to hear what Jim has been up to lately. I used to work closely with him. He is a core member of the MOSTLY AI team, a seasoned data engineer with tons of hands-on experience with synthetic data, and he also has a pretty strong background in the financial services industry, working with data every step of the way for some big financial players. He knows exactly how frustrating it is to not be able to access data, or to work with heavily masked data that has low data value or utility. He is definitely an advocate of privacy-enhancing technologies, and I'm excited to hear how this conversation went. What did you guys talk about?

Alexandra: Plenty of things. First, I found out how Jim's passion for mechanical engineering morphed into the opportunity to engineer data. Then, of course, we also talked a lot about the financial services industry and the current pain points and business needs that we see there, and the trends and issues that he sees, especially at our customers, in banking and insurance. This episode will be particularly interesting for those who are looking to implement synthetic data.

Alexandra: Hi, Jim. It's really great to have another MOSTLY AI colleague with me today on the Data Democratization podcast. Can you introduce yourself to our listeners and share a little bit about your background? We know each other quite well already, and I also know that as a kid you actually wanted to build parts for rockets. How did you end up in the field of synthetic data?

Jim: Sure. For me, I'm from Singapore originally. My major was electrical engineering at the National University of Singapore.

There was a lot of studying around applying engineering and computer science to building tanks and fighter jets, which as a kid I found extremely interesting and exciting. But as I grew up, I realized there are many bigger problems in the world outside of military and defense, especially in the financial services sector, where everything moving around me attaches to a story or two: getting a salary, putting my savings into investments, paying tax, going out to spend money on food and drinks without thinking about inflation. All of these everyday topics got me really excited about working in the financial services industry.

I started out my career at Credit Suisse as an IT analyst, working as a support engineer for a structured product trading desk, where we helped clients structure and execute complex structured products around the world. Then I moved on to a project management team at Singapore's sovereign wealth fund, GIC, where I worked together with risk managers. That's where I started to become a champion of technology in the financial services industry, and I began to see how much business value new technology can bring by replacing manual processes and bringing data-driven solutions to the everyday investment decisions of a fund.

Then I went on to do a master's degree in finance to connect the dots: to really connect technology applications to the core business and business models of banks and asset management firms. I had the opportunity to work at a private equity fund in Australia, looking at mid-sized real estate and FinTech investments in Australia and New Zealand. I also had the opportunity to work on project finance, financing renewable energy projects in Vietnam as well as infrastructure projects in North Africa and Central Asia under China's Belt and Road Initiative, at a merchant bank in Singapore.

Then I started to see the biggest trends in financial services: the automation of manual processes, and a new generation of technology focused on automated decision making backed by data. That led me to explore my interest in financial big data, machine learning, and artificial intelligence algorithms and applications, helping banks and asset managers rethink how they generate revenue and run their business.

Synthetic data is one of those solutions that provides a large amount of value to AI and ML algorithms without violating or revealing the privacy of the underlying customers or financial transactions, which was often the biggest challenge I faced when working on projects at banks and asset management firms. That is where I saw synthetic data as a pioneering topic, and I'd like to be a champion who spreads the word and helps banks and asset managers fully realize the business value synthetic data can bring. That's why I'm at MOSTLY AI.

Alexandra: We are very happy that you joined us with your vast background in financial innovation and technology, and that you're now on the synthetic data side since you encountered these challenges of privacy and innovation in your previous job roles. In your role here within MOSTLY AI, you talk to many of our customers in finance and hear about their needs and also their aspirations. What would you say are the priorities today in data departments of banks and also insurance companies?

Jim: Let me start with what I observed in the insurance sector. We have a lot of conversations with data scientists or principal data scientists who lead the data management departments of insurance companies. Often their biggest challenge is the profit margin of the policies they write for customers or for B2B reinsurance. For them, the cost of claims usually makes up more than 50% of the cash outflow, since each policyholder may claim during the course of the insured period.

Their biggest challenge is figuring out how to use the data they have about their customers to make the most accurate claim projections over the insured period. Any volatility around those projections eventually interferes with the accuracy of the gross premium they quote to new customers. With that in mind, they are early adopters of actuarial science and machine learning algorithms for making data-driven decisions on policy applications.
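To make the link between claim projections and premium pricing concrete, here is a minimal sketch of loss-ratio style pricing. All numbers and loading factors are illustrative assumptions, not figures from the episode:

```python
# Illustrative only: how a claims projection feeds gross premium pricing.
expected_claims = 1_200.0   # projected claim cost per policy over the insured period
expense_ratio   = 0.25      # assumed share of premium spent on acquisition/admin
profit_margin   = 0.05      # assumed target underwriting margin

# Loss-ratio style pricing: the premium must cover claims, expenses, and margin.
gross_premium = expected_claims / (1 - expense_ratio - profit_margin)
print(f"gross premium: {gross_premium:,.2f}")  # ~1,714.29

# Volatility in the claims projection flows straight into the quote:
for error in (-0.10, 0.0, 0.10):  # +/-10% projection error
    quote = expected_claims * (1 + error) / (1 - expense_ratio - profit_margin)
    print(f"{error:+.0%} claims error -> quote {quote:,.2f}")
```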

The challenge faced by leadership in these data departments then becomes: where is the data infrastructure to support that business model and deliver the most accurate claim projections, so that the business unit earns the highest profit margin without sacrificing competitiveness by overcharging a high-quality applicant?

Alexandra: Yes, I think that's one of the key priorities in the insurance sector nowadays, because it's so competitive: offering the most attractive premium prices while still protecting the profit margin. If I understood correctly, that's one of the key priorities you described.

Jim: Yes. They also discovered that there is a lot of public data, what they call features, that can be linked to the specifics of their customers and evolve into a new pricing model that becomes ever more accurate in claims projection. For example, querying a public database of weather patterns helps a home insurance company map out the height of a house, the soil type, and how likely it is to lie in a hurricane pattern or storm path.

The home insurance company can then give applicants a very accurate quote without ending up paying out more in claims than it charges, and without overcharging a customer who is unlikely to ever face a natural disaster. These climate-related claims make up about 40% of total home insurance claims. How to get access to that public data without revealing customers' private details, such as addresses or personal information, is one of the biggest challenges thought leaders in the insurance space are trying to tackle.
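A minimal sketch of that enrichment step, assuming the synthetic policyholder records keep location only at a coarse ZIP-code level; both tables and all column names are hypothetical:

```python
import pandas as pd

# Hypothetical synthetic policyholder records: no real addresses, but the
# generator has preserved location at ZIP level plus structural features.
policies = pd.DataFrame({
    "policy_id":      [1, 2, 3],
    "zip_code":       ["33101", "73301", "99501"],
    "house_height_m": [6.5, 4.0, 5.2],
    "soil_type":      ["sand", "clay", "loam"],
})

# Hypothetical public hazard dataset keyed on the same coarse geography.
hazards = pd.DataFrame({
    "zip_code":        ["33101", "73301", "99501"],
    "hurricane_freq":  [0.30, 0.05, 0.00],   # storms per year, illustrative
    "storm_path_risk": [0.8, 0.2, 0.1],
})

# Because the synthetic records carry no real identities, this join can
# happen outside the production environment without exposing customers.
enriched = policies.merge(hazards, on="zip_code", how="left")
print(enriched)
```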

Alexandra: I can imagine. And if you manage to improve even a little bit in this regard and connect those dots, it can really translate into tremendous profit gains.

Jim: Yes. This is what we observed in the insurance sector. When it comes to banks, especially retail banks, the biggest challenge these days is that interest rates in developed markets like Europe or America are very, very low, if not negative. This really hurts the interest margin of the lending business, which makes up a significant part of their revenue.

The challenge is how to increase the return on risk-weighted assets and extract the most lifetime revenue from a loan, while setting aside or locking up the risk-weighted capital required by the Basel regulations. Another part of it is the large volume of payment transactions flowing between retail customers and the small merchants who are the banks' business customers. How does the bank detect suspicious transactions, and how does it prevent them? That is another big topic for banking executives.

Alexandra: Okay, so many things going on: optimizing prices, optimizing profits, and fraud detection. How can synthetic data help?

Jim: What we observe is that the lending business usually involves risk provisions, and those provisions have a variety of contributing factors, such as funding rates from the treasury department or a credit risk charge that usually comes out of credit risk modeling. With mortgage and auto lending, which have long durations, usually 10 to 30 years for a housing loan, a bank receives a lot of loan applications, so it holds a large amount of customer data.

As of today, a lot of these credit risk provisions are calculated based on rules derived from financial theory. However, with large amounts of financial big data, there is a growing trend to derive credit risk or credit scores from personal information, using data-driven machine learning algorithms to provide a second opinion that complements the rule-based credit risk calculation.

If a customer has very high earning power and a lot of assets in their investment portfolio, and a data-driven machine learning solution trained on synthetic data gives an accurate estimation of the credit risk, the price for this customer can be lower. Compared to a bank that doesn't use a data-driven solution, or whose machine learning was trained on a less accurate dataset, this bank will be able to win that customer with more accurate pricing.
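As a rough illustration of that pipeline, here is a hedged sketch, not MOSTLY AI's actual method: a logistic model estimates probability of default from applicant features, and the estimate feeds a risk-based rate. The data-generating process, features, and spread formula are all invented for the example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for a synthetic loan-application dataset: income, assets,
# and a historical default flag (an assumed relationship, for illustration).
n = 5_000
income = rng.lognormal(mean=11.0, sigma=0.5, size=n)   # annual income
assets = rng.lognormal(mean=12.0, sigma=1.0, size=n)   # portfolio value
risk = 1 / (1 + np.exp(3 * (np.log(income) - 11) + (np.log(assets) - 12)))
default = (rng.random(n) < 0.1 * risk).astype(int)     # default risk falls with wealth

X = np.column_stack([np.log(income), np.log(assets)])
model = LogisticRegression().fit(X, default)

# Probability of default for a high-earning, asset-rich applicant...
applicant = np.array([[np.log(250_000.0), np.log(1_000_000.0)]])
p_default = model.predict_proba(applicant)[0, 1]

# ...feeds a risk-based spread on top of the funding rate (assumed linear charge).
funding_rate = 0.015
rate = funding_rate + 0.05 * p_default
print(f"estimated PD {p_default:.3%} -> quoted rate {rate:.3%}")
```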

Alexandra: So basically, one of their key priorities is to unlock their production data in a synthetic way, so that it's privacy-compliant to use for the machine learning and advanced analytics being done on this data. Correct?

Jim: That's right. Synthetic data produced from production data is very rich training material for an accurate machine learning model. What we observed is that only about 40% of customers give consent for their production data to be used for machine learning or AI model building. Working with only a fraction of the production data introduces a lot of bias. For example, the people who tend to give consent are teenage customers or customers in their mid-to-late twenties, while older populations tend to refuse.

That means the machine learning model is only accurate when making decisions about the younger customers or personas in this example; when it comes to older customers, the business cannot make any useful decisions with its machine learning models.
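The effect is easy to reproduce. Below is a small simulation, not customer data, in which a model trained only on a young, consenting subset fails on older customers, while a model trained on the full population, standing in for a synthetic copy of 100% of production data, holds up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)

# Simulated customer base: behaviour flips with age (the signal to learn).
n = 20_000
age = rng.integers(18, 80, size=n)
y = (rng.random(n) < np.where(age < 40, 0.8, 0.2)).astype(int)
X = np.column_stack([age, rng.normal(size=n)])

# Consent skews heavily toward the young, as in Jim's example.
consented = (age < 30) & (rng.random(n) < 0.6)

biased_model = LogisticRegression().fit(X[consented], y[consented])
full_model = LogisticRegression().fit(X, y)  # stand-in for the synthetic copy

older = age >= 60
print("accuracy on 60+ customers:")
print("  consent-only model:", round(accuracy_score(y[older], biased_model.predict(X[older])), 3))
print("  full-data model:   ", round(accuracy_score(y[older], full_model.predict(X[older])), 3))
```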

Alexandra: To jump in: basically they're losing out on exactly those customer profiles that are more profitable and more valuable, the ones they most need to understand and serve better. Is that correct?

Jim: That's exactly right. We all know that the majority of wealth among banking and insurance customers sits with the older population, who have more savings and bigger pensions. To capture their personal preferences and tailor a highly personalized experience for them, it is very important to use synthetic data: our banking customers can synthesize 100% of their production data for downstream AI and machine learning model training, which increases accuracy across the entire customer base.

Alexandra: That's definitely an exciting use case. What would you say so far was the biggest synthetic data success story that you were part of?

Jim: I think the biggest synthetic data success story was our work with a home insurance company in the United States, a very large Fortune 100 company that came to us with a use case. They were not allowed to do model training on real customer information. They had built a very successful model that was ready to be deployed, but due to regulation in the United States, the model was only allowed in research initiatives or collaborations with academia. To fully realize its commercial value, they were blocked, because they couldn't use customers' production data.

What they really wanted was to use this newly designed and developed model to reprice their customers' home insurance using public databases. That would have required broadcasting customers' home addresses in the public domain, which poses a large privacy risk, and US rules also didn't allow them to go down to street-level information about the homes to derive more accurate pricing.

However, with synthetic data, and based on a risk assessment and a legal opinion covering the regulation and the business impact, they realized that if they synthesized the customer data, the pricing model's value could be unlocked at a very granular level. That allowed them to use features from public databases down to street level, which gives very accurate claim projections by mapping to storm frequencies, hurricane paths, and extreme cold or heat.

All of this information was unlocked to help them price a new home insurance application much more accurately than they could at the city level in the past.

Alexandra: That's definitely a super exciting use case. We hear this from other prospects too: they're really interested in joining formerly PII data, now synthetic data, with other available data sources, whether publicly available ones or other sources within the organization, to get a more complete picture and to better personalize and deliver better services to customers. It's great to see one of these practical examples that you just shared.

With all those organizations now looking into synthetic data, seeing these success stories, and being interested in using it themselves: what do they have to have in place to embrace synthetic data? Do you see any challenges to adoption? If yes, how can organizations overcome them?

Jim: For me, the main challenge around adopting synthetic data is still unfamiliarity. Most of us already struggle with numbers like the average value of a population or a data point, and synthetic data requires some understanding of the distributions and standard deviations beyond that average.

Feeling comfortable with the privacy protection, and understanding that synthetic data is very different from traditional anonymization, which just masks or throws away columns, is the first step. Beyond that, for an organization to embrace synthetic data, there are, in my view, a few things to prepare. The first is to recognize that a huge amount of data sits within the organization with real predictive and statistical value: value that can drive analytics decisions and push AI and machine learning initiatives that transform business units from rule-based to data-driven decision making.

At the same time, the challenge for the organization is speed to data and time to market for a new business or technology initiative. The third thing I see is that most organizations face hurdles in accessing this very valuable big data. It could be GDPR in Europe, CCPA in the US, or HIPAA for healthcare datasets in the insurance sector. And in large global organizations we see data collected in Japan, Singapore, Europe, and America; to solve a problem that happens in Japan, a data scientist based in America is not even allowed to look at the data over screen sharing.

So they have to fly to Japan to actually look at the data and solve a local problem. All of these data access challenges can be resolved with synthetic data. In the previous example, the Japanese subsidiary can send synthesized data describing the data science problem back to HQ in America, where US-based data scientists can solve it and recommend a solution without flying to Japan, especially during the pandemic.

Alexandra: Sounds much more efficient, and also safer in pandemic times. To sum up: on the one hand it's the awareness level of what synthetic data is, and that it's fundamentally different from legacy anonymization techniques. It doesn't stick to the original dataset and just mask or obfuscate parts of it, which research shows can be re-identified; instead, synthetic data learns the statistical distributions within the data and creates something fundamentally new, and impossible to re-identify, from scratch.

Then, if I understood correctly, you mentioned that synthetic data is the right fit for organizations that are already a bit more digitally mature: they understand the potential of their data, they are implementing AI and advanced analytics practices across departments, and now they're running into these hurdles of not being able to access data, or having to go through lengthy processes.

Jim: That's correct. The point people often miss is that there is no one-to-one relationship between synthetic data and the original data, unlike with traditional anonymization tools.

Alexandra: Yes, absolutely. I think that's an important aspect, especially when it comes to fulfilling GDPR requirements on anonymization and truly having data assets that are out of scope of the GDPR. What would you say is the business value of using synthetic data in banking? Are there measurable gains that synthetic data enables?

Jim: Looking at specifics in banks, from what I've seen on the front line speaking to prospects, there are about three types. Business value in banking can be defined by revenue. Like I shared, in commercial lending, if a machine learning algorithm trained on synthetic data improves the accuracy of credit risk provisioning by even a little bit, and every bit counts, then for a 30-year loan of $1 million that can translate into millions of dollars once you underwrite hundreds of these loans a year.
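The back-of-envelope arithmetic behind that claim, assuming "a bit" means one basis point of pricing improvement; the loan size and tenor come from Jim's example, the volume is an assumption:

```python
# Back-of-envelope: what a one-basis-point pricing improvement is worth.
loan_size   = 1_000_000   # $1m mortgage
tenor_years = 30
bp          = 0.0001      # one basis point = 0.01%

gain_per_loan  = loan_size * bp * tenor_years   # ignoring amortisation
loans_per_year = 500                            # "hundreds of these loans", assumed

print(f"per loan over {tenor_years}y: ${gain_per_loan:,.0f}")          # $3,000
print(f"per origination year: ${gain_per_loan * loans_per_year:,.0f}") # $1,500,000
```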

In the payments sector, suspicious transactions usually make up around 1%, or slightly below, of total transactions. If you're building a machine learning algorithm to detect them, your challenge is getting data to train it. That is where synthetic data can learn from your production transactions; you can also rebalance the highly imbalanced suspicious transactions against the regular ones, so your model sees more enriched data points of the suspicious kind.
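A minimal sketch of that rebalancing step. A real synthetic data generator would create genuinely new fraud-like records; here, scikit-learn's resample-with-replacement only stands in to show the mechanics on simulated data:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(2)

# Stand-in for a synthetic transaction set: ~1% flagged as suspicious.
n = 100_000
X = rng.normal(size=(n, 5))             # transaction features (simulated)
y = (rng.random(n) < 0.01).astype(int)  # ~1% suspicious

X_fraud, X_ok = X[y == 1], X[y == 0]

# Upsample the suspicious class to roughly a 1:4 ratio so the model
# sees far more suspicious examples during training.
X_fraud_up = resample(X_fraud, replace=True,
                      n_samples=len(X_ok) // 4, random_state=0)

X_bal = np.vstack([X_ok, X_fraud_up])
y_bal = np.concatenate([np.zeros(len(X_ok)), np.ones(len(X_fraud_up))])
print(f"before: {y.mean():.2%} suspicious; after: {y_bal.mean():.2%}")
```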

This is where you can prevent losses from fraudulent transactions using synthetic data. And when it comes to privacy regulation around production data, what we observe is that to use a production transaction data point, you need consent from both the sender and the receiver of the transaction, which is extremely challenging when you already struggle to get consent from a single customer. Those are some of the business values of synthetic data in banking around profit and potential fraud losses. What we also see is process optimization in business application development.

For example, imagine a capital markets application on a trading desk where listed derivatives have to be traded at very high speed, and a regulatory change or a new trading algorithm has to be built, tested, and rolled out to production for trading the next week. To improve testing quality, so the business application sees all kinds of edge cases and realistic business scenarios, synthetic data tends to outperform traditional test data, where people just create dummy data in the test environment and overlook many highly realistic edge cases.

With synthetic data, the likelihood of introducing a bug into a mission-critical business application, such as a trading platform, is much, much lower. That shortens your testing cycle and gets your trading application to market faster, so you can capture the potential revenue of a new tradable product without interrupting trading. Every day of downtime for an application on the trading floor can mean millions in lost commission revenue across all types of products.

This is where we see the testing process in the banking sector benefiting tremendously from synthetic data, especially when the application is business-driven and mission-critical.

Alexandra: Absolutely. I think synthetic data for testing will be one of the hottest topics of the next few years. I once met a C-level executive from one of the biggest banks in Europe at a conference, and he told me that simply because it solves these testing headaches, he's convinced synthetic data will be one of the most important technologies of the next 10 years. So it's definitely a big pain point in the industry.

One other aspect I would love to touch on is personalization, which also receives quite a lot of attention in the financial services industry, especially for its potential to help banks stay competitive, fight off all these new banks, and increase profit. How can synthetic data support personalization, and what would you say is the recipe for success here?

Jim: I think banks these days face quite a lot of challenges from disruptors such as the neobanks, which can ship you an ATM card within five days wherever you live in the world and charge you almost zero FX fees when you shop abroad. This really puts the traditional banks in the hot seat: they want to stand out and retain their customers, and that's where they realize a personalized banking experience is the first improvement, to retain existing customers and also to attract new customers who might otherwise go to a neobank.

A personalized banking experience, in my view, depends on a lot of behavioral patterns, because at the retail level people have different needs at different stages of life. For example, a student with a bank account is price-sensitive, so they care about the most delicious restaurant at the lowest cost, or the best bookstore for secondhand books, while a more senior customer with a professional career may care more about where to put their savings so they can comfortably retire.

Understanding the complex behavioral patterns of customers at scale is where the rule-based approach hits its limit, and data becomes extremely valuable in helping the bank personalize the experience it offers, based on what it can learn from the data of customers with different profiles. But this touches a very sensitive point. As a person, I don't want anybody to know how much savings I have, what I spend my money on, or how much wealth I earn a year.

Alexandra: I think there's actually a large-scale survey showing that financial data is the most sensitive data for individuals. People don't care as much about their healthcare or telecommunications information being leaked as they do about their financial data, so keeping it protected is really important to them.

Jim: Exactly, you're right. So how can a bank personalize its customers' experience without looking at their data? That's where synthetic data becomes vital for rolling out any personalization initiative: you can really understand the behavior of your customers without using their actual data. I think the recipe for success here is 100% coverage of the behavioral patterns, so you address bias, plus an AI and ML model that is highly accurate for the types of customers you're serving. With that, you really impress your customers with a personalized experience.

Alexandra: That's true. One of my all-time favorite examples is the George success story from Erste Group, where they managed to build one of the most customer-centric apps, one that doesn't only cater to the needs of the average customer but covers everyone from college students to top earners to retirees, thanks to utilizing synthetic data in the development process.

And since George is not only the most downloaded but also the highest-rated digital banking app in its target markets, this really proves the impact synthetic data can have. Maybe before we come to an end: what are your dreams for the future of banking, for your own future, and for the future of synthetic data? And do you think synthetic data will one day power the rocket you wanted to build as a child?

Jim: For me, building a fighter jet or a tank is probably not what synthetic data is directly relevant to, because you cannot touch it, feel it, or see it. But with my passion for the financial services industry, I do see a bright future backed and supported by synthetic data. Having worked in financial services, my dream for the future of my banking experience is to be fully empowered as a customer, with accurate and relevant information, to make smart and fair decisions by myself.

In my parents' generation, you'd go to a bank and ask for advice and recommendations on a savings or retirement product. I just want to be given the facts relevant to my situation, so I can decide for myself what solutions I want and what kind of product I need. I would also like a personalized, automated system that fulfills my daily banking needs, so I don't always have to press zero on the dial pad because, sorry, your phone system is not personalized enough for what I need and I have to speak to a customer service agent.

Hopefully, in the future, machine learning and AI models will be trained so well, on synthetic data with full coverage, that the capabilities built in for customers are fully automated, and nobody ever needs to call a customer service agent for anything AI can do.

Alexandra: I'm with you on that. I also don't want to, and don't have the time to, go to the bank all the time; I prefer to handle things in my banking app. And then, of course, if I have questions... we also talked about robo-advisors in the past, which I think is a quite promising topic. Can you quickly explain to our listeners why robo-advisors are so promising and relevant nowadays?

Jim: Yes, sure. A robo-advisor is essentially like a financial planner who introduces a range of investment products they think suit your needs, and you decide whether to purchase. Instead of a lengthy conversation in which the financial planner works out what is suitable for you, a robo-advisor learns about your income level, family situation, travel patterns, and personal habits, and then works out the best investment product for you based on current market conditions, macro factors, and your aspirations.

It's really powerful in that a robo-advisor can learn from far more data points and draw more successful conclusions, based on the historical outcomes in a very large database, than an average personal financial planner, who may be biased or limited to certain opinions. We also know that, on average, an active fund manager has an active return of zero. So for any fund manager with positive active returns, there's a 50% chance it's luck and a 50% chance it's actual skill.

For the average person or retail consumer, who doesn't have access to top fund managers at the most successful hedge funds, pension funds, or mutual funds, a robo-advisor may be just as good as the average mutual fund manager they can access. That's why I believe a robo-advisor, if its machine learning model is accurate and backed by the right data, will be just as good as speaking to the financial planner around your neighborhood.

Alexandra: Or even better in the future. I think it's a super promising value proposition: more convenience, faster service, and a better return on investment, both for you and for the bank. I'm definitely looking forward to my own robo-advisor. Jim, as we're coming to an end, what final message would you like to leave our listeners with, especially the data scientists and engineers at banks and insurance companies who might currently be looking into synthetic data?

Jim: I'd like them to remember that synthetic data is not just a research topic. Our solution is running in production at our customers in the banking and insurance sectors, where it's being applied to solve business problems around insurance premium pricing as well as designing personalized banking experiences in mobile apps. I'd encourage them to identify business problems tied to their financial big data, or to data that is blocked by privacy regulations, and then to try synthetic data out and better understand what it offers. For any additional details or questions, I'm more than happy to talk.

Alexandra: Perfect, I'm sure you will. Everybody who wants to reach out: just go to mostly.ai/contact and you can approach Jim to learn more about synthetic data and how it can help you achieve your business objectives. Jim, thanks a lot for taking the time. I think this will definitely be one of the episodes we point prospects toward when they want to learn how synthetic data can actually help reach business outcomes in the insurance and banking industries. Super insightful.

Jim: My pleasure. Thanks for having me.

Jeffrey: That was a pretty cool behind-the-scenes glimpse into synthetic data engineering. I'm always amazed at the everyday obstacles data scientists and analysts face in their day-to-day work. Flying across the world to access data is one of my favorite examples of how difficult things can get with all of today's compliance regulations tied up around data sharing.

Alexandra: Yes, having to fly to another part of the world definitely does not sound like a super-efficient approach. I really liked today's episode. Jim gave us many great examples of why synthetic data is better than production data, and he shared a lot of industry-specific insights into using synthetic data in insurance and banking. I think we have really valuable takeaways here. Shall we sum them up?

Jeffrey: Yes, absolutely. Let's start with some examples from the insurance sector. Data scientists working at insurance companies are trying to increase profit margins, improve policies, and accurately project claims. Linking public data can improve the accuracy of their pricing models. For example, weather patterns can help analysts better tune pricing models for homeowners' insurance.

Alexandra: That's right, but getting access to public data without revealing sensitive information about customers, such as home addresses, is very challenging for data scientists working on insurance projects. Synthetic data, as Jim explained, can help data scientists to link customer data to public datasets in a privacy-compliant manner.

Jeffrey: Yes. Then in the retail banking space, some challenges include low interest rates that hurt the interest margin of the lending business. Some important questions in the banking industry revolve around how to increase the return on risk-weighted assets, how to maximize the lifetime revenue from a loan, and how a bank detects suspicious transactions and prevents fraud. The answers to these questions, and banks' ability to succeed, depend very much on the quality of the training data for their machine learning models. This is where, as Jim shared, synthetic data and other privacy-enhancing methods come into play.

Alexandra: Correct. For example, mortgage lending can benefit from credit risk calculations based on personal data and machine learning, complementing the traditional rule-based credit risk assessments. Using synthetic data enables accurate estimation of credit risk, which gives a bank the opportunity to price mortgages lower than banks not using machine learning, or banks whose models were trained on less accurate data. Jim's conclusion was that accurate pricing of retail banking products like mortgages gives banks a competitive edge.

Jeffrey: Customer consent is the next big topic where synthetic data comes in handy. Companies tell us that only around 40% of their customers give consent to use their data, which introduces potential bias, and the lack of data hinders machine learning training. Jim shared that for transaction data in banking, data access is especially challenging, since both the sender and the receiver have to provide consent, which, if you think about it, is nearly impossible to obtain.

He also shared that synthetic data allows banks to use their customer data to train these machine learning algorithms while incorporating all types of customer personas in the training data, which increases the accuracy of the models for the entire customer base.

Alexandra: Yes, it's definitely impressive what you can do with that. Then we also had a very exciting synthetic data case study from the insurance sector. A Fortune 100 insurance company built a great machine learning model that was ready to be deployed, but due to regulations in the United States, they could not realize its commercial value. The goal was to reprice customers' home insurance using a publicly available database. However, privacy issues prevented them from using home addresses with this public database, and going down to street-level data was another hurdle forbidden by regulations.

Synthetic data allowed them to finally unlock granular details and open up the data for machine learning training. The result: extremely accurate claim projections with street-level differences, instead of the city-level pricing the company had before. Joining formerly PII data with publicly available datasets to get a more comprehensive picture of the market is only possible if you can generate synthetic data from the sensitive original dataset.

Jeffrey: Jim also talked about some of the challenges customers face when approaching synthetic data. Understanding the concept is not easy and often requires a bit of learning, because synthetic data is very different from legacy anonymization techniques like masking, obfuscation, or pseudonymization. To embrace synthetic data, organizations should understand that a huge amount of data is sitting in their databases with statistical value for data analytics, AI, and machine learning, and for transforming business units from rule-based to data-driven decision-making.

Alexandra: That's right. Jim also highlighted that the business value of synthetic data is huge over the long run. Even a small improvement in model performance, for example in fraud detection, leads to significant savings over time: a fraud detection algorithm improved by even 1% has a significant business impact. And accessing data that would not be available without a synthetic alternative, as with transaction data in banking, creates additional value for organizations.

Jeffrey: Yes. Process optimization is another important area where synthetic data has a huge business impact. To improve the quality of business application testing in a realistic environment with synthetic edge cases, organizations can use synthetic copies of their relevant production data; shortening testing cycles is another important goal synthetic data serves.

Alexandra: You're right. And we all know how important personalization is nowadays. Providing personalized banking experiences is therefore high on the agenda for retail banks. Rule-based approaches, however, fail to capture the complex behavioral patterns of customers. The recipe for success with personalization is 100% coverage of behavioral patterns, made available to your AI and machine learning initiatives, so you can really serve the types of customers you have and provide personalized offerings for them.

Jeffrey: Totally. These are definitely exciting times for those working with data. I think that privacy-enhancing technologies like homomorphic encryption, federated learning, and as Jim pointed out, synthetic data will revolutionize the way we access insights and digitally transform the way our organizations run.

Alexandra: Yes, absolutely. I think so too. I'm really looking forward to hearing more of these synthetic data success stories from Jim, and also the other members of our team.

Jeffrey: Stay safe, and we'll see you next time.

Alexandra: See you.

Ready to try synthetic data generation?

The best way to learn about synthetic data is to experiment with synthetic data generation. Try it for free or get in touch with our sales team for a demo.