Synthetic data for strategy, innovation, and governance — the 5 most common business problems we can solve

Alexandar Ivanovski

Working within the sales team allows me to speak with a wide range of companies across a broad spectrum of industries. Data protection and data innovation are the main concerns I encounter. In this blog post, I’ll give an overview of the most common data and privacy business problems I hear about from our clients. The general trend is that companies want to access the vast amount of data that can inform strategic decisions and improve their services and products. Two themes keep recurring: having a strong data protection framework, and being more data-driven and innovative. On the surface, the two appear antithetical.

The data privacy vs. utility trade-off

We refer to this as the privacy vs. utility trade-off. Until a few years ago, customers’ data was protected either by pseudonymization or by some other form of data masking. These classic anonymization methods not only destroy the utility of the data set, they also endanger privacy, because masked records can often still be re-identified. MOSTLY AI is challenging this dangerous status quo by providing synthetic data alternatives that help companies become more innovative and data-driven. We are taking on the privacy-utility trade-off.
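To make the re-identification risk of masked data more concrete, here is a minimal Python sketch. It is an illustration only, not how MOSTLY AI measures privacy: the table, the column names, and the `unique_share` helper are all made up for this example. The idea is that even when names are replaced by pseudonyms, a combination of leftover attributes (so-called quasi-identifiers) can still single out individual records.

```python
from collections import Counter

# Hypothetical "masked" customer table: names are replaced by pseudonyms,
# but quasi-identifiers (zip code, birth year, gender) are left intact.
masked_rows = [
    {"id": "u1", "zip": "1010", "birth_year": 1985, "gender": "F"},
    {"id": "u2", "zip": "1010", "birth_year": 1985, "gender": "F"},
    {"id": "u3", "zip": "1020", "birth_year": 1990, "gender": "M"},
    {"id": "u4", "zip": "1030", "birth_year": 1972, "gender": "F"},
    {"id": "u5", "zip": "1030", "birth_year": 1972, "gender": "M"},
]

# Assumed set of quasi-identifiers for this illustration.
QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")


def unique_share(rows, keys=QUASI_IDENTIFIERS):
    """Return the share of rows whose quasi-identifier combination is unique."""
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    unique = sum(1 for row in rows if combos[tuple(row[k] for k in keys)] == 1)
    return unique / len(rows)


print(f"{unique_share(masked_rows):.0%} of masked rows are unique")
```

Every row counted as unique here could be linked back to a person by anyone who knows that person's zip code, birth year, and gender, which is why masking the name alone gives only an illusion of privacy.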

The most common business problems synthetic data can solve

#1 “Privacy gaps and the risk of re-identification are real threats”

With stringent regulations such as GDPR and CCPA, it is of the utmost importance for organizations to have a strong data protection strategy in place. So is the watertight protection of sensitive personal information. The banking, insurance, healthcare, and pharma sectors take this especially seriously because they know how unpredictably data breaches can strike and what consequences they can have. At the same time, companies need to use this data to keep their operations going and improve their services. They try to reconcile the two, and what they often end up with is an illusion of privacy provided by less-than-secure, outdated anonymization techniques. Although privacy budgets doubled in 2020 to an average of $2.4 million, privacy departments often still lack the technological competencies necessary to assess data privacy risks and to use privacy-tech solutions to offer meaningful access to data.

Synthetic data changes the way things are done: it allows companies to share synthetic data sets that still provide the full picture, but without the sensitive information. This helps companies pursue their objectives while mitigating the risk of sensitive data being leaked and re-identified.

We often ask clients to try to think of data as a product they should build and sell across their organizations. As with all products, the method of manufacturing is what makes that product safe to use. According to Gartner, 59% of privacy incidents originate with an organization’s own employees, and although data literacy certainly helps, the goal of every organization should be to provide safe-to-use data products in the first place. Synthetic data products are safe to use in all downstream tasks, from analytics to testing and AI training.

#2 “Getting access to data takes time”

We all know that the sensitive information companies hold needs to be protected, which is done by implementing a strict data governance policy with checks and balances in place. It’s an important function, but it also means that getting access to data internally can take a while. As a result, projects take longer or are even killed, which causes frustration.

Customer data needs to be used across many departments. Product wants to analyze customer data so that more customer-centric products can be made. QA needs data that realistically mimics customer data to test applications and ensure that all edge cases are covered. Data, BI, and Analytics need to analyze the data to produce findings that assist management in making strategic decisions. In short, the internal demand for data is significant.

This is where synthetic data has really helped our clients. They were able to decrease their time to data dramatically through our MOSTLY GENERATE platform. Synthetic data sandboxes can even speed up traditionally cumbersome processes, such as POC evaluations in which potentially sensitive data needs to be shared with third parties. Once the data was synthesized, Data Governance was satisfied that it adhered to data privacy legislation and cleared it for use. This meant that projects didn’t stall or lose momentum.

#3 “I’m trying to scale AI, but don’t have the right data”

Most companies that we deal with sit on a huge amount of sensitive data. They know how important this data is and want to share it internally to improve access to information within the organization. AI adoption is especially fraught with data access issues.

The problem is that most data is stored away in siloed warehouses that require a lengthy internal process to access. Data issues are the main reason why companies fail to implement AI successfully. The data provisioning overhead is also staggering: data scientists spend most of their time cleaning and organizing data instead of using it.

Synthetic data is more than just a privacy-safe data alternative for AI training. We’ve helped customers augment their data for AI training by synthesizing better-than-real datasets. The result is privacy-compliant AI, which performs better than models trained on production data. Using synthetic data for fraud detection is typically one of those use cases where even a few percentage points of performance improvement can result in huge savings.

Biased data gets a lot of companies into trouble when AI starts learning discriminatory patterns from imbalanced datasets. Synthetic data provides a fair solution and allows models to learn about a doctored, bias-free reality. What’s more, synthetic data can serve as a window into the souls of AI algorithms and is expected to play an important role in Explainable AI. With the recent AI regulation proposal from the EU, high-risk AI systems, such as HR software, will be subject to strict regulations demanding high quality of the datasets and regulatory oversight. Synthetic training data will be a crucial ingredient to compliance.

#4 “We are striving for a more data-driven culture”

This is one of the most common statements I come across. What we see is that all companies have their own techniques and strategies meant to drive this cultural change. Implementation is where it gets difficult, on both a macro and a micro level. The first step towards data literacy, sharing sensitive data, is difficult in itself, and the time it takes to get approval means that these projects fall behind.

The difficulty we see is that companies are restricted in how they can use this data internally and need to apply some form of data masking before they can actually work with it. This destroys its utility, and the data can still be re-identified. We think that data literacy needs a revamp. Organizing datathons is a great way to put people in touch with what’s being measured, drive innovation, and increase data literacy. Using synthetic data sandboxes, you can maximize impact and even open up these events to external talent and academia.

#5 “We want to use real data in testing environments but have trouble gaining access to this data”

We see many companies that want to use real production data in non-production environments, such as QA and testing. We can’t blame them either: they want accurate data to test their applications, and we all appreciate how difficult this process can be. Some try to create their own solution or MVP, but that doesn’t yield the results they want. The data needs to be realistic to be useful in testing environments.

What’s more, most companies use partially masked data for testing, exposing production data and their customers’ privacy in the process. Production data has no place in testing, no matter how scrambled or pseudonymized it is. The only safe and GDPR-compliant way forward is to go synthetic, and those who act against the inertia of embedded bad practices will emerge as the winners, gaining a competitive edge through innovation.

The need for test data is one of the main reasons why clients come to us, wanting to solve this issue by using realistic and safe synthetic data through our MOSTLY GENERATE platform. They have seen improvements in the testing of their products and have reduced the time spent manually recreating their data sets with dummy data internally. They are able to obtain realistic synthetic data within a short period of time once the original dataset has been provided. The resulting highly realistic synthetic version allows companies to develop data-driven digital products, ready to serve real customer needs from day one.

Do you have a question? I’m happy to talk synthetic data with you. Please feel free to contact us at hello@mostly.ai to learn more.