
The European Union’s Artificial Intelligence Act (“AI Act”) is likely to have a profound impact on the development and utilization of artificial intelligence systems. Anonymization, particularly in the form of synthetic data, will play a pivotal role in establishing AI compliance and addressing the myriad challenges posed by the widespread use of AI systems.

This blog post offers an introductory overview of the primary features and objectives of the draft EU AI Act. It also elaborates on the indispensable role of synthetic data in ensuring AI compliance with data management obligations, especially for high-risk AI systems.

Notably, we focus on the European Parliament’s report on the proposal for a regulation of the European Parliament and the Council, which lays down harmonized rules on Artificial Intelligence (the Artificial Intelligence Act) and amends certain Union Legislative Acts (COM(2021)0206 – C9-0146/2021 – 2021/0106(COD)). It's important to note that this isn't the final text of the AI Act.

The first comprehensive regulation for AI compliance

The draft AI Act is a hotly debated legal initiative that will apply to providers and deployers of AI systems, among others. Remarkably, it is set to become the world's first comprehensive mandatory legal framework for AI. This will directly impact researchers, developers, businesses, and citizens involved in or affected by AI. AI compliance is a brand new domain that will transform the way companies manage their data.

The choice: AI compliance or costly consequences

Much like the GDPR, failure to comply with the AI Act's obligations can have substantial financial repercussions: maximum fines for non-compliance under the draft AI Act can be nearly double those under the GDPR. Depending on the severity and duration of the violation, penalties can range from warnings to fines of up to 7% of the offender's annual worldwide turnover.

Additionally, national authorities can order the withdrawal or recall of non-compliant AI systems from the market or impose temporary or permanent bans on their use. AI compliance is set to become a serious financial issue for companies doing business in the EU.

Risk-based classification

The draft EU AI Act operates on the principle that the level of regulation should align with the level of risk posed by an AI system. It categorizes AI systems into four risk categories:

- Unacceptable risk: AI practices considered a clear threat to people's safety or fundamental rights, which are banned outright.
- High risk: AI systems that must meet strict obligations before they can be placed on the market.
- Limited risk: AI systems subject primarily to transparency obligations.
- Minimal risk: AI systems that remain largely unregulated.

Synthetic solutions for AI compliance

The draft AI Act focuses its regulatory efforts on high-risk AI systems, imposing numerous obligations on them. These obligations encompass ensuring the robustness, security, and accuracy of AI systems. They also mandate the ability to correct or deactivate the system in case of errors or risks, the implementation of human oversight and intervention mechanisms to prevent or mitigate harm or adverse impacts, and a number of additional requirements.

Specifically, under the heading “Data and data governance”, Art. 10 sets out strict quality criteria for training, validation and testing data sets (“data sets”) used as a basis for the development of “[h]igh-risk AI systems which make use of techniques involving the training of models with data” (which likely encompasses most high-risk AI systems).

According to Art 10(2), the respective data sets shall be subject to appropriate data governance and management practices. This includes, among other things, an examination of possible biases that are likely to affect the health and safety of persons, negatively impact fundamental rights, or lead to discrimination (especially with regard to feedback loops), and requires the application of appropriate measures to detect, prevent, and mitigate possible biases. Not surprisingly, AI compliance will start with the underlying data.

Pursuant to Art 10 (3), data sets shall be “relevant, sufficiently representative, appropriately vetted for errors and as complete as possible in view of the intended purpose” and shall “have the appropriate statistical properties […]”.

Art 10(5) specifically stands out in the data governance context, as it contains a legal basis for the processing of sensitive data, as protected, among other provisions, by Art 9(1) GDPR: Art 10(5) entitles high-risk AI system providers, to the extent that is strictly necessary for the purposes of ensuring negative bias detection and correction, to exceptionally process sensitive personal data. However, such data processing must be subject to “appropriate safeguards for the fundamental rights and freedoms of natural persons, including technical limitations on the re-use and use of state-of-the-art security and privacy-preserving [measures]”.

Art 10(5)(a-g) sets out specific conditions which are prerequisites for the processing of sensitive data in this context. The very first condition, as stipulated in Art 10(5)(a), sets the scene: the data processing under Art 10 is only allowed if its goal, namely bias detection and correction, “cannot be effectively fulfilled by processing synthetic or anonymised data”. Conversely, if an AI system provider is able to detect and correct bias by using synthetic or anonymized data, it is required to do so and cannot rely on other “appropriate safeguards”.

The distinction between synthetic and anonymized data in the parliamentary draft of the AI Act is somewhat confusing since, considering the provision’s purpose, arguably only anonymized synthetic data qualifies as the preferred method for tackling data bias. However, since anonymized synthetic data is a sub-category of anonymized data, the differentiation between the two terms is meaningless, unless the EU legislator intends to highlight synthetic data as the preferred form of anonymized data (in which case the text of the provision should arguably read “synthetic or other forms of anonymized data”).

Irrespective of such details, it is clear that the EU legislator requires the use of anonymized data as the primary tool for bias detection and correction when sensitive data is processed. It looks like AI compliance cannot be achieved without effective and AI-friendly data anonymization tools.

Recital 45(a) supports this (and extends the synthetic data use case to privacy protection, while also addressing AI system users instead of only AI system providers):

“The right to privacy and to protection of personal data must be guaranteed throughout the entire lifecycle of the AI system. In this regard, the principles of data minimization and data protection by design and by default, as set out in Union data protection law, are essential when the processing of data involves significant risks to the fundamental rights of individuals.

Providers and users of AI systems should implement state-of-the-art technical and organizational measures in order to protect those rights. Such measures should include not only anonymization and encryption, but also the use of increasingly available technology that permits algorithms to be brought to the data and allows valuable insights to be derived without the transmission between parties or unnecessary copying of the raw or structured data themselves.”

The inclusion of synthetic data in the draft AI Act is a continuation of the ever-growing political awareness of the technology’s potential. This is underlined by a recent statement made by the EU Commission’s Joint Research Centre: “[Synthetic data] not only can be shared freely, but also can help rebalance under-represented classes in research studies via oversampling, making it the perfect input into machine learning and AI models."

Synthetic data is set to become one of the cornerstones of AI compliance in the very near future.



Why should you care about data anonymization tools?

Data anonymization tools can be your best friends or your data quality’s worst enemies. Sometimes both. Anonymizing data is never easy.

You try to do your best and use data anonymization tools on a daily basis. You have removed all sensitive information, masked the rest, and randomized for good measure. So, your data is safe now. Right? 

As the Austrians—Arnold Schwarzenegger included—say: Schmäh! Which roughly translates as bullshit. Why do so many data anonymization efforts end up being Schmäh?

Data anonymization tools: What are they anyway?

Data anonymization tools conveniently automate the process of data anonymization with the goal of making sure that no individual included in the data can be re-identified. The most ancient of data anonymization tools, namely aggregation and the now obsolete rounding, were born in the 1950s. The concept of adding noise to data as a way to protect anonymity entered the picture in the 1970s. We have come a long way since then. Privacy-enhancing technologies were born in the 90s and have been evolving since, offering better, safer, and more data-friendly data anonymization tools. 

Data anonymization tools must constantly evolve since attacks are also getting more and more sophisticated. Today, new types of privacy attacks using the power of AI can re-identify individuals in datasets that were thought to be anonymous. Data privacy is a constantly shifting field with lots of moving targets and constant pressure to innovate.

Data anonymization tools: How do they work?

Although a myriad of data anonymization tools exist, we can differentiate between two groups of data anonymization tools based on how they approach privacy in principle. Legacy data anonymization tools work by removing or disguising personally identifiable information, or so-called PII. Traditionally, this means unique identifiers, such as social security numbers, credit card numbers, and other kinds of ID numbers.

The trouble with these types of data anonymization tools is that no matter how much of the data is removed or modified, a 1:1 relationship between the data subject and the data points remains. With advances in AI-based re-identification attacks, it is getting increasingly easy to find this 1:1 relationship, even in the absence of obvious PII pointers. Our behavior—essentially a series of events—is almost like a fingerprint. An attacker doesn’t need to know my name or social security number if there are other behavior-based identifiers that are unique to me, such as my purchase history or location history. As a result, state-of-the-art data anonymization tools are needed to anonymize behavioral data.

Which data anonymization approaches can be considered legacy?

Legacy data anonymization tools are often associated with manual work, whereas modern data privacy solutions incorporate machine learning and AI to achieve more dynamic and effective results. But let's have a look at the most common forms of traditional anonymization first.

1. What is data masking?

Data masking is one of the most frequently used data anonymization approaches across industries. It works by replacing parts of the original data with asterisks or another placeholder. Data masking can reduce the value or utility of the data, especially if it's too aggressive. The data might not retain the same distribution or characteristics as the original, making it less useful for analysis.

The process of data masking can be complex, especially in environments with large and diverse datasets. The masking should be consistent across all records to ensure that the data remains meaningful. The masked data should adhere to the same validation rules, constraints, and formats as the original dataset. Over time, as systems evolve and new data is added or structures change, ensuring consistent and accurate data masking can become challenging.

The biggest challenge with data masking is deciding what to actually mask. Simply masking PII from data using Python, for example, still has its place, but the resulting data should not be considered anonymized by any stretch of the imagination. The problem is quasi-identifiers (combinations of attributes in the data) that, if left unprocessed, still allow re-identification in a masked dataset quite easily.
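
For illustration only (the column names and values below are made up), a minimal pandas sketch of this kind of masking could look like this:

import pandas as pd

# toy data, invented for this example
df = pd.DataFrame({
    "Email": ["alice@example.com", "bob@example.com"],
    "SSN": ["123-45-6789", "987-65-4321"],
})

# mask all but the last four characters of the SSN
df["SSN"] = df["SSN"].str[:-4].str.replace(r"\d", "*", regex=True) + df["SSN"].str[-4:]

# mask the local part of the email address, keeping the domain
df["Email"] = df["Email"].str.replace(r"^[^@]+", "****", regex=True)

print(df)

Note that the quasi-identifiers discussed above would be left completely untouched by a script like this.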

2. What is pseudonymization?

Pseudonymization is, strictly speaking, not an anonymization approach, as pseudonymized data is not anonymous data. However, it is very common, so we explain it here. Pseudonymization replaces private identifiers with fake identifiers (pseudonyms) or removes private identifiers altogether. While the data can still be matched with its source when one has the right key, it can't be matched without it. The 1:1 relationship remains and can be recovered not only by accessing the key but also by linking different datasets. The risk of reversibility is always high, and as a result, pseudonymization should only be used when it’s absolutely necessary to re-identify data subjects at a certain point in time.

The pseudonyms typically need a key for the transformation process. Managing, storing, and protecting this key is critical. If it's compromised, the pseudonymization can be reversed. 

What’s more, under GDPR, pseudonymized data is still considered personal data, meaning that data protection obligations continue to apply.

Overall, while pseudonymization might be a common practice today, it should only be used as a stand-alone tool when absolutely necessary. Pseudonymization is not anonymization and pseudonymized data should never be considered anonymized.
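
A minimal sketch of what pseudonymization with a separate key might look like in pandas (names and values are made up for this example):

import uuid
import pandas as pd

# toy data, invented for this example
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Salary": [52_000, 61_000]})

# replace the direct identifier with a random pseudonym
key = {name: uuid.uuid4().hex for name in df["Name"].unique()}
pseudonymized = df.assign(Name=df["Name"].map(key))

# the mapping (the "key") must be stored and protected separately:
# anyone who obtains it can reverse the process, which is why pseudonymized data remains personal data
reverse_key = {pseudonym: name for name, pseudonym in key.items()}

print(pseudonymized)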

3. What is generalization and aggregation?

This method reduces the granularity of the data. For instance, instead of displaying an exact age of 27, the data might be generalized to an age range, like 20-30. Generalization causes a significant loss of data utility by decreasing data granularity. Over-generalizing can render data almost useless, while under-generalizing might not provide sufficient privacy.

You also have to consider the risk of residual disclosure. Generalized data sets might contain enough information to infer details about individuals, especially when combined with other data sources.

4. What is data swapping or perturbation?

Data swapping or perturbation describes the approach of replacing original data values with values from other records. The privacy-utility trade-off strikes again: perturbing data leads to a loss of information, which can affect the accuracy and reliability of analyses performed on the perturbed data. At the same time, the privacy protection achieved is not very high. Protecting against re-identification while maintaining data utility is challenging. Finding the appropriate perturbation methods that suit the specific data and use case is not always straightforward.
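
As a rough illustration of swapping (not tied to any particular tool, and with invented data), the values of one attribute can simply be permuted across records:

import numpy as np
import pandas as pd

# toy data, invented for this example
df = pd.DataFrame({
    "Customer": ["Alice", "Bob", "Carol", "Dave"],
    "ZIP": ["10001", "10002", "11201", "11375"],
    "Spend": [120, 80, 95, 300],
})

# swap (permute) the ZIP codes across records so they no longer line up with the customer they came from
rng = np.random.default_rng(42)
df["ZIP"] = rng.permutation(df["ZIP"].to_numpy())

print(df)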

5. What is randomization?

Randomization is a legacy data anonymization approach that changes the data to make it less connected to a person. This is done through adding random noise to the data.

Some data types, such as geospatial or temporal data, can be challenging to randomize effectively while maintaining data utility. Preserving spatial or temporal relationships in the data can be complex.

Selecting the right approach (i.e. what variables to add noise to and how much) to do the job is also challenging since each data type and use case could call for a different approach. Choosing the wrong approach can have serious consequences downstream, resulting in inadequate privacy protection or excessive data distortion.

Data consumers could be unaware of the effect randomization had on the data and might end up with false conclusions. On the bright side, randomization techniques are relatively straightforward to implement, making them accessible to a wide range of organizations and data professionals.

6. What is data redaction?

Data redaction is similar to data masking, but in the case of this data anonymization approach, entire data values or sections are removed or obscured. Deleting PII is easy to do. However, it’s a sure-fire way to encounter a privacy disaster down the line. It’s also devastating for data utility since critical elements or crucial contextual information could be removed from the data. 

Redacted data may introduce inconsistencies or gaps in the dataset, potentially affecting data integrity. Redacting sensitive information can result in a smaller dataset. This could impact statistical analyses and models that rely on a certain volume of data for accuracy.
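
A minimal, made-up pandas sketch of redaction, where entire columns or individual values are dropped or blanked out:

import pandas as pd

# toy data, invented for this example
df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Diagnosis": ["flu", "asthma"],
    "Notes": ["called on 2023-01-02", "no-show"],
})

# redact an entire column by dropping it...
redacted = df.drop(columns=["Name"])

# ...or blank out individual values that are considered sensitive
redacted.loc[:, "Notes"] = "[REDACTED]"

print(redacted)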

Next-generation data anonymization tools

The next-generation data anonymization tools, or so-called privacy-enhancing technologies, take an entirely different, more use-case-centered approach to data anonymization and privacy protection.

1. Homomorphic encryption

The first group of modern data anonymization tools works by encrypting data in a way that allows for computational operations on encrypted data. The downside of this approach is that the data, well, stays encrypted, which makes it very hard to work with if it was previously unknown to the user. You can't perform exploratory analyses on encrypted data, for example. In addition, it is computationally very intensive and, as such, not widely available and cumbersome to use. As the price of computing power decreases and capacity increases, this technology is set to become more popular and easier to access.
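
To give a feel for the idea, here is a tiny sketch that assumes the open-source python-paillier package (phe) is installed; it only illustrates additively homomorphic encryption, not a production setup:

from phe import paillier  # assumes the python-paillier package is installed

public_key, private_key = paillier.generate_paillier_keypair()

# two sensitive values, encrypted before leaving the data owner
enc_a = public_key.encrypt(120)
enc_b = public_key.encrypt(345)

# an untrusted party can add the ciphertexts without ever seeing the underlying values
enc_sum = enc_a + enc_b

# only the key holder can decrypt the result
print(private_key.decrypt(enc_sum))  # 465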

2. Federated learning

Federated learning is a fairly complicated approach, enabling machine learning models to be trained on distributed datasets. Federated learning is commonly used in applications that involve mobile devices, such as smartphones and IoT devices.

For example, predictive text suggestions on smartphones can be improved without sending individual typing data to a central server. In the energy sector, federated learning helps optimize energy consumption and distribution without revealing specific consumption patterns of individual users or entities. However, these federated systems require the participation of all players, which is near-impossible to achieve if the different parts of the system belong to different operators. Simply put, Google can pull it off, while your average corporation would find it difficult. 
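
A highly simplified sketch of the federated averaging idea (toy data and a plain linear model, just to show that only model weights, never raw data, leave the clients):

import numpy as np

rng = np.random.default_rng(0)

# toy setup: three "clients", each holding local data that never leaves them
clients = [
    (rng.normal(size=(50, 3)), rng.normal(size=50)),
    (rng.normal(size=(80, 3)), rng.normal(size=80)),
    (rng.normal(size=(30, 3)), rng.normal(size=30)),
]

def local_update(weights, X, y, lr=0.01, epochs=5):
    # a few steps of local gradient descent on a plain linear model
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# federated averaging: the server only ever aggregates model weights, never raw data
global_w = np.zeros(3)
for _ in range(10):
    local_weights = [local_update(global_w, X, y) for X, y in clients]
    sizes = np.array([len(y) for _, y in clients])
    global_w = np.average(local_weights, axis=0, weights=sizes)

print(global_w)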

3. Synthetic data generation

A more readily available approach is an AI-powered data anonymization tool: synthetic data generation. Synthetic data generation extracts the distributions, statistical properties, and correlations of datasets and generates entirely new, synthetic versions of said datasets, where all individual data points are synthetic. The synthetic data points look realistic and, on a group level, behave like the original. As a data anonymization tool, reliable synthetic data generators produce synthetic data that is representative, scalable, and suitable for advanced use cases, such as AI and machine learning development, analytics, and research collaborations. 
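
Production-grade generators (including MOSTLY AI's platform) rely on deep generative models and dedicated privacy mechanisms. The toy sketch below only illustrates the underlying principle of learning a distribution from real data and sampling entirely new records from it, here with nothing more sophisticated than means and covariances:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# toy "real" data, invented for this example
real = pd.DataFrame({
    "age": rng.normal(40, 10, 1_000),
    "income": rng.normal(55_000, 12_000, 1_000),
})

# "train": capture statistical properties of the real data (here just means and covariances)
mean = real.mean().to_numpy()
cov = real.cov().to_numpy()

# "generate": sample entirely new, artificial records from the learned distribution
synthetic = pd.DataFrame(rng.multivariate_normal(mean, cov, size=1_000), columns=real.columns)

# group-level statistics are preserved, but no synthetic row belongs to a real individual
print(real.describe().round(1))
print(synthetic.describe().round(1))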

4. Secure multiparty computation (SMPC)

Secure Multiparty Computation (SMPC), in simple terms, is a cryptographic technique that allows multiple parties to jointly compute a function over their private inputs while keeping those inputs confidential. It enables these parties to collaborate and obtain results without revealing sensitive information to each other.

While it's a powerful tool for privacy-preserving computations, it comes with its set of implementation challenges, particularly in terms of complexity, efficiency, and security considerations. It requires expertise and careful planning to ensure that it is applied effectively and securely in practical applications.
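
A toy sketch of the additive secret-sharing idea behind many SMPC protocols (illustrative only, not a secure implementation):

import random

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(secret, n_parties=3):
    # split a secret into additive shares; fewer than all shares reveal nothing about it
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

# two hospitals want the total number of cases without revealing their individual counts
shares_a = share(120)
shares_b = share(345)

# each party locally adds the shares it holds; only the recombined result is meaningful
summed_shares = [(a + b) % PRIME for a, b in zip(shares_a, shares_b)]
print(sum(summed_shares) % PRIME)  # 465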

Data anonymization approaches and their use cases

Data anonymization encompasses a diverse set of approaches, each with its own strengths and limitations. In this comprehensive guide, we explore ten key data anonymization strategies, ranging from legacy methods like data masking and pseudonymization to cutting-edge approaches such as federated learning and synthetic data generation. Whether you're a data scientist or a privacy officer, you will find this bullshit-free overview of their advantages, disadvantages, and common use cases very helpful.

1. Data Masking
Description: Masks or disguises sensitive data by replacing characters with symbols or placeholders.
Advantages: Simplicity of implementation; preservation of data structure.
Disadvantages: Limited protection against inference attacks; potential negative impact on data analysis.
Use cases: Anonymizing email addresses in communication logs; concealing rare names in datasets; masking sensitive words in text documents.

2. Pseudonymization
Description: Replaces sensitive data with pseudonyms or aliases, or removes it altogether.
Advantages: Preservation of data structure; data utility is generally preserved; fine-grained control over pseudonymization rules.
Disadvantages: Pseudonymized data is not anonymous data; risk of re-identification is very high; requires secure management of pseudonym mappings.
Use cases: Protecting patient identities in medical research; securing employee IDs in HR records.

3. Generalization/Aggregation
Description: Aggregates or generalizes data to reduce granularity.
Advantages: Simple implementation.
Disadvantages: Loss of fine-grained detail in the data; risk of data distortion that affects analysis outcomes; challenging to determine appropriate levels of generalization.
Use cases: Anonymizing age groups in demographic data; concealing income brackets in economic research.

4. Data Swapping/Perturbation
Description: Swaps or perturbs data values between records to break the link between individuals and their data.
Advantages: Flexibility in choosing perturbation methods; potential for fine-grained control.
Disadvantages: Privacy-utility trade-off is challenging to balance; risk of introducing bias in analyses; selection of appropriate perturbation methods is crucial.
Use cases: E-commerce; online user behavior analysis.

5. Randomization
Description: Introduces randomness (noise) into the data to protect data subjects.
Advantages: Flexibility in applying to various data types; reproducibility of results when using defined algorithms and seeds.
Disadvantages: Privacy-utility trade-off is challenging to balance; risk of introducing bias in analyses; selection of appropriate randomization methods is hard.
Use cases: Anonymizing survey responses in social science research; online user behavior analysis.

6. Data Redaction
Description: Removes or obscures specific parts of the dataset containing sensitive information.
Advantages: Simplicity of implementation.
Disadvantages: Loss of data utility, potentially significant; risk of removing contextual information; data integrity challenges.
Use cases: Concealing personal information in legal documents; removing private data in text documents.

7. Homomorphic Encryption
Description: Encrypts data in such a way that computations can be performed on the encrypted data without decrypting it, preserving privacy.
Advantages: Strong privacy protection for computations on encrypted data; supports secure data processing in untrusted environments; cryptographically provable privacy guarantees.
Disadvantages: Encrypted data cannot be easily worked with if previously unknown to the user; complexity of encryption and decryption operations; performance overhead for cryptographic operations; may require specialized libraries and expertise.
Use cases: Basic data analytics in cloud computing environments; privacy-preserving machine learning on sensitive data.

8. Federated Learning
Description: Trains machine learning models across decentralized edge devices or servers holding local data samples, avoiding centralized data sharing.
Advantages: Preserves data locality and privacy, reducing data transfer; supports collaborative model training on distributed data; suitable for privacy-sensitive applications.
Disadvantages: Complexity of coordination among edge devices or servers; potential communication overhead; ensuring model convergence can be challenging; shared models can still leak privacy.
Use cases: Healthcare institutions collaboratively training disease prediction models; federated learning for mobile applications preserving user data privacy; privacy-preserving AI in smart cities.

9. Synthetic Data Generation
Description: Creates artificial data that mimics the statistical properties of the original data while protecting privacy.
Advantages: Strong privacy protection with high data utility; preserves data structure and relationships; scalable for generating large datasets.
Disadvantages: Accuracy and representativeness of synthetic data may vary depending on the generator; may require specialized algorithms and expertise.
Use cases: Sharing synthetic healthcare data for research purposes; synthetic data for machine learning model training; privacy-preserving data sharing in financial analysis.

10. Secure Multiparty Computation (SMPC)
Description: Enables multiple parties to jointly compute functions on their private inputs without revealing those inputs to each other, preserving privacy.
Advantages: Strong privacy protection for collaborative computations; suitable for multi-party data analysis while maintaining privacy; offers security against collusion.
Disadvantages: Complexity of protocol design and setup; performance overhead, especially for large-scale computations; requires trust in the security of the computation protocol.
Use cases: Privacy-preserving data aggregation across organizations; collaborative analytics involving sensitive data from multiple sources; secure voting systems.

The best and the worst data anonymization approaches

When it comes to choosing the right data anonymization approach, we are faced with a complex problem requiring a nuanced view and careful consideration. When we put all the Schmäh aside, choosing the right data anonymization strategy comes down to balancing the so-called privacy-utility trade-off. 

The privacy-utility trade-off refers to the balancing act between data anonymization’s two key objectives: providing privacy to data subjects and utility to data consumers. Depending on the specific use case, the quality of implementation, and the level of privacy required, different data anonymization approaches are more or less suitable to achieve the ideal balance of privacy and utility. However, some data anonymization approaches are inherently better than others when it comes to the privacy-utility trade-off. High utility with robust, unbreakable privacy is the unicorn all privacy officers are hunting for, and since the field is constantly evolving with new types of privacy attacks, data anonymization must evolve too.

As it stands today, the best data anonymization approaches for preserving a high level of utility while effectively protecting privacy are the following:

Synthetic Data Generation 

Synthetic data generation techniques create artificial datasets that mimic the statistical properties of the original data. These datasets can be shared without privacy concerns. When properly designed, synthetic data can preserve data utility for a wide range of statistical analyses while providing strong privacy protection. It is particularly useful for sharing data for research and analysis without exposing sensitive information.

Privacy: high

Utility: high for analytical, data sharing, and ML/AI training use cases

Homomorphic Encryption

Homomorphic encryption allows computations to be performed on encrypted data without the need to decrypt it. This technology is valuable for secure data processing in untrusted environments, such as cloud computing. While it can be computationally intensive, it offers a high level of privacy and maintains data utility for specific tasks, particularly when privacy-preserving machine learning or data analytics is involved. Depending on the specific encryption scheme and parameters chosen, there may be a trade-off between the level of security and the efficiency of computations. Also, increasing security often leads to slower performance.

Privacy: high

Utility: can be high, depending on the use case

Secure Multiparty Computation (SMPC) 

SMPC allows multiple parties to jointly compute a function over their private inputs without revealing those inputs to each other. It offers strong privacy guarantees and can be used for various collaborative data analysis tasks while preserving data utility. SMPC has applications in areas like secure data aggregation and privacy-preserving collaborative analytics.

Privacy: High

Utility: can be high, depending on the use case

Data anonymization tools: the saga continues    

In the ever-evolving landscape of data anonymization strategies, the journey to strike a balance between preserving privacy and maintaining data utility is an ongoing challenge. As data grows more extensive and complex and adversaries devise new tactics, the stakes of protecting sensitive information have never been higher.

Legacy data anonymization approaches have their limitations and are increasingly likely to fail in protecting privacy. While they may offer simplicity in implementation, they often fall short in preserving the intricate relationships and structures within data.

Modern data anonymization tools, however, present a promising shift towards more robust privacy protection. Privacy-enhancing technologies have emerged as powerful solutions. These tools harness encryption, machine learning, and advanced statistical techniques to safeguard data while enabling meaningful analysis.

Furthermore, the rise of synthetic data generation signifies a transformative approach to data anonymization. By creating artificial data that mirrors the statistical properties of the original while safeguarding privacy, synthetic data generation provides an innovative solution for diverse use cases, from healthcare research to machine learning model training.

As the data privacy landscape continues to evolve, organizations must stay ahead of the curve. What is clear is that the pursuit of privacy-preserving data practices is not only a necessity but also a vital component of responsible data management in our increasingly vulnerable world.

At MOSTLY AI we talk about data privacy a lot. And we were even the first in the world to produce an entire rap dedicated to data privacy!

But what really is data privacy? And what is it not? This blog post aims to provide a clear understanding of the definition of data privacy, its importance, and the various measures being taken to protect it.

The data privacy definition

Data privacy, also referred to as information privacy or data protection, is the concept of safeguarding an individual's personal information from unauthorized access, disclosure, or misuse. It entails the application of policies, procedures, and technologies designed to protect sensitive data from being accessed, used, or shared without the individual's consent.

To fully understand data privacy, we thus need to understand personal information first. Personal information, often referred to as personally identifiable information (PII), is any data that can be used to identify, locate, or contact an individual directly or indirectly.

Personal information encompasses a wide range of data points, including but not limited to, an individual's name, physical address, email address, phone number, Social Security number, driver's license number, passport number, and financial account details. Moreover, personal information can extend to more sensitive data such as medical records, biometric data, race, ethnicity, and religious beliefs. In the digital realm, personal information may also include online identifiers like IP addresses, cookies, or device IDs, which can be traced back to a specific individual.

In essence, data privacy is all about the protection of personal information. Why is that important?

Why is data privacy important?

Even if you don’t care about data privacy at all, the law cares. With numerous data protection regulations and laws in place, such as the General Data Protection Regulation (GDPR) in the European Union, it is essential for organizations to adhere to these regulations to avoid legal consequences. Gartner predicts that by 2024, 75% of the global population will have its personal data covered under privacy regulations.

Many companies have realized that data privacy is not only a legal requirement but also something customers care about. In the Cisco 2022 Consumer Privacy Survey, 76 percent of respondents said they would not buy from a company they do not trust with their data. Ensuring data privacy helps maintain trust between businesses and their customers and can become an important competitive differentiator.

Data privacy is an important element of cybersecurity. Implementing data privacy measures often leads to improved cybersecurity, as organizations take steps to safeguard their systems and networks from unauthorized access and data breaches. This helps to ensure that sensitive personal information such as financial data, medical records, and personal identification details are protected from identity theft, fraud, and other malicious activities.

And in case you’re still not convinced, how about this: The right to privacy or private life is enshrined in the Universal Declaration of Human Rights (Article 12) – data privacy is a Human Right! Data privacy empowers individuals to have control over their personal information and decide how it is used, shared, and stored.

All data is personal data in today's era because it can be used to reidentify people

How to protect data privacy in an organization?

Every company and every business collects and works with data. Ensuring data privacy is not a matter of doing a single thing; it requires many measures working together.

Foremost, data privacy needs to start at the top of an organization, because leadership plays a critical role in establishing a culture of privacy and ensuring the commitment of resources to implement robust data protection measures. When executives and top management prioritize data privacy, it sends a clear message throughout the organization that protecting personal information is a fundamental aspect of the company's values and mission. This commitment fosters a sense of shared responsibility, guiding employees to adhere to privacy best practices, comply with relevant regulations, and proactively address potential risks.

Once the support from the top management is established, data privacy needs to be embedded in an organization. This is typically achieved through implementing privacy policies. Organizations should have clear privacy policies outlining the collection, use, storage, and sharing of personal information. These policies should be easily accessible and comprehensible to individuals.

These policies define certain best practices and standards when it comes to data privacy, and companies that take data privacy seriously follow them.

An entire industry has emerged around these best practices and how they can be ensured (and audited!). Regularly auditing and monitoring data privacy practices within an organization helps identify potential vulnerabilities and rectify them promptly.

The two most recognized standards and audits are ISO 27001 and SOC 2. ISO 27001 is a globally recognized standard for information security management systems (ISMS), providing a systematic approach to managing sensitive information and minimizing security risks. By implementing and adhering to ISO 27001, organizations can showcase their dedication to maintaining a robust ISMS and assuring stakeholders of their data protection capabilities.

On the other hand, SOC 2 (Service Organization Control 2) is an audit framework focusing on non-financial reporting controls, specifically those relating to security, availability, processing integrity, confidentiality, and privacy. Companies undergoing SOC 2 audits are assessed on their compliance with the predefined Trust Services Criteria, ensuring they have effective controls in place to safeguard their clients' data.

By leveraging ISO 27001 and SOC 2 standards and audits, organizations can not only bolster their internal security and privacy practices but also enhance trust and credibility with clients, partners, and regulatory bodies, while mitigating risks associated with data breaches and non-compliance penalties. We at MOSTLY AI have heavily invested in this space and are certified under both ISO 27001 and SOC 2.

Lastly, let’s turn to the human element again: the employees. Numbers floating around the Internet claim that 95% of all data breaches happen due to human error. Although the primary source for this figure is hard to pin down, the general point stands. Therefore, educating employees about data privacy best practices and the importance of protecting sensitive information plays a crucial role in preventing breaches caused by human error.

Data privacy is everyone's business

Data privacy is an essential aspect of our digital lives, as it helps protect personal information and maintain trust between individuals, businesses, and governments. By understanding the importance of data privacy and implementing appropriate measures, organizations can reduce the risk of data breaches, ensure compliance with data protection laws, and maintain customer trust. Ultimately, data privacy is everyone's responsibility, and it begins with awareness and education.

The protection of personally identifiable information (PII) has become an important concern in the data industry. As part of regular data-processing pipelines, datasets often need to be shared to be processed, for example with external clients or with cloud services for high-performance processing. During this process, it’s vital to ensure that sensitive information is not exposed. Data privacy compliance breaches can do serious harm to a company’s reputation and can result in high fines.

Data anonymization is a technique that can be used to protect personally identifiable information by removing or obfuscating sensitive data. Python is a popular programming language that can be used to perform data anonymization. In this article, we will explore four different techniques for data anonymization in Python: randomization, aggregation, masking, and perturbation.

Performing data anonymization in Python with open-source solutions can be a low-effort method for providing a basic level of privacy protection. However, there are important security tradeoffs to consider. While performing data anonymization in Python may be helpful in quick prototyping scenarios, these techniques are generally considered legacy data anonymization techniques that do not offer sufficient protection for data pipelines running in production. Fully synthetic data is the new industry-standard for production-grade data analysis.

What is data anonymization?

Data anonymization is the process of removing or obfuscating personally identifiable information from datasets. The goal of data anonymization is to protect the privacy of individuals whose data is included in the dataset. Anonymized data can be shared more freely than non-anonymized data, as the risk of exposing sensitive information is greatly reduced.

Data anonymization techniques in Python

There are several techniques that can be used for performing data anonymization in Python. These techniques include randomization, aggregation, masking, and perturbation.

1. Randomization

Randomization involves replacing sensitive data with random values. For example, a person's name might be replaced with a randomly generated string of characters. 

Let’s look at how to perform data anonymization in Python using the randomization technique. We’ll start with something simple. We’ll use Python’s built-in random library together with pandas to scramble the characters in the Name column in an attempt to obscure the identities of the people in our dataset:

import pandas as pd
import random

# Load the dataset
df = pd.read_csv('sensitive_data.csv')

df
# define a function to randomize column values
def randomize_values(col_values):
    col_values_list = list(col_values) # convert string to list
    random.shuffle(col_values_list)
    return ''.join(col_values_list) # convert list back to string

# apply the function to the desired column(s)
column_to_randomize = 'Name'
df[column_to_randomize] = df[column_to_randomize].apply(randomize_values)

df

This is clearly a very rudimentary anonymization technique. The good thing about this technique: it’s quick. The bad thing: it may fool a third-grader…but not much more. The scrambled names still contain all of their original letters (capitals included) and keep their original lengths, so it’s not very hard to imagine what the real names might be, especially if someone has prior knowledge of the people in the dataset. Of even more concern is the fact that the addresses and ages are still clearly visible. We need to do better.

Let’s expand our randomize_values function to scramble all of the columns containing strings in our dataframe. We’ll use random.choices() instead of random.shuffle() to improve our anonymization:

import string


# define a function that operates on the entire dataframe
def randomize_values(df):
    df = df.copy() # work on a copy so the original dataframe stays intact
    for column in df.columns:
        if df[column].dtype == 'O': # check if column has object dtype
            df[column] = [''.join(random.choices(string.ascii_letters + string.digits, k=10)) for _ in range(len(df))] # replace values with random 10-character strings
    return df

# apply function to dataframe
df_rand = randomize_values(df)

df_rand

This is looking much better! The downside here is that it'll be easy to lose track of who's who this way. To help with this, it’s recommended practice to create a lookup table. We will give each row a unique identifier so that we can use that as a key to look up how the anonymized rows correspond to their original entries.

Let’s add a new column UniqueID to the original DataFrame:

# add a new column with unique integer-only IDs
df['UniqueID'] = list(range(1001, 1005)) 
df

In this case, we chose to create a unique ID column of int data type so that our randomize_values function will not scramble it. For production purposes, you will probably want to build something a little more scalable and robust, for example using the Python uuid library.
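
For illustration, such IDs could also be generated with the uuid library as sketched below; note that UUID strings are stored with object dtype, so you would then also need to exclude the ID column inside randomize_values:

import uuid

# one random UUID per row as a stable lookup key
df['UniqueID'] = [uuid.uuid4().hex for _ in range(len(df))]

# randomize_values would scramble this object-dtype column too,
# so add a check like: if column != 'UniqueID' and df[column].dtype == 'O'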

We can now apply randomize_values again to get the anonymized rows with the unique IDs.

# apply function to dataframe
df_rand = randomize_values(df)

df_rand

Randomization is a relatively low-effort method to perform data anonymization in Python. It’s important to note, however, that this low-effort benefit comes with some serious tradeoffs. First of all, the data utility has decreased significantly: it’s hard to imagine running any meaningful analysis on top of the scrambled City names, for example. Secondly, there are tradeoffs to consider in terms of robustness and security. For example, if a dataset contains a small number of unique values, it may be possible to use statistical analysis to identify individuals based on the random values. 

Let’s now look at a second technique for performing data anonymization in Python: aggregation.

2. Aggregation

Aggregation involves combining data from multiple individuals to create a group-level view of the data. For example, instead of storing data for each individual separately, data might be aggregated into ranges or groups.

Let’s say we're happy with the randomization technique used above for hiding the names and addresses of the people in our dataset. However, we want to take our data anonymization one step further and also hide the numerical values. We can use Python to aggregate the numerical values, for example anonymizing the ages by grouping the individuals in our dataset into age brackets using the pandas .cut() method and specifying the bins and labels:

# Anonymize the ages by grouping them into age ranges
bins = [0, 18, 30, 45, 60, 100]
labels = ['0-18', '19-30', '31-45', '46-60', '60+']
df['Age'] = pd.cut(df['Age'], bins=bins, labels=labels)

df

We can do something similar with the salaries:

# Anonymize the salaries by grouping them into ranges
bins = [0, 30_000, 50_000, 80_000, 100_000, 200_000]
labels = ['0-30K', '31-50K', '51-80K', '81-100K', '100K+']
df['Salary'] = pd.cut(df['Salary'], bins=bins, labels=labels)

df

Excellent, it's now no longer possible to get any personally identifiable age or salary characteristics from our anonymized dataset. This was a relatively simple technique to achieve data anonymization. However, it has cost us a significant amount of granularity. In the Salary column, we now have only 2 unique values (31-50K and 81-100K) instead of the original four. This reduces the types of analysis we can run on this dataset, decreasing its data utility.

There are many other ways to achieve anonymization by aggregation in Python, for example using groupby(). The important thing to remember is that while aggregation is technically an effective technique for data anonymization, it may often not be a feasible solution for your use case, especially if your analysis requires specific levels of data granularity.
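
For example, here is a quick group-level view of average salary per age bracket (run this while Salary is still numeric, i.e. before binning it):

# group-level view instead of row-level records
salary_by_age_group = df.groupby('Age', observed=True)['Salary'].mean()

salary_by_age_group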

3. Masking

The third technique for performing data anonymization in Python is masking. Masking involves replacing sensitive data with a similar but non-sensitive value. For example, a person's name might be replaced with their initials or a pseudonym. 

In the randomization example above, we replaced people's names, cities, and street names with random characters. This is an effective anonymization technique (provided you have created a correct and securely-stored lookup table) but can make the dataset less intuitive to work with. If humans are going to be part of the data analysis process, you may want to use an anonymization technique where the anonymized contents still indicate something about the type of information they contain. Pseudonyms can be helpful for this.

Let's take a look at some Python code that uses masking to anonymize the names, cities, and street names in our dataset:

# take a look at the original dataset
df
# mask the sensitive values by using pseudonyms for the names, cities and street names
masked = df.copy()
masked['Name'] = ['Stephanie', 'Marcus', 'Yasmin', 'Oprah']
masked['City'] = ['Amsterdam', 'Zagreb', 'Houston', 'London']
masked['StreetName'] = ['Central Road', 'Independence Avenue', 'Home Path', 'Long Walk']

masked

Masking can be an effective technique for data anonymization, but it may not always provide sufficient protection. If the masked value is still unique to an individual, it may be possible to use statistical analysis to identify them. It’s also important to note that pseudonymized data of this kind is still considered personal data under the GDPR.

Masking can also be difficult to perform programmatically. In the code above, we manually entered the alternative values. This is feasible for a small toy dataset with 4 rows, but imagine having to come up with and type out pseudonyms for a dataset containing millions of rows (!) Unless you can find a programmatic way to mask the data, masking may mean trading in efficiency for human legibility. 

There are open-source Python libraries available that help you to perform this type of masking programmatically. One example is anonymizedf, which builds on pandas and faker to easily substitute original columns with masked substitutes. We’ll walk through a quick example below:

from anonymizedf.anonymizedf import anonymize

# prepare data for anonymization
an = anonymize(df)

# add masked columns
fake_df = (
    an
    .fake_names("Name", chaining=True)
    .fake_whole_numbers("Salary", chaining=True)
    .fake_whole_numbers("Age", chaining=True)
    .fake_categories("City", chaining=True)
    .fake_whole_numbers("HouseNumber", chaining=True)
    .show_data_frame()
)

# subset only anonymized columns
fake_df = fake_df[['Fake_Name', 'Fake_Salary', 'Fake_Age', 'Fake_City', 'Fake_HouseNumber', 'StreetName']]

fake_df

Anonymizedf and Faker are helpful open-source solutions that can help you perform data anonymization in Python. However, they also have their drawbacks. Being open source, these solutions come with security risks when used on production data. They are also limited in their flexibility: the an.fake_whole_numbers method, for example, simply outputs random integers between the lowest and highest value found in the original column. There is no way to control the distribution of the values in that column, which is important for downstream machine learning and other analysis projects, as we’ll see in the next section.

4. Perturbation

The fourth and final technique for performing data anonymization in Python is perturbation. Perturbation involves adding random noise to sensitive data to make it harder to recognize. For example, a person's salary might be increased or decreased by a small amount to protect their privacy. The amount of noise added can be adjusted to balance privacy with data utility. Data utility is generally a function of how well we can preserve the overall distribution in the dataset. 

Perturbation is generally only used for numerical and categorical columns. Let's take a look at an example of performing perturbation on a numerical column in Python. We'll write a function called add_noise that will use the numpy library to add noise to the Salary column. The amount of noise can be controlled using the std (standard deviation) keyword argument.

import numpy as np

def add_noise(df, column, std = None):
    if std is None:
        std = df[column].std()
    
    withNoise = df[column].add(np.random.normal(0, std, df.shape[0]))
    copy = df.copy()
    copy[column] = withNoise
    return copy

perturbation = add_noise(df, 'Salary', std=100)
perturbation

If we compare this to our original Salary values, we’ll see a minor deviation. This will likely preserve the original distribution of the dataset, but is it enough to guarantee the privacy of the individuals in our dataset? Probably not.

A toy dataset with 4 rows is not enough data to observe the effect of perturbation on the dataset’s distribution. Let’s work with a slightly larger fictional dataset that has 60 rows of data. This will allow us to clearly see the tradeoff of privacy (perturbation) and accuracy (data utility). For reproducibility of the code in this tutorial, we’ll create a larger DataFrame by simply copying the original df 15 times.

# create a large dataset
df_large = pd.concat([df, df, df, df, df, df, df, df, df, df, df, df, df, df, df])

Let’s plot a histogram of the original Salary column:

import matplotlib.pyplot as plt

# plot the distribution of the salary column using 5K bins
plt.hist(
    'Salary', 
    data=df_large, 
    bins = np.arange(start=30_000, stop=100_000, step=5_000),
)
plt.title("Original Distribution");

Now apply our add_noise function with varying degrees of noise:

df_large_pert_10 = add_noise(df_large, 'Salary', std=10)
df_large_pert_100 = add_noise(df_large, 'Salary', std=100)
df_large_pert_2000 = add_noise(df_large, 'Salary', std=2000)

And then visualize the distributions with noise:

fig, (ax1,ax2,ax3,ax4) = plt.subplots(nrows=1, ncols=4, sharey=True, figsize=(20,10))
ax1.hist(
    'Salary', 
    data=df_large, 
    bins = np.arange(start=30_000, stop=100_000, step=5_000),
)
ax1.title.set_text('Original Distribution')
ax2.hist(
    'Salary', 
    data=df_large_pert_10, 
    bins = np.arange(start=30_000, stop=100_000, step=5_000),
)
ax2.title.set_text('With Noise - std=10')
ax3.hist(
    'Salary', 
    data=df_large_pert_100, 
    bins = np.arange(start=30_000, stop=100_000, step=5_000),
)
ax3.title.set_text('With Noise - std=100')
ax4.hist(
    'Salary', 
    data=df_large_pert_2000, 
    bins = np.arange(start=30_000, stop=100_000, step=5_000),
)
ax4.title.set_text('With Noise - std=2000')

As we can see, adding noise (increasing privacy) can lead to a change in the distribution of the dataset (decreasing accuracy). Finding the perfect balance where privacy is ensured and accuracy is maintained is a difficult task to execute manually.

Automate data anonymization with MOSTLY AI

MOSTLY AI offers a fully-managed, no-code service for performing data anonymization. You can generate fully anonymized, synthetic datasets that maintain the distributions of your original dataset, striking that sweet spot between guaranteed privacy and maximum data utility. It offers built-in AI recognition of all data types and provides you with detailed reports to inspect both the utility (accuracy) and security (privacy) of your synthetic data. It takes into account any correlations between columns (both within and between related tables) and can automatically perform data augmentation techniques like imputation and rebalancing. Give it a try by signing up for a free account; we give you 100K rows of synthetic data for free, every day.

Data anonymization in Python: conclusion

Data anonymization is a critical step in protecting sensitive data and ensuring compliance with data privacy regulations. While Python provides libraries that can be leveraged to perform data anonymization, each of the four techniques presented in this blog also has serious drawbacks. They all require manual coding (and are thus sensitive to human error) and in many cases don’t actually provide the necessary level of privacy protection. That’s why performing data anonymization yourself, for example in Python, is generally considered a legacy approach that is not suitable for production environments.

Synthetic data anonymization is one of the core generative AI use cases for tabular data. Synthetic data provides solid guarantees about security and privacy protection. This type of data is completely made up and therefore carries virtually no risk of exposing any sensitive information from the original dataset. Powerful deep learning algorithms extract characteristic patterns, correlations, and structures from the original dataset and use them to generate data that is entirely synthetic. As data privacy regulations continue to evolve, it is essential to stay up-to-date with the latest techniques and best practices for data anonymization. By doing so, you can ensure that your data is protected and your business remains compliant. If you don’t want to worry about the risks of performing your data anonymization manually, consider giving MOSTLY AI a try and let us know how you get on!


TL;DR: To date, synthetic data is not a clearly defined term. Part of the reason why we lack a widely used synthetic data definition might be that a broad variety of synthetic data categories exists – and that different types of synthetic data are used for different purposes. Yet, if it’s privacy protection you are after, there is only one synthetic data category that is worth the hype. Only one that offers bulletproof privacy protection and near-perfect data utility: privacy-preserving, AI-generated synthetic data. This article aims to bring forward a definition for privacy-preserving, AI-generated synthetic data and, hopefully, clear up some of the confusion about what AI-generated synthetic data actually is.

Read on for a deeper dive into all the elements that make up the following definition of truly privacy-preserving, AI-generated synthetic data. You might also find this article worthwhile to learn how you can successfully sift through the constantly growing sea of synthetic data tools and solutions out there, to find those that truly offer privacy protection. But knowing that this post is rather extensive and that most of you probably wouldn’t get to the end, let me start with the definition of synthetic data right here on top ⬇️

The definition of privacy-preserving, AI-generated synthetic data

Synthetic data as a rapidly evolving technology is not yet a clearly defined term. While it was first mentioned decades ago, older types of synthetic data do not bear any resemblance to the powerful AI-generated synthetic data we have today. Although there are various application areas where AI-generated synthetic data is of value, only one category is relevant for privacy protection: privacy-preserving, AI-generated synthetic data.

Privacy-preserving, AI-generated synthetic data is defined as an anonymization technology that preserves data utility. It is artificial data created by a machine learning model trained on real-world data that accurately and granularly retains the statistical properties of the real data it was trained upon. Yet, it is generated with a holistic set of privacy mechanisms that ensure absolute, irreversible anonymization.

Privacy-preserving, AI-generated synthetic data is synthesized at the user-level, not only at the event-level. It is fully synthetic as opposed to partially synthetic. It does not contain any 1:1 relationships between real and synthetic data subjects – and it is impossible to re-identify.

To better understand the different elements of this AI-generated synthetic data definition, read on through the specific sections of this blog post.

Disclaimer: This post will focus on AI-generated structured synthetic data – think financial transactions, mobility data, or healthcare records – for the purpose of privacy protection. AI-generated unstructured synthetic data (like videos, images, voice, or long sequences of free text) is not included.

What’s wrong with most definitions of synthetic data? (Or: Why am I writing this blog post?)

Thinking back, it’s quite funny. For years I’ve been telling people why and how AI-generated synthetic data is replacing legacy anonymization techniques. Meanwhile, synthetic data’s potential is very well understood – and so are the risks of relying on outdated anonymization techniques.

Over time, this led to significantly more media coverage of synthetic data – which is exciting to see. Even though I sometimes stumble upon bad press for this emerging technology. Don’t get me wrong. I’m not against critical thinkers sharing their take on this or any other new technology. But in 90% of those articles I can’t help but notice that the authors mixed up different synthetic data types: AI-generated synthetic data and legacy approaches of synthetic data generation (or even worse: legacy anonymization techniques). And understandably so. Even though synthetic data is widely talked about, it is still not a clearly defined term. But what is the result of this lack of terminology? Confusion about what synthetic data actually is paired with various critical points that simply don’t apply to state-of-the-art synthetic data.

Another common theme of synthetic data articles is that different categories of AI-generated synthetic data get mixed up in order to put the technology's privacy-preservation capabilities into question – which doesn’t make sense when the category being evaluated serves a purpose other than privacy protection. So, it might be time for a proper definition of privacy-preserving, AI-generated synthetic data, with a focus on how you can differentiate it not only from legacy anonymization but also from legacy data synthesis techniques. Let me give it a shot (and hopefully help to clear up some of this confusion)!

What is the difference between legacy and AI-generated synthetic data?

Legacy synthetic data

Although the hype for synthetic data and its adoption in Fortune 500 companies started only in recent years, the term itself is not a new one. “Synthetic data” has been around for more than three decades. Back then, it referred to overly simplistic mock data (or dummy data). This type of synthetic data was rule-based – meaning a human or a machine followed simple rules to create artificial data points.

To give you a feel for that, you can try rule-based synthetic data generation yourself. Just write down a table with 10 individuals - 50% female and 50% male - and please come up with suitable given names, surnames, home addresses (let’s say in NYC), and birthdates. If you want to go wild, you could even add a shiny credit card transaction for each of your artificial customers. To do that, just write down the date/time, merchant, and amount of money that was spent.

Congrats, you’ve just generated rule-based synthetic data! (Assuming that you didn’t get too creative and came up with birthdates like the 17th of April 1622 or ZIP codes that ignore the five-digit rule of NYC ZIP codes.) Do that times 1,000 or 10,000 and you’ve built yourself a rule-based synthetic dataset to perform some basic software tests.
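For those who would rather let a script do the busywork, here is a minimal sketch of the same exercise in Python. The name pools, ZIP range, and merchants are made up for illustration – which is exactly the point of rule-based generation: every rule has to be written down by hand.

```python
import random
from datetime import date, timedelta

# Hand-written "rules": name pools, a rough NYC ZIP range, plausible birthdates, a few merchants.
FEMALE_NAMES = ["Emma", "Olivia", "Ava", "Sophia", "Mia"]
MALE_NAMES = ["Liam", "Noah", "Ethan", "James", "Lucas"]
SURNAMES = ["Smith", "Johnson", "Brown", "Garcia", "Miller"]
MERCHANTS = ["Grocery Store", "Coffee Shop", "Bookstore", "Pharmacy"]

def random_birthdate():
    # Another made-up rule: everyone is born between 1950 and 2000.
    return date(1950, 1, 1) + timedelta(days=random.randint(0, 365 * 50))

def make_customer(i):
    sex = "F" if i % 2 == 0 else "M"  # enforce the 50/50 split
    first_name = random.choice(FEMALE_NAMES if sex == "F" else MALE_NAMES)
    return {
        "name": f"{first_name} {random.choice(SURNAMES)}",
        "sex": sex,
        "zip": str(random.randint(10001, 10292)),  # rough NYC ZIP range
        "birthdate": random_birthdate().isoformat(),
        "merchant": random.choice(MERCHANTS),
        "amount": round(random.uniform(2, 200), 2),
    }

customers = [make_customer(i) for i in range(10)]
print(customers[0])
```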

The problem with this type of synthetic data is that it’s of limited use. It follows some rules that are already known to you or the machine creating it (like the distribution of 50% males and females, typical female first names and how ZIP codes in NYC are structured). While this might be sufficient for very basic tests, there won’t be any previously unknown insights hidden in rule-based synthetic data. So what does this mean for AI or analytics? That rule-based synthetic data is completely useless!

Legacy synthetic data (rule-based) – Privacy: private, although accidental matches are still possible. Utility: limited; frequently used for testing new features in software development.
AI-generated synthetic data – Privacy: private when supported by additional privacy mechanisms. Utility: almost 100%, retaining correlations and distributions on a granular level.
Comparison of synthetic data types, including legacy technologies

AI-generated synthetic data

But how is AI-generated synthetic data different from legacy rule-based synthetic data? The generator behind it is trained on high-dimensional, complex real-world data and – thanks to powerful deep learning algorithms – automatically extracts its patterns, correlations, time dependencies, and structures. Once the synthetic data generator is trained, it can be used to create one (or even multiple, if you would like) new synthetic data sets from scratch. These AI-generated synthetic datasets are fully anonymous and no longer include privacy-sensitive information. What they do include are the same patterns, correlations, and structures as the original training data, which are then readily available for you to uncover valuable insights – without infringing on your real customers’ privacy.

To sum up, when AI entered the synthetic data game, it brought an unprecedented level of data utility. In contrast to rule-based synthetic data, this new breed of synthetic data is suitable for a broad variety of use cases – ranging from AI training and advanced analytics to complex software testing, digital product development, and external data sharing.

A visual comparison of legacy and AI-generated synthetic data (here: an unstructured example), illustrating the difference in realism and utility.

Watch out, fake “synthetic data” is jumping on the bandwagon

With the ever-growing hype around synthetic data, it comes as no surprise that some vendors have decided to label as synthetic data what – if you look a little closer – turns out to be nothing more than perturbation, obfuscation, or one of the other legacy anonymization techniques. Sure, now that synthetic data is frequently covered in the media, adopted by major players in finance as well as other industries, and picked up by analyst firms like Gartner – who are strongly advising their enterprise customers to incorporate this emerging technology into their AI and data strategies – it is tempting to do a little re-branding and try to get yourself a slice of the pie.

But watch out! Merely putting a fancy “synthetic data” sign on top doesn’t get you the benefits, data quality, and protection levels of having truly privacy-preserving, AI-powered synthetic data generation under the hood. Just adding some noise to a dataset or using another perturbation technique comes with the full range of privacy risks – which we’ve extensively covered in previous blog posts, so I’ll refrain from reiterating them at this point. Instead, I’ll stick to urging you to look more closely when confronted with shiny “synthetic data” marketing materials.

AI-generation is just one piece of the synthetic data puzzle. Purpose is another.

As mentioned in the intro, even if we only look at AI-generated synthetic data technology, there are different categories of it. A good way to distinguish between them is by the purpose they are used for. The two most common ones are privacy protection and artificially creating more data where there is a lack of it.

Synthetic data for privacy protection of real-world data

The former needs plenty of real-world data to learn from. Subsequently, it uses this knowledge to create highly realistic and statistically representative, yet fully anonymous synthetic data. Think of this privacy-preserving category of AI-generated synthetic data as an enabling technology. It is used to anonymize existing datasets without destroying the original dataset’s value and utility. This helps organizations to safely unlock and innovate on top of their data assets in compliance with even the strictest privacy regulations like GDPR.

Synthetic data when not enough real-world data exists

The other most prominent category of AI-generated synthetic data focuses on creating more data where there’s a lack of it. Think of autonomous vehicles. During development and training, businesses and researchers want to make sure that they are safe to use and that they ALWAYS stop when an obstacle approaches the car. It must not make a difference from which of the 360 possible angles a rabbit is running onto the street. Whether there is blinding light, twilight, bright light, or nearly no light during nighttime. Rain, hail, or storm shouldn’t make a difference in the car’s ability to spot the obstacle and slam on the brakes either.

Synthetic image data for training computer vision algorithms - image courtesy of Anyverse

But how do you get all that training data without spending hundreds of thousands of dollars on a video crew tasked with capturing rabbits running onto streets on camera? Here, the second category of AI-generated synthetic data can help to get high-quality training data at significantly lower costs. Oftentimes, it just needs a little seed training data (for example, a few different videos of rabbits running onto the street) to get the synthetic data generation algorithm started. In our example, these seed video sequences plus the laws of physics (e.g. how shadows change depending on the source of light or time of day) help the algorithm to create a myriad of highly useful synthetic “rabbit running onto the street” videos. In turn, these AI-generated videos are used to train the self-driving car’s algorithm and help it become 100% rabbit-safe.

Without privacy mechanisms, synthetic data is NOT a privacy-enhancing technology

Plain AI-generated synthetic data might do a stellar job at preserving the utility of the original data it was trained on, but it does not automatically help you with privacy protection. No matter whether you are building an AI-powered synthetic data generator yourself, using open-source tools to generate your synthetic data, or relying on a trusted and tried vendor solution – it is essential that all the necessary privacy mechanisms and privacy checks are implemented in the synthetic data generation process. Without them, you risk leaking privacy-sensitive information into the newly created artificial data.

Which privacy mechanisms are essential, you might wonder? Some of them intuitively appear logical. For example, the prevention of overfitting. One must ensure that the AI model is not memorizing the real-world data it is trained on. Otherwise, you would pretty much get a copy of your privacy-sensitive source data.

The science of privacy-preserving synthetic data generation lies in extracting the generalizable patterns, structures, and insights hidden in a dataset, while leaving the personal secrets and privacy-sensitive parts behind. To pull that off, the prevention of overfitting is just one piece of the puzzle.
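To make the overfitting point a bit more concrete, here is one simple, illustrative privacy check along these lines: comparing the distance-to-closest-record of synthetic rows against a real holdout set. The data and the threshold are made up, and this is a generic sketch rather than any vendor's actual implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mean_distance_to_closest_record(candidates, reference):
    """For each candidate row, find the distance to its nearest neighbor in `reference`."""
    nn = NearestNeighbors(n_neighbors=1).fit(reference)
    distances, _ = nn.kneighbors(candidates)
    return distances.mean()

# Stand-in numeric datasets (in practice: encoded and scaled versions of your real tables).
rng = np.random.default_rng(0)
training_data = rng.normal(size=(1000, 5))   # what the generator was trained on
holdout_data = rng.normal(size=(1000, 5))    # real data the generator never saw
synthetic_data = rng.normal(size=(1000, 5))  # output of the generator

# If synthetic rows sit much closer to training records than unseen holdout rows do,
# the model has likely memorized (overfitted to) individual training records.
if mean_distance_to_closest_record(synthetic_data, training_data) < \
        0.8 * mean_distance_to_closest_record(holdout_data, training_data):
    print("Warning: synthetic rows are suspiciously close to training records - possible memorization.")
```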

But bulletproof privacy protection goes beyond protecting the sensitive information of individuals. You also want to account for group privacy and prevent membership attacks. This means that a truly privacy-preserving synthetic dataset protects not only the privacy of extreme individual outliers but also the sensitive information of small groups (think of the 5 individuals in country X suffering from a super rare disease). Here, privacy mechanisms like rare category protection (RCP) help to exclude extremely rare occurrences from the source data before the AI-powered synthetic data model is trained on it.
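As a simple illustration of the idea behind rare category protection, the sketch below replaces categories that occur fewer than a chosen number of times with a generic token before the generator ever sees the data. The threshold, token, and data are invented; production implementations (including MOSTLY AI's RCP) work differently under the hood.

```python
import pandas as pd

def protect_rare_categories(df: pd.DataFrame, column: str, min_count: int = 20) -> pd.DataFrame:
    """Replace categories that occur fewer than `min_count` times with a generic token,
    so the generator can never reproduce them."""
    counts = df[column].value_counts()
    rare = counts[counts < min_count].index
    return df.assign(**{column: df[column].where(~df[column].isin(rare), "_RARE_")})

# Example: only a handful of patients carry an extremely rare diagnosis.
patients = pd.DataFrame({"diagnosis": ["flu"] * 500 + ["rare disease"] * 5})
print(protect_rare_categories(patients, "diagnosis")["diagnosis"].value_counts())
```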

Often forgotten is that not only the specific data points but also the metadata and data structure need to be protected to prevent privacy leakage. How entities relate to and interact with each other is highly unique and oftentimes specific to individuals. This is easiest understood with graph data. Imagine a table that includes a list of all the calls and messages you – and 20 million other customers of a fictitious telco company – received and made or sent over the last month.

While it is intuitively clear that the caller IDs of the sender and receiver or the date, time, and duration of a call are highly privacy-sensitive, it is oftentimes overlooked that what is not in plain sight – the metadata, or the structure of how different entities relate to each other – is also in urgent need of protection. Just recently, an excellent paper was published in Nature that once again underlined what should be well understood: metadata is also data – and if it is about real-world individuals, then metadata is personal data that needs to be adequately protected.

The paper showed that it is easy to re-identify individuals just by looking at the patterns in the metadata. How frequently you call your respective contacts, at which time of the day, the duration of your calls, or the number of messages you exchange with your best friend Stacy are highly unique behavioral patterns that tend to stay constant over time. What the paper showed was that even if an adversary can’t link you to the call you made to President Biden on the 16th of May 2022 at 3:17 am to complain about the constant air traffic flying over your house and waking you up in the middle of the night, the simple fact that you tried to reach the US president 3 times over the course of a month to lodge your complaint is a behavioral pattern of yours (although presumably one that is not leading to much success). Pair that with your weekly overseas calls to Auntie Ann living in London and a few other calling habits of yours, and you’ve successfully created a digital behavioral fingerprint that makes it easy to single you out: from the supposedly anonymous call records of May 2022, as well as from those of January the same year and probably even August the year before.

Who is President Biden talking to? It's not hard to find out.

What I’m getting at is that we have entered a time where direct linkage attacks are just one of your many privacy concerns. Yes, you still need to prevent (supposedly) anonymous data from being linked with auxiliary information that directly matches parts of the (not so anonymous) dataset. But nowadays, adversaries can also re-identify your (not so) anonymous data by matching behavioral patterns or profiles across two datasets from different periods of time – and easily so if they throw AI-powered profiling attacks into the mix. To read more about this equally mind-blowing and scary new breed of privacy attacks (and how our synthetic data protects against them), I highly recommend you check out our blog post on AI-based re-identification attacks.

As you can see, privacy protection is complex and needs to be approached holistically. While this is by no means an exhaustive list of all the privacy mechanisms that should go into a truly privacy-preserving synthetic data generator (that’s something for a future blog post ;-)), what I want you to remember is this: you must implement a whole set of privacy mechanisms into an AI-powered synthetic data generator to successfully protect privacy – but if you do, you will be able to achieve absolute, irreversible anonymization.

To truly protect privacy, make sure you’re getting fully and not partly AI-generated synthetic data

Admittedly, we are now entering the weeds of selecting a truly privacy-preserving synthetic data solution. But for the sake of completeness: if you want bulletproof privacy protection, make sure that you are getting fully and not partly AI-generated synthetic data. To be clear, there are some highly interesting use cases for partly synthetic data, and it can be an invaluable tool to augment real-world data. But it’s not the right approach if privacy protection is what you are after.

How to distinguish between partly and fully synthetic data?
The former pairs either...

a)  real-world data, or

b)  traditionally anonymized data (which, as you know, is simply not anonymous in the era of big data anymore. Thus, it is basically real-world data 😉)

...with synthetic data.

So what you get out of this is a mix of partly real and partly artificially created data – and as you could have guessed, that’s not privacy-safe at all! Sure, the synthetic part won’t be your privacy problem. But if there are bits and pieces of real, highly sensitive data in there, then the whole thing becomes personal data and must be protected accordingly. Your takeaway here? Merely applying data synthesis (or differential privacy, for that matter) column by column, as opposed to the dataset in its entirety, is more wishful thinking than actual privacy protection.

For completeness, there is another category of partly synthetic data that might not be that easy to spot. Remember the metadata/data structure problem I pointed out earlier? If you want to synthesize not only one dataset but a whole bunch of tables within your database, it is not a good idea to set up your synthetic data generator in a way that it just mindlessly fills out one table after the other with fresh batches of representative synthetic data.

At first glance, this might look innocent and perfectly privacy safe. And why shouldn’t it be? After all, each table on its own contains fully synthetic data and not a single real-world data point. But there is a catch! If you’re not synthesizing multiple tables in a way that all the fine-grained relations between the different tables are learned, then your synthetic data generator will fail to detect and remove potentially privacy-sensitive patterns in that relational metadata – and thereby introduce a privacy risk via the backdoor.

The cardinal rule of data synthesis: anonymize at the user-level and not just at the event-level

A similar problem can be encountered when data that should be synthesized (or anonymized) is not set up in a way that user-level anonymization can be achieved. What do I mean by that? Whenever we want to anonymize data, it is real-world individuals we want to protect. Thus, it is not sufficient to apply any anonymization technology only on the event- or trip-level.

Just recently this was nicely illustrated in the Nature paper “On the difficulty of achieving Differential Privacy in practice: user-level guarantees in aggregate location data”. The researchers analyzed a differentially private location dataset with the trips of 300 million Google Maps users over the course of one year.

As most of you know, differential privacy offers mathematical privacy guarantees that are quantified with the so-called epsilon value. The lower the value, the stronger the protection (below 1 is oftentimes recommended in academia). The higher the epsilon parameter, the closer you get to privacy-washing and guarantees of complete meaninglessness – which is why Apple and other big brands already earned harsh criticism for their differential privacy practices in the past. But back to our study – what was the problem here? The epsilon value that was reported was 0.66. “Awesome, completely anonymous!” one might think…but the researchers uncovered that the differential privacy guarantees were calculated for the trip-level only, based on the assumption that any one of the 300 million users did not contribute more than one of her/his trips to the dataset.

Unfortunately, this wasn’t the case. Individual users contributed more than one trip to the dataset. To get a more accurate picture of how well the actual users (and not only their individual trips) were protected, the researchers went on and conservatively estimated the real, user-level epsilon value of this dataset to be closer to 46. Imagine that, 46! And that was for just one week’s worth of trip data. Remember, the full dataset contained the trips of a whole year. Thus, the actual epsilon value for the entire dataset could be 52 times as high – as large as 2,392 – which, due to the exponential nature of this parameter, is more a guarantee for non-privacy than anything else.
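For the curious, here is the rough back-of-the-envelope arithmetic behind those numbers, assuming basic sequential composition of weekly privacy budgets (the paper's exact accounting may differ):

```python
# Reported and estimated differential privacy budgets from the study discussed above.
reported_trip_level_epsilon = 0.66
estimated_user_level_epsilon_per_week = 46  # the researchers' conservative estimate
weeks_in_dataset = 52

# Under basic sequential composition, privacy budgets simply add up across releases.
worst_case_yearly_epsilon = estimated_user_level_epsilon_per_week * weeks_in_dataset
print(worst_case_yearly_epsilon)  # 2392 - effectively meaningless, since the guarantee
                                  # weakens on the order of e**epsilon
```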

I think there are few examples out there that more powerfully illustrate how important it is to apply data synthesis, differential privacy, or any other anonymization technology at the user level (i.e. to ALL of your trips, or all of your “events” like financial transactions, contained in the dataset) and not merely to the individual trips or events.

What does this mean for data synthesis in practice? Set up your tables in a way that ALL events belonging to a single data subject are connected to said user via a unique ID. Thereby you ensure that privacy protection is applied on the right level and that you achieve what all the emerging privacy regulations want you to achieve: that the privacy of individuals, of real-world humans is kept safe and secure.

Example of a two-table setup, with primary and foreign keys, ready for generating privacy-preserving, AI-generated synthetic data

In the example dataset above, you see baseball players on the left and the seasons they played on the right. The first four seasons all belong to a single player and are connected to this person via a unique ID. This ID is the primary key in the players table and is referenced as a foreign key in the seasons table, which is how the two tables are linked. Generating synthetic data with this setup guarantees the privacy of individuals when using MOSTLY AI's synthetic data generator.
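For illustration, here is what such a two-table setup could look like in code. The column names and values are made up; the point is simply that every season row carries the player's ID as a foreign key, so all events belonging to one real-world person can be synthesized – and protected – together.

```python
import pandas as pd

# Subject table: one row per player, identified by a primary key.
players = pd.DataFrame({
    "player_id": [1, 2, 3],
    "name": ["Player A", "Player B", "Player C"],
    "country": ["US", "US", "DO"],
})

# Linked table: one row per season; the foreign key ties every season back to its player,
# so user-level (not just event-level) anonymization becomes possible.
seasons = pd.DataFrame({
    "season_id": [1, 2, 3, 4, 5, 6],
    "player_id": [1, 1, 1, 1, 2, 3],  # the first four seasons all belong to player 1
    "year": [2018, 2019, 2020, 2021, 2021, 2021],
    "batting_avg": [0.251, 0.262, 0.249, 0.270, 0.301, 0.288],
})
```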

Closing thoughts on this AI-generated synthetic data definition

Now that you’ve come this far, you are equipped with the knowledge to sift through the myriad of synthetic data tools and solutions and to identify the ones that truly offer impeccable privacy protection. Even more importantly, I hope you’ve gained a deeper understanding of why I consider all these different elements important to include in any definition of privacy-preserving, AI-generated synthetic data.

Rest assured that we at MOSTLY AI take privacy very seriously – and that all the privacy risks highlighted above (and even those I didn’t have space for in this post) are covered by our synthetic data generation software. If you use MOSTLY AI’s synthetic data generator, you enjoy the benefits of truly privacy-preserving, AI-generated synthetic data: unparalleled data utility paired with bulletproof anonymization.

As for the definition, I’m curious to hear your comments and thoughts. Personally, I can’t wait to continue working on and refining this AI-generated synthetic data definition as part of the IEEE Synthetic Data IC Expert Group we recently established. If you, too, want to get involved and collaborate with an international group of synthetic data vendors and experts from corporates, academia, and the regulatory side – join us!

Privacy enhancing technologies protect data privacy in new ways. Legacy data anonymization techniques can no longer fully protect privacy. In their effort to mask or obfuscate data, legacy anonymization techniques destroy data utility. As a result, these old technologies should not be considered privacy enhancing technologies, or PETs.

Examples of privacy enhancing technologies

There are five major emerging privacy enhancing technologies that can be considered true PETs: homomorphic encryption, AI-generated synthetic data, secure multi-party computation, federated learning and differential privacy. These new generation privacy enhancing technologies are crucial for using personal data in safe ways.

Organizations handling sensitive customer data, like banks, are already using PETs to accelerate AI and machine learning development and to share data outside and across the organization. Most companies will end up using a combination of different PETs to cover all of their data use cases. Let's see how the five most promising privacy enhancing technologies work and when they come in handy!

1. Homomorphic encryption

Homomorphic encryption is one of the most well-known privacy enhancing technologies. It allows third parties to process and manipulate data in its encrypted form. In simple terms: whoever performs the analysis never actually gets to see the original data. But that is also one of the severe limitations of this technology. It's not helpful when the person doing the analysis has no prior knowledge of the dataset, since data exploration is virtually impossible.

Another limitation of homomorphic encryption is that it's incredibly compute-intensive and has restricted functionality. As a result, some queries are not possible on encrypted data. It's one of the least mature, yet promising, technologies when it comes to anti-money laundering and fraud detection.

2. AI-generated synthetic data

AI-generated synthetic data is one of the most versatile privacy enhancing technologies.  AI-powered synthetic data generators are trained using real data. After the training, the generator can create statistically identical but flexibly sized datasets. Since none of the individual data points match the original data points, re-identification is impossible.

The most popular synthetic data use cases include data anonymization, advanced analytics, AI, and machine learning. The process of synthesization also allows for different data augmentation processes. Upsampling rare categories in a dataset can make AI algorithms more effective. Subsetting large datasets into smaller, but representative, batches is useful for software testing. Advanced synthetic data platforms offer statistically representative data imputation and rebalancing features. Since synthetic datasets do not maintain a 1:1 relationship with the original data, subjects are impossible to re-identify. As a result, synthetic data is not suitable for use cases where re-identification is necessary.
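To make the core idea tangible, here is a deliberately simplified sketch: fit a generative model on real tabular data, then sample brand-new rows from it. A Gaussian mixture stands in for the far more capable deep learning models that production synthetic data platforms use, and no privacy mechanisms are included here.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

# Toy "real" dataset: two numeric attributes for 1,000 customers.
rng = np.random.default_rng(42)
real = pd.DataFrame({
    "age": rng.normal(45, 12, 1000).clip(18, 90),
    "monthly_spend": rng.lognormal(5, 0.5, 1000),
})

# Fit a simple generative model on the real data, then sample entirely new rows from it.
model = GaussianMixture(n_components=5, random_state=0).fit(real)
samples, _ = model.sample(1000)
synthetic = pd.DataFrame(samples, columns=real.columns)

print(real.describe().loc[["mean", "std"]])
print(synthetic.describe().loc[["mean", "std"]])  # similar statistics, but no shared rows
```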

3. Secure multi-party computation

Secure multi-party computation is an encryption methodology. It allows multiple parties to collaborate on encrypted data. Similarly to homomorphic encryption, the goal here is to keep data private from the participants in the computational process. Key management, distributed signatures, and fraud detection are some of the possible use cases. The limitation of secure multi-party computation is the resource overhead: pulling off SMPC successfully is pretty tricky - everything has to be timed right and processing has to happen synchronously.

4. Federated learning

Federated learning is a specific form of machine learning. Instead of feeding the data into a central model, the data stays on the device, and multiple model versions are trained and operated locally. The results of these local training runs are model updates, which are fed back into and improve the central model. This decentralized form of machine learning is especially prevalent in IoT applications.

The training takes place on edge devices, such as mobile phones. Federated learning on its own doesn’t actually protect privacy; it only eliminates the need for data sharing in the model training process. However, the fact that data isn’t shared doesn’t mean privacy is safe. The model updates in transit from the edge devices could also be intercepted and leak private information. To prevent this, federated learning is often combined with another PET, like differential privacy.
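As a rough illustration of the mechanics, here is a minimal federated-averaging-style sketch: each device takes a local training step on its own data, and only the resulting model weights are sent back and averaged. Real deployments add secure aggregation and, as noted above, often differential privacy; the data and model below are invented.

```python
import numpy as np

def local_update(global_weights, local_data, lr=0.1):
    """One local training step on a device: a single gradient step of a linear model
    toward the device's own data (a stand-in for real on-device training)."""
    X, y = local_data
    gradient = X.T @ (X @ global_weights - y) / len(y)
    return global_weights - lr * gradient

def federated_round(global_weights, devices):
    """Devices train locally; only their updated weights leave the device.
    The server averages the updates into a new global model."""
    updates = [local_update(global_weights, data) for data in devices]
    return np.mean(updates, axis=0)

rng = np.random.default_rng(0)
devices = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(10)]

weights = np.zeros(3)
for _ in range(20):
    weights = federated_round(weights, devices)
print(weights)  # the central model improved without any raw data ever being shared
```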

5. Differential privacy

Differential privacy is not so much a privacy-enhancing technology in itself as a mathematical definition of privacy. Differential privacy quantifies the privacy leakage that occurs when analyzing a differentially private database. This measure is called the epsilon value. In an ideal world - or with an epsilon value of 0 - the result of said analysis wouldn’t differ, no matter whether a given individual is present in the database or not.

The higher the epsilon, the more potential privacy leakage can occur. In academia, epsilon values below 1 are recommended to achieve strong anonymization. In practice, it’s still a challenge to determine a suitable epsilon value. This is important to keep in mind, as differential privacy does not automatically guarantee adequate privacy protection. It simply offers a mathematical guarantee for the upper boundary of potential privacy leakage. So getting the epsilon value right is of utmost importance. It needs to be low enough to protect privacy, but not so low that the noise that has to be added to achieve it destroys data utility.
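The classic way to see this trade-off is the Laplace mechanism, where the noise added to a query result scales with 1/epsilon. The sketch below is a generic textbook illustration with made-up numbers, not tied to any particular product:

```python
import numpy as np

def dp_count(true_count, epsilon, sensitivity=1.0, rng=np.random.default_rng()):
    """Release a count via the Laplace mechanism: the noise scale is sensitivity / epsilon,
    so a smaller epsilon means stronger privacy but noisier answers."""
    return true_count + rng.laplace(scale=sensitivity / epsilon)

true_count = 1000  # e.g. the number of customers matching some query
for eps in [0.1, 1.0, 10.0]:
    print(eps, round(dp_count(true_count, eps), 1))
# Expect heavy noise for epsilon 0.1, near-exact answers (and a weak guarantee) for epsilon 10.
```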

More often, privacy practitioners use it in combination with another PET, such as federated learning.

Which Privacy Enhancing Technology to use when?

Different privacy-enhancing technologies' benefits and limitations need to be weighed carefully. Some of them are more use case agnostic than others, but most organizations will have to invest in more than one PET to cover all use cases. Some legacy anonymization techniques might also have a place in the data tech stack as additional measures, but their use should be limited.

Which privacy enhancing technology to choose when? Image courtesy of Mobey Forum

The synthetic data guide

If you would like to learn about adding AI-generated synthetic data to your privacy stack, download the complete guide with case studies!

A new, powerful breed of privacy attacks is emerging. One that uses AI to re-identify individuals based on their behavioral patterns. This advent has broad implications for organizations, both from compliance as well as from a risk perspective, as legacy anonymization measures are highly vulnerable. And it’s these risks that drive the surge in demand for privacy-preserving synthetic data, enabled by MOSTLY AI, as a safe and future-proof alternative - even against AI-based re-identification attacks.

The ineffectiveness of data masking

Modern-day privacy regulations, like GDPR and CCPA, consider a dataset to be anonymous, if none of the contained records can “reasonably” be re-identified, i.e. be linked to a natural person or a household. Given that, it is of critical importance to understand how re-identification works, and how it continues to evolve thanks to technological advancements (as is e.g. explicitly required by GDPR recital 26).

There used to be a time, not that long ago, when the masking of direct identifiers, like full names or social security numbers, was deemed sufficient to “anonymize” a dataset (see here for a more thorough historical perspective). But it is the simple combination of the remaining attributes that allows for the instant re-identification of individual subjects. While masking increases the effort needed to re-identify manually, and thus might look like an appropriate measure, it doesn’t make computer-assisted attacks any more difficult. It’s as simple as running a basic database query to successfully single out individuals within a huge sea of data.
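To illustrate how little effort such a query takes, here is a toy example with invented records: names and IDs are removed, yet a handful of quasi-identifiers is enough to single a person out.

```python
import pandas as pd

# A "masked" dataset: direct identifiers removed, quasi-identifiers kept.
records = pd.DataFrame({
    "zip_code": ["10001", "10001", "10002", "10003"],
    "birthdate": ["1984-03-02", "1991-07-19", "1984-03-02", "1975-11-30"],
    "sex": ["F", "M", "F", "M"],
    "diagnosis": ["diabetes", "flu", "asthma", "hypertension"],
})

# One basic query on background knowledge (ZIP + birthdate + sex) singles out a unique record.
match = records.query("zip_code == '10001' and birthdate == '1984-03-02' and sex == 'F'")
print(match)  # exactly one row - the "anonymous" diagnosis now points to a specific person
```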

One might even argue that pseudonymization techniques like masking and transformations are harmful, as they instill a false sense of security, leading organizations to risky data sharing practices. Due to the absence of direct identifiers, individuals without privacy training might wrongly assume that a redacted dataset is well protected, and share or process it accordingly. Security that is assumed to protect when it does not is the worst possible kind, as it leads an organization to lower its guard.

But aside from a lack of knowledge, there is certainly also intentional ignorance of the problem, which can be encountered wherever privacy runs counter to commercial interests. Particularly among data brokers: organizations that resell insufficiently anonymized personal data, like mobility or browsing behavior, to third parties. They bet on data protection authorities not enforcing the law, and/or on the broader public not caring enough, as they presumably lack the technical expertise. But one can tell that times are changing if the New York Times, the Guardian, as well as your favorite late-night host start to pick up the subject.

Figure 1. John Oliver explaining Linkage Attacks to his audience.

The well-established risk of linkage attacks

The previously described type of re-identification is also known as a linkage attack. Linkage attacks work by linking a not-yet-identified dataset (e.g. a database of supposedly anonymous medical health records) with some easier-to-obtain auxiliary information on specific individuals (e.g. the day and time that a politician gave birth). The attack is then performed simply by looking for overlapping matches between the common attributes of these two sources of information. Once such a match is found, the direct identifiers can be attributed to the supposedly anonymous data records. In the example above, finding a subject that gave birth at the same date and time as the politician would allow an attacker to attribute all the other medical records of that subject to the named politician - even though no direct identifiers were contained in the accessed database. Anyone with a basic knowledge of data querying techniques can perform such a “hack”, thus it is certainly “reasonably” likely to be performed by a malicious actor.
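In code, the attack really is just a join on the overlapping attributes. The datasets below are invented purely for illustration:

```python
import pandas as pd

# Supposedly anonymous health records (direct identifiers removed).
health_records = pd.DataFrame({
    "delivery_datetime": ["2023-05-14 03:12", "2023-05-14 09:47"],
    "zip": ["10001", "10002"],
    "medical_history": ["history A", "history B"],
})

# Auxiliary information an attacker can easily obtain (e.g. from news coverage).
auxiliary = pd.DataFrame({
    "name": ["Well-known politician"],
    "delivery_datetime": ["2023-05-14 03:12"],
    "zip": ["10001"],
})

# The linkage attack: join on the attributes both sources have in common.
linked = auxiliary.merge(health_records, on=["delivery_datetime", "zip"])
print(linked)  # the politician's medical history, re-identified
```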

Figure 2. Linkage Attacks rely on an overlap of the data points of a released dataset, and some identified auxiliary data.

But linkage attacks are by no means only a concern for politicians and other prominent individuals in your customer database. They are similarly easy to perform on people like you and me. Other prominent examples of this type of attack include the re-identification of NY taxi trips, the re-identification of telco location data, the re-identification of credit card transactions, the re-identification of browsing data, the re-identification of health care records, and so forth. The prominent case of re-identified Netflix users also involved a type of linkage attack. The notable difference there is that Netflix had actually tried to prevent attacks, not only by removing all user attributes but also by adding random noise to obfuscate single records. However, as it turned out, these measures were still ineffective, and a linkage attack based on fuzzy matches could easily be performed.

The new rise of powerful profiling attacks

Enter a new breed of even more capable privacy attacks that leverage AI to re-identify individuals based on their behavioral patterns: profiling attacks. While it has been known conceptually that these types of profiling attacks are possible, their feasibility and ease of implementation have only recently been demonstrated in peer-reviewed papers. Firstly, and most prominently, by a group of leading privacy researchers, including Yves-Alexandre de Montjoye, from Imperial College London in their recent Nature paper. There they showcase how to successfully re-identify call data records purely based on the implicit relationships between subjects, i.e. on the graph topology. Secondly, joint research by the Vienna University of Economics and Business and MOSTLY AI demonstrated the applicability of the approach in a paper on re-identifying browsing patterns.

Figure 3. Profiling Attacks do NOT require an overlap of data points between a released dataset, and some identified auxiliary data.

The basic idea is simple, and borrows from modern-day face recognition algorithms. An AI model is trained specifically for the re-identification of subjects by tasking it to correctly match a randomly selected anchor sample (e.g., an image of Arnold Schwarzenegger) with one of two alternative samples, where only one stems from the same subject (i.e., another image of Arnold, plus one from a different actor). See Figure 4 for a basic illustration of the concept - for faces, for signatures, and for browsing behavior. In all of these applications the model has to learn to extract the characteristic traits, the uniquely identifying patterns, the “identifying fingerprint” of a data record, while disregarding any other irrelevant information. That characteristic information can then be distilled from any new data in the form of a numeric vector, which in turn allows one to define a distance measure between records of individual subjects. Equipped with that, the profiling attack itself is as simple as looking for the nearest neighbor of the identified auxiliary data record within the not-yet-identified database.

Figure 4. AI-based Re-Identification via Triplet-Loss Learning
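A highly simplified sketch of the matching step is shown below. It assumes a triplet-loss encoder has already turned each subject's behavioral records into a fixed-length embedding; here those embeddings are faked with random vectors purely to demonstrate the nearest-neighbor lookup.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)

# Pretend embeddings: one 64-dimensional behavioral "fingerprint" per subject in the
# released, supposedly anonymous dataset (in reality produced by a trained encoder).
released_embeddings = rng.normal(size=(10_000, 64))

# Auxiliary data on a known target, embedded with the same encoder; here simulated as
# a slightly noisy copy of subject 1234's fingerprint.
target_embedding = released_embeddings[1234] + rng.normal(scale=0.05, size=64)

# The attack itself is just a nearest-neighbor search in embedding space.
index = NearestNeighbors(n_neighbors=1).fit(released_embeddings)
_, match = index.kneighbors(target_embedding.reshape(1, -1))
print(match[0][0])  # 1234 - the target is singled out in the "anonymous" data
```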

What is truly remarkable, and has a significant impact on the scope of privacy regulations, is the efficiency of this methodology. Even though neither geographic, nor temporal, nor subject-level information, nor any overlapping event data was available, the researchers were able to successfully re-identify the majority of subjects with a generic, domain-agnostic approach. One that works for re-identifying faces, signatures, as well as any sequence of tabular data. The authors further demonstrated the robustness of the method. Creţu et al. showed that the characteristic relations within call data records remained stable across several months, thus allowing re-identification based on data collected at a significantly later stage and raising major concerns about current data retention policies. And Vamosi et al., on the other hand, showed the robustness towards data perturbations. Even in cases where a third of the data points were completely randomly substituted, the re-identification algorithm found the correct match 27% of the time in a pool of thousands of candidates. Thus, AI-based re-identification is shown to be highly robust against noise. If we expand the search to find matches within the Top 10 or Top 100 nearest neighbors, the success rate goes up significantly. This also means that just a single additional, seemingly innocuous data point - like age or zip code - will likely result in a perfect match once combined with the power of a profiling attack.

Synthetic data is immune to AI-based re-identification attacks

The three basic techniques applied by legacy anonymization solutions are 1) the removal of attributes, 2) the generalization of attributes, and 3) the obfuscation or transformation of attributes. However, by now we have arrived in an era where dozens, hundreds, if not thousands of data points are being gathered for each and every individual, which together result in unique digital fingerprints that make it ridiculously easy for AI to find matching behavioral patterns. The more attributes of an individual are captured, the more that individual stands out in today's high-dimensional data spaces. And it is due to this mathematical law of high dimensions that any of these legacy anonymization methods fail to offer protection against linkage and profiling attacks unless they destroy almost the entirety of the contained information.

Thus, leading organizations that recognize the business value of customer trust are stopping the risky practice of transferring actual production data into non-production environments. A customer's data should ideally only be used for serving that actual customer. For all other purposes, they break the susceptible 1:1 link to actual data subjects and adopt statistically representative synthetic data at scale.

Yet, as we’ve also demonstrated before, synthetic data is not automatically private by design. It needs to be properly empirically vetted. The distance measure from the newly introduced AI-based profiling attacks now provides one of the strongest possible assessments of the privacy of synthetic behavioral data. And with that, it is shown that synthetic data by MOSTLY AI - thanks to its range of in-built privacy mechanisms - is truly privacy-preserving. And thus fully anonymous under GDPR, CCPA and in the strictest possible sense.

Hence, the news is out: The time for legacy anonymization is up and privacy-preserving synthetic data is the future. If you are ready to embark on that future, don’t hesitate to contact us, and we are happy to onboard you to MOSTLY AI - the leader in structured synthetic data.

Credits: The research collaboration between WU Wien and MOSTLY AI is supported by the "ICT of the Future” funding programme of the Austrian Federal Ministry for Climate Action, Environment, Energy, Mobility, Innovation and Technology.

TL;DR: Synthetic financial data is the fuel banks need to become AI-first and to create cutting-edge services. In this report, you can read about:

  • banking technology trends in 2022 from superapps to personalized digital banking
  • data privacy legislations affecting the banking industry in 2022
  • the challenges in AI/ML development, testing and data sharing that synthetic data can solve
  • the most valuable data science and synthetic data use cases in banking: customer acquisition and advanced analytics, mortgage analytics, credit decisioning and limit assessment, risk management and pricing, fraud and anomaly detection, cybersecurity, monitoring and collections, churn reduction, servicing and engagement, enterprise data sharing, and synthetic test data for digital banking product development
  • synthetic data engineering: how to integrate synthetic data in financial data architectures

Banks and financial institutions are aware of their data and innovation gaps and AI-generated synthetic data is their best bet. According to Gartner:

By 2030, 80 percent of heritage financial services firms will go out of business, become commoditized or exist only formally but not competing effectively.

A pretty dire prophecy, but nonetheless realistic, with small neobanks and big tech companies eyeing their market. There is no way to run but forward. 

The future of banking is all about becoming AI-first and creating cutting-edge digital services coupled with tight cybersecurity. In the race to a tech-forward future, most consultants and business prophets forget about step zero: customer data. In this blog post, we will give an overview of the data science use cases in banking and attempt to offer solutions throughout the data lifecycle. We'll concentrate on the easiest-to-deploy and highest-value synthetic data use cases in banking. We'll cover three clusters of synthetic data use cases: AI, advanced analytics, and machine learning; software testing; and data sharing. But before we dive into the details, let's talk about the banking trends of today.

The pandemic accelerated digital transformation, and the new normal is here to stay. According to Deloitte, 44% of retail banking customers use their bank's mobile app more often. At Nubank, a Brazilian digital bank, the number of accounts rose by 50%, going up to 30 million. It is no longer the high-street branch that will decide the customer experience. Apps become the new high-touch, flagship branches of banks where the stakes are extremely high. If the app works seamlessly and offers personalized banking, customer lifetime value increases. If the app has bugs, frustration drives customers away. Service design is an excellent framework for creating distinctive personalized digital banking experiences. Designing the data is where it should all start.

A high-quality synthetic data generator is one mission-critical piece of the data design tech stack. Initially a privacy-enhancing technology, synthetic data generators can generate representative copies of datasets. Statistically the same, yet none of the synthetic data points match the original. Beyond privacy, synthetic data generators are fantastic data augmentation tools too. Synthetic data is the modeling clay that makes this data design process possible. Think moldable test data and training data for machine learning models based on real production data.

Download the Banking on synthetic data ebook!

Hands on advice from industry experts and a complete collection of synthetic data use cases in banking.

The rise of superapps is another major trend financial institutions should watch out for. Building or joining such ecosystems makes absolute sense if banks think of them as data sources. Data ecosystems are also potential spaces for customer acquisition. With tech giants entering the market with payment and retail banking products, data protectionism is rising. However, locking up data assets is counterproductive, limiting collaboration and innovation. Sharing data is the only way to unlock new insights. Especially for banks, whose presence in their customers' lives is not easy to scale unless via collaborations and new generation digital services. Insurance providers and telecommunications companies are the first obvious candidates. Other beyond-banking service providers could also be great partners, from car rental companies to real estate services, legal support, and utility providers. Imagine a mortgage product that comes with a full suite of services needed throughout a property purchase. Banks need to create a frictionless, hyper-personalized customer experience to harness all the data that comes with it. 

Another vital part of this digital transformation story is AI adoption. In banking, it's already happening. According to McKinsey,

"The most commonly used AI technologies (in banking) are: robotic process automation (36 percent) for structured operational tasks; virtual assistants or conversational interfaces (32 percent) for customer service divisions; and machine learning techniques (25 percent) to detect fraud and support underwriting and risk management."

It sounds like banks are running full speed ahead into an AI future, but the reality is more complicated than that. Due to the legacy infrastructures of financial institutions, the challenges are numerous. Usually, there is no clear strategy, or only fragmented ones with no enterprise-wide scale. Different business units operate almost completely cut off from each other, with limited collaboration and practically no data sharing. These fragmented data assets are the single biggest obstacle to AI adoption. McKinsey estimates that AI technologies could potentially deliver up to $1 trillion of additional value in banking each year. It is well worth the effort to unlock the data that AI and machine learning models so desperately need. Let's take a look at the number one reason (or rather excuse) banks and financial institutions hide behind when it comes to AI/AA/ML innovation: data privacy.

The state of data privacy in banking in 2022

Banks have always been the trustees of customer privacy. Keeping data and insights tightly secured has prevented banks from becoming data-centric institutions. What's more, an increasingly complex and restrictive legislative landscape makes it difficult to comply globally.

The European data privacy landscape in 2022

Let’s be clear. The ambition to secure customer data is the right one. Banks must take security seriously, especially in an increasingly volatile cybersecurity environment. However, this cannot take place at the expense of innovation. The good news is that there are tools to help. Privacy-enhancing technologies (PETs) are crucial ingredients of a tech-forward banking capability stack. It's high time for banking executives, CIOs, and CDOs to get rid of their digital banking blindspots. Banks must stop using legacy data anonymization tools that endanger privacy and hinder innovation. Data anonymization methods, like randomization, permutation, and generalization, carry a high risk of re-identification or destroy data utility.

Maurizio Poletto, Chief Platform Officer at Erste Group Bank AG, said in The Executive's Guide to Accelerating Artificial Intelligence and Data Innovation with Synthetic Data:

"In theory, in banking, you could take real account data, scramble it, and then put it into your system with real numbers, so it's not traceable. The problem is that obfuscation is nice, and anonymization is nice, but you can always find a way to get the original data back. We need to be thorough and cautious as a bank because it is sensitive data. Synthetic data is a good way to continue to create value and experiment without having to worry about privacy, particularly because society is moving toward better privacy. This is just the beginning, but the direction is clear."

Modern PETs include AI-generated synthetic data, homomorphic encryption, and federated learning. They offer a way out of the data dilemma in banking. Data innovators in banking should choose the appropriate PET for the appropriate use case. Encryption solutions should be looked at when it is necessary to decrypt back to the original data. Anonymized computation, such as federated learning, is a great choice when models can be trained on users' mobile phones. AI-generated synthetic data is the most versatile privacy-enhancing technology, with just one limitation: synthetic datasets generated by AI models trained on original data cannot be reverted back to the original. Synthetic datasets are statistically identical to the original datasets they were modeled on. However, there is no 1:1 relationship between the original and the synthetic data points. This is the very definition of privacy. As a result, AI-generated synthetic data is great for specific use cases: advanced analytics, AI and machine learning training, software testing, and sharing realistic datasets that cannot be traced back to real individuals. Synthetic data is not a good choice for use cases where the data needs to be reverted back to the original, such as information sharing for anti-money laundering purposes, where perpetrators need to be re-identified. Let's look at a comprehensive overview of the most valuable synthetic data use cases in banking!

The most valuable synthetic data use cases in banking

Synthetic data generators come in many shapes and forms. In the following, we will be referring to MOSTLY AI's synthetic data generator. It is the market-leading synthetic data solution able to generate synthetic data with high accuracy. MOSTLY AI's synthetic data platform comes with advanced features, such as direct database connection and the ability to synthesize complex data structures with referential integrity. As a result, MOSTLY AI can serve the broadest range of use cases with suitably generated synthetic data. In the following, we will detail the lowest hanging synthetic data fruits in banking. These are the use cases we have seen to work well in practice and generate a high ROI.

AI/AA/ML

Challenges:
  • Data locked away for privacy reasons
  • Training data quality is suboptimal due to legacy anonymization
  • Training data is erroneous due to embedded historic bias
  • Model performance is not good enough to be put into production
  • Domain knowledge is missing due to restricted data
  • Mortgage analytics models miss out on next-generation data assets, such as transaction data and location data

How synthetic data can help:
  • Synthetic data can be used freely
  • Synthetic training data is as good as real with up to 99% accuracy
  • Synthetic data generation can fix biases
  • Upsampling via synthesization can improve ML performance
  • Synthetic data can be injected into models and linked to other datasets
  • Synthetic geolocation data from mobile service providers and synthetic behavioral data of transactions are compliant and accurate

TESTING

Challenges:
  • Production data is off-limits for privacy reasons
  • Manual data generation misses business rules
  • Data can't be shared with third-party test teams or other lines of business
  • Fragmented data for testing omnichannel journeys
  • Test environments are slow to build (40+ days)
  • Complex database structures are impossible to recreate with referential integrity

How synthetic data can help:
  • Synthetic data can simulate production data accurately
  • Synthetic data implicitly picks up on all business rules
  • Synthetic data is free to share, even across borders and for cross-team tests
  • Synthetic data can be shared to create omnichannel testing stories
  • Synthetic test data generation is fast and on-demand, shortening sprints
  • Multi-table synthetic data can be created with referential integrity intact

DATA SHARING

Challenges:
  • Regulations prohibit cross-border data sharing
  • Vendor selection is suboptimal due to lack of bank-specific test data
  • Distinct business lines operate siloed data reserves

How synthetic data can help:
  • Synthetic data is not personal data and is free to share across borders
  • Synthetic data is free to share with third parties and provides realism
  • A synthetic data sandbox provides access to data across the enterprise for a 360-degree customer view

The 15 highest value synthetic data use cases in banking

Synthetic data for AI, advanced analytics, and machine learning

Synthetic data for AI/AA/ML is one of the richest use case categories, with many high-value applications. According to Gartner, by 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated. Machine learning and AI unlock a range of business benefits for retail banks.

Automated, personalized decisions across the entire enterprise can increase competitiveness. The data backbone, the appropriate tools, and talent need to be in place to make this happen. Synthetic data generation is one of those capabilities essential for an AI-first bank to develop. The reliability and trustworthiness of AI is a neglected issue. According to Gartner:

65% of companies can't explain how specific AI model decisions or predictions are made. This blindness is costly. AI TRiSM tools, such as MOSTLY AI's synthetic data platform, provide the Trust, Risk and Security Management needed for effective explainability, ModelOps, anomaly detection, adversarial attack resistance and data protection. Companies need to develop these new capabilities to serve new needs arising from AI adoption.

From explainability to performance improvement, synthetic data generators are one of the most valuable building tools. Data science teams need synthetic data to succeed with AI and machine learning use cases. Here is how to use synthetic data in the most common AI banking applications.

CUSTOMER ACQUISITION AND ADVANCED ANALYTICS

CRM data is the single most valuable data asset for customer acquisition and retention. A wonderful, rich asset that holds personal data and behavioral data of the bank's future prospects. However, due to privacy legislation, up to 80% of CRM data tends to be locked away. Compliant CRM data for advanced analytics and machine learning applications is hard to come by. Banks either comply with regulations and refrain from developing a modern martech platform altogether, or break the rules and hope to get away with it. There is a third option. Synthetic customer data is as good as real when it comes to training machine learning models. Insights from this type of analytics can help identify new prospects and improve sign-up rates significantly.

MORTGAGE ANALYTICS, CREDIT DECISIONING AND LIMIT ASSESSMENT

AI in lending is a hot topic in finance. Banks want to reach out to the right people with the right mortgage and credit products. In order to increase precision in targeting, a lot of personal data is needed. The more complete the customer data profile, the more intelligent mortgage analytics becomes. Better models bring lower risk both for the bank and for the customer. Rule-based or logistic regression models rely on a narrow set of criteria for credit decision-making. Banks without advanced behavioral analytics and models underserve a large segment of customers. People lacking formal credit histories or deviating from typical earning patterns are excluded. AI-first banks utilize huge troves of alternative data sources. Modern data sources include social media, browsing history, telecommunications usage data, and more. However, using these highly personal data sources in their original form for training AI models is often a challenge. Legacy data anonymization techniques destroy the very insights the model needs. Synthetic data versions retain all of these insights. Thanks to the granular, feature-rich nature of synthetic data, lending solutions can use all the available intelligence.

RISK MANAGEMENT AND PRICING

Pricing and risk prediction models are some of the most important models to get right. Even a small improvement in their performance can lead to significant savings and/or higher revenues. Injecting additional domain knowledge into these models, such as synthetic geolocation data or synthetic text from customer conversations, significantly improves the model's ability to quantify a customer's propensity to default. MOSTLY AI's ability to provide the accuracy needed to generate synthetic geolocation data has been proven already. Synthetic text data can be used for training machine learning models in a compliant way on transcripts of customer service interactions. Virtual loan officers can automate the approval of low-risk loans reliably.

It is also mission-critical to be able to provide insight into the behavior of these models. Local interpretability is currently one of the most practical approaches to explainable AI, and synthetic data is a crucial ingredient of this transparency.
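As an illustration of the local interpretability idea, the sketch below explains a single credit decision by scoring synthetic perturbations of one customer record with a black-box model and fitting a distance-weighted linear surrogate on those scores; the surrogate's coefficients approximate which features drove that specific prediction. The stand-in model, data, and feature names are hypothetical, chosen only to make the example self-contained.

```python
# Local surrogate sketch: explain one prediction of a black-box risk model by
# fitting a weighted linear model on synthetic perturbations of that record.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Stand-in black-box model trained on illustrative data.
X_train = rng.normal(size=(1000, 4))
y_train = (X_train[:, 0] - 0.5 * X_train[:, 2] > 0).astype(int)
black_box = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# The single customer record we want to explain.
x0 = np.array([0.8, -0.2, 1.1, 0.3])

# Generate synthetic perturbations around x0 and score them with the black box.
perturbations = x0 + rng.normal(scale=0.3, size=(500, 4))
scores = black_box.predict_proba(perturbations)[:, 1]

# Weight perturbations by proximity to x0 and fit a local linear surrogate.
weights = np.exp(-np.linalg.norm(perturbations - x0, axis=1) ** 2)
surrogate = Ridge(alpha=1.0).fit(perturbations, scores, sample_weight=weights)

for name, coef in zip(["income", "age", "utilization", "tenure"], surrogate.coef_):
    print(f"{name:12s} local effect: {coef:+.3f}")
```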

FRAUD AND ANOMALY DETECTION

Fraud is one of the most interesting AI/ML use cases. Fraud and money laundering operations are incredibly versatile and get more sophisticated every day, and adversaries use plenty of automation of their own to find weaknesses in financial systems. Rule-based systems and manual follow-ups simply can't keep up. False positives are expensive to investigate, so it's imperative to continuously improve precision with machine learning models. To make matters even more challenging, fraud profiles vary widely between banks, so the same recipe for catching fraudulent transactions might not work for every financial institution. Using machine learning to detect fraud and anomaly patterns for cybersecurity is one of the first synthetic data use cases banks usually explore. The fraud detection use case goes way beyond privacy and takes advantage of the data augmentation possible during synthesization. Maurizio Poletto, CPO at Erste Group Bank, recommends synthetic data upsampling to improve model performance:

Synthetic data can be used to train AI models for scenarios for which limited data is available—such as fraud cases. We could take a fraud case using synthetic data to exaggerate the cluster, exaggerate the amount of people, and so on, so the model can be trained with much more accuracy. The more cases you have, the more detailed the model can be.

Training and retraining models with synthetic data can improve fraud detection model performance, leading to valuable savings on investigating false positives.
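A minimal sketch of the upsampling idea described above: the rare fraud class is augmented with additional synthetic fraud records before retraining, which typically improves recall on the minority class. The `generate_synthetic_fraud` function below is a placeholder for an actual generator (here it simply resamples known fraud cases with noise), and the dataset is simulated purely for illustration.

```python
# Sketch: augment the rare fraud class with synthetic records before training.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def generate_synthetic_fraud(fraud_df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Placeholder generator: jittered resampling of known fraud cases."""
    sample = fraud_df.sample(n, replace=True, random_state=0).copy()
    feature_cols = [c for c in sample.columns if c != "is_fraud"]
    sample[feature_cols] += rng.normal(scale=0.05, size=sample[feature_cols].shape)
    return sample

# Illustrative, heavily imbalanced transaction data (roughly 1% fraud).
data = pd.DataFrame(rng.normal(size=(20_000, 5)), columns=[f"f{i}" for i in range(5)])
data["is_fraud"] = (rng.random(20_000) < 0.01).astype(int)
data.loc[data.is_fraud == 1, "f0"] += 2.0  # give fraud a detectable signal

train, test = train_test_split(data, test_size=0.3, stratify=data.is_fraud, random_state=0)

# Upsample the fraud class with synthetic records, then retrain.
fraud_only = train[train.is_fraud == 1]
augmented = pd.concat([train, generate_synthetic_fraud(fraud_only, n=2_000)])

model = RandomForestClassifier(random_state=0)
model.fit(augmented.drop(columns="is_fraud"), augmented.is_fraud)
print(classification_report(test.is_fraud, model.predict(test.drop(columns="is_fraud"))))
```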

MONITORING AND COLLECTIONS

Transaction analysis for risk monitoring is one of the most privacy-sensitive AI use cases banks need to handle. Apart from traditional monitoring data, like repayment history and credit bureau reports, banks should be looking to utilize new data sources, such as time-series bank data, complete transaction histories, and location data. Machine learning models trained on these extremely sensitive datasets can reliably microsegment customers according to value at risk and introduce targeted interventions to prevent defaults. These highly sensitive and valuable datasets cannot be used for AI/ML training without effective anonymization. MOSTLY AI's synthetic data generator is one of the best on the market at synthesizing complex time-series, behavioral data, such as transactions, with high accuracy. Behavioral synthetic data is one of the most difficult synthetic data categories to get right, and without a sophisticated AI engine like MOSTLY AI's, the results won't be accurate enough for such use cases.
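As a simplified illustration of the microsegmentation step, the sketch below aggregates a (synthetic) transaction log into per-customer behavioral features and clusters the result. The column names, feature choices, and number of segments are assumptions made for the example.

```python
# Sketch: microsegment customers from (synthetic) transaction time series.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Illustrative synthetic transaction log: one row per transaction.
transactions = pd.DataFrame({
    "customer_id": rng.integers(0, 1_000, size=50_000),
    "amount": rng.lognormal(mean=3.0, sigma=1.0, size=50_000),
    "days_to_due_date": rng.integers(-10, 30, size=50_000),
})

# Aggregate per-customer behavioral features.
features = transactions.groupby("customer_id").agg(
    avg_amount=("amount", "mean"),
    spend_volatility=("amount", "std"),
    late_payment_rate=("days_to_due_date", lambda s: (s < 0).mean()),
).fillna(0)

# Cluster customers into microsegments, e.g. to target at-risk groups.
segments = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(features)
)
features["segment"] = segments
print(features.groupby("segment").mean().round(2))
```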

CHURN REDUCTION, SERVICING, AND ENGAGEMENT

Another high-value use case for synthetic behavioral data is customer retention. A wide range of tools can be put to good use throughout a customer's lifetime, from identifying less engaged customers to crafting personalized messages and product offerings. The success of those tools hinges on the level of personalization and accuracy the initial training data allows. Machine learning models excel at pattern recognition: their ability to identify microsegments no analyst would ever spot is astonishing, especially when fed with synthetic transaction data. Synthetic data can also serve as a bridge of intelligence between different lines of business: private banking and business banking data can be a powerful combination for further intelligence, but strictly in synthetic form. The same applies to national or legislative borders: analytics projects with global scope become realistic when the foundation is 100% GDPR-compliant synthetic data.

ALGORITHMIC TRADING

Financial institutions can use synthetic data to generate realistic market data for training and validating algorithmic trading models, reducing the reliance on historical data that may not always represent future market conditions. This can lead to improved trading strategies and increased profitability.
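As a toy illustration of validating a strategy on synthetic market data rather than a single historical path, the sketch below generates synthetic daily prices with geometric Brownian motion (a deliberately naive stand-in for a proper market data generator) and backtests a simple moving-average crossover strategy across many such scenarios. All parameters are illustrative.

```python
# Sketch: backtest a moving-average crossover strategy on synthetic price paths.
# Geometric Brownian motion is a naive stand-in for a market data generator.
import numpy as np

rng = np.random.default_rng(0)

def synthetic_price_path(n_days=750, s0=100.0, mu=0.05, sigma=0.2):
    """Generate one synthetic daily price path via geometric Brownian motion."""
    dt = 1 / 252
    returns = rng.normal((mu - 0.5 * sigma**2) * dt, sigma * np.sqrt(dt), n_days)
    return s0 * np.exp(np.cumsum(returns))

def backtest_crossover(prices, fast=20, slow=100):
    """Long when the fast moving average is above the slow one, else flat."""
    fast_ma = np.convolve(prices, np.ones(fast) / fast, mode="valid")
    slow_ma = np.convolve(prices, np.ones(slow) / slow, mode="valid")
    fast_ma = fast_ma[len(fast_ma) - len(slow_ma):]        # align the two averages
    signal = (fast_ma > slow_ma).astype(float)[:-1]        # position held the next day
    daily_returns = np.diff(prices[-len(slow_ma):]) / prices[-len(slow_ma):-1]
    return float(np.prod(1 + signal * daily_returns) - 1)  # total strategy return

# Evaluate the strategy across many synthetic scenarios, not just one history.
results = [backtest_crossover(synthetic_price_path()) for _ in range(200)]
print(f"median return: {np.median(results):+.2%}, "
      f"5th percentile: {np.percentile(results, 5):+.2%}")
```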

STRESS TESTING

Banks can use synthetic data to create realistic scenarios for stress testing, allowing them to evaluate their resilience to various economic and financial shocks. This helps ensure the stability of the financial system and boosts customer confidence in the institution's ability to withstand adverse conditions.
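A minimal sketch of scenario-based stress testing on a synthetic loan portfolio: default probabilities and loss severities are shocked under a few illustrative macro scenarios and the expected loss is recomputed. The portfolio columns and shock multipliers are assumptions, not calibrated parameters.

```python
# Sketch: stress test a (synthetic) loan portfolio under illustrative shock scenarios.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

portfolio = pd.DataFrame({
    "exposure": rng.uniform(10_000, 500_000, size=5_000),  # outstanding balance
    "pd": rng.beta(2, 50, size=5_000),                     # probability of default
    "lgd": rng.uniform(0.2, 0.6, size=5_000),              # loss given default
})

# Each scenario scales default probabilities and loss severities.
scenarios = {
    "baseline":        {"pd_mult": 1.0, "lgd_mult": 1.0},
    "mild_recession":  {"pd_mult": 1.5, "lgd_mult": 1.1},
    "severe_downturn": {"pd_mult": 2.5, "lgd_mult": 1.3},
}

for name, shock in scenarios.items():
    stressed_pd = np.clip(portfolio.pd * shock["pd_mult"], 0, 1)
    stressed_lgd = np.clip(portfolio.lgd * shock["lgd_mult"], 0, 1)
    expected_loss = (portfolio.exposure * stressed_pd * stressed_lgd).sum()
    print(f"{name:16s} expected loss: {expected_loss:,.0f}")
```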

Synthetic data for enterprise data sharing

Open financial data is the ultimate form of data sharing. According to McKinsey, economies embracing financial data sharing could see GDP gains of between 1 and 5 percent by 2030, with benefits flowing to consumers and financial institutions. More data means better operational performance, better AI models, more powerful analytics, and customer-centric digital banking products facilitating omnichannel experiences. The idea of open data cannot become a reality without a robust, accurate, and safe data privacy standard shared by all industry players in finance and beyond. This is a vision shared by Erste Group Bank's Chief Platform Officer:

Imagine if we in banking use synthetic data to generate realistic and comparable data from our customers, and the same thing is done by the transportation industry, the city, the insurance company, and the pharmaceutical company, and then you give all this data to someone to analyze the correlation between them. Because the relationship between well-being, psychological health, and financial health is so strong, I think there is a fantastic opportunity around the combination of mobility, health, and finance data.

It's an ambitious plan, and like all grand designs, it's best to start building the elements early. At this point, most banks are still struggling with internal data sharing, with distinct business lines acting as separate entities and guarding their data when open data is the way forward. Banks and financial institutions share little intelligence, citing data privacy and legislation as their main concerns. However, data sharing might become an obligation very soon, with the EU putting data altruism on the map in the upcoming Data Governance Act. While sharing personal data will remain tightly restricted, and increasingly so, anonymized data sharing will be expected of companies in the near future. In the U.S., healthcare insurance companies and service providers are already legally bound to share their data with other healthcare providers. The same requirement makes a lot of sense in banking, where so much depends on credit history and risk prediction. While some data is shared, intelligence is still withheld. Cross-border data sharing is also a major challenge in banking: subsidiaries either operate in a completely siloed way or share data illegally. According to Axel von dem Bussche, Partner at Taylor Wessing and IT lawyer, as much as 95% of international data sharing is illegal following the invalidation of the EU-US Privacy Shield by the Schrems II decision.

Some organizations fly analysts and data scientists to the off-shore data to avoid risky and forbidden cross-border data sharing. It doesn't have to be this complicated. Synthetic data sharing can be done in compliance with privacy laws across the globe. Setting up synthetic data sandboxes and repositories can solve enterprise-wide data sharing across borders, since synthetic data does not qualify as personal data. As a result, it is out of scope of the GDPR and of the infamous Schrems II ruling, which severely restricted transfers of personal data outside the EU.

Third-party data sharing within the same legislative domain is also problematic. Banks buy many third-party AI solutions from vendors without adequately testing them on their own data. The data used in procurement processes is hard to obtain, causing costly delays, and is heavily masked to prevent sensitive data leaking to third parties. The result is often bad business decisions and out-of-the-box AI solutions that fail to deliver the expected performance. Synthetic data sandboxes are great tools for speeding up and optimizing POC processes, saving 80% of the cost.

Synthetic test data for digital banking products

One of the most common data sharing use cases is developing and testing digital banking apps and products. Banks accumulate tons of apps, continuously developing them, onboarding new systems, and adding new components. Manually generating test data for such complex systems is a hopeless task, and many revert to the old, dangerous habit of using production data for testing. Banks and financial institutions tend to be more privacy-conscious, but their solutions to this conundrum are still suboptimal. Time and time again, we see reputable banks and financial institutions roll out apps and digital banking services after testing them only with heavily masked or manually generated data. One-cent transactions and mock data generators won't get you far when customer expectations for seamless digital experiences are sky-high.

To complicate things further, complex application development is rarely done in-house. Data owners and data consumers are not the same people, nor do they have the full picture of test scenarios and business rules. Labs and third-party dev teams rely on the bank to share meaningful test data with them, which simply does not happen. Even if testing is kept in-house, data access is still problematic. While in other, less privacy-conscious industries developers and test engineers use radioactive production data in non-production environments, banks leave testing teams to their own devices. Manual test data generation with tools like Mockaroo and the now infamous Faker library misses most of the business rules and edge cases so vital for robust testing practices, and dynamic test users for notification and trigger testing are also hard to come by. To put it simply, it's impossible to develop intelligent banking products without intelligent test data. The same goes for testing AI and machine learning models: testing them with synthetically simulated edge cases is extremely important, both when developing models from scratch and when recalibrating them to avoid drift. Models are only as good as their training data, and testing is only as good as the test data. Payment applications, with or without personalized money management features, need the synthetic approach: realistic synthetic test data and edge-case simulations with dynamic synthetic test users. Synthetic test data is fast to generate and can produce smaller or larger versions of the same dataset as needed throughout the testing pyramid, from unit testing through integration and UI testing to end-to-end testing.
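To make the point about sizing test data along the testing pyramid concrete, the sketch below derives differently sized slices of one synthetic dataset per test tier and always appends a few hand-crafted edge cases. The tier names, sizes, file names, and edge-case values are illustrative assumptions.

```python
# Sketch: derive differently sized test datasets from one synthetic dataset,
# one per tier of the testing pyramid, plus explicit edge cases.
import pandas as pd

synthetic = pd.read_csv("synthetic_transactions.csv")  # assumed synthetic export

TIER_SIZES = {"unit": 50, "integration": 5_000, "ui": 500, "e2e": 50_000}

edge_cases = pd.DataFrame([
    {"amount": 0.01, "currency": "EUR", "description": "one-cent transfer"},
    {"amount": 9_999_999.99, "currency": "JPY", "description": "amount overflow check"},
    {"amount": -250.00, "currency": "EUR", "description": "reversal / negative amount"},
])

def test_set(tier: str) -> pd.DataFrame:
    """Sampled synthetic slice for the given tier, with edge cases always included."""
    n = min(TIER_SIZES[tier], len(synthetic))
    return pd.concat([synthetic.sample(n, random_state=42), edge_cases],
                     ignore_index=True)

for tier in TIER_SIZES:
    test_set(tier).to_csv(f"testdata_{tier}.csv", index=False)
```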

Erste Bank's main synthetic data use case is test data management. The bank is creating synthetic segments and communities, building new features, and testing how certain types of customers would react to these features.

Normally, the data we use is static. We see everything from the past. But features like notifications and triggers—like receiving a notification when your salary comes in—can only be tested with dynamic test users. With synthetic data, you push a button to generate that user with an unlimited number of transactions in the past and a limited number of transactions in the future, and then you can put into your system a user which is alive.

These live synthetic users can stand in for production data and provide a level of realism unheard of before, while protecting customers' privacy. The Norwegian Data Protection Authority has issued a fine for using production data in testing, noting that using synthetic data instead would have been the right course to take.
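The "dynamic test user" idea from the quote above can be sketched as follows: a synthetic user carries a transaction history plus a handful of future-dated transactions (an incoming salary, an upcoming rent payment), so that time-based notification logic can be exercised against it. The record structure and the `should_notify_salary` rule are hypothetical, written only to illustrate the pattern.

```python
# Sketch: a dynamic synthetic test user with past and future-dated transactions,
# used to exercise time-based notification triggers. Field names are hypothetical.
from dataclasses import dataclass, field
from datetime import date, timedelta
import random

@dataclass
class Transaction:
    booking_date: date
    amount: float
    category: str

@dataclass
class SyntheticUser:
    user_id: str
    transactions: list[Transaction] = field(default_factory=list)

def make_dynamic_user(user_id: str, today: date) -> SyntheticUser:
    rng = random.Random(42)
    past = [Transaction(today - timedelta(days=d), round(rng.uniform(-200, 200), 2), "card")
            for d in range(1, 180)]
    # Future-dated events that notification logic should react to.
    future = [Transaction(today + timedelta(days=3), 2_800.00, "salary"),
              Transaction(today + timedelta(days=10), -950.00, "rent")]
    return SyntheticUser(user_id, past + future)

def should_notify_salary(user: SyntheticUser, as_of: date) -> bool:
    """Hypothetical trigger: notify when a salary transaction books on `as_of`."""
    return any(t.category == "salary" and t.booking_date == as_of for t in user.transactions)

user = make_dynamic_user("test-user-001", today=date(2023, 6, 1))
assert should_notify_salary(user, as_of=date(2023, 6, 4))  # salary arrives three days later
```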

Testing is becoming a continuous process. Deploying fast and iterating early is the new mantra of DevOps teams. Setting up CI/CD (continuous integration and delivery) pipelines for continuous testing cannot happen without a stable flow of high-quality test data. Synthetic data generators trained on real data samples can provide just that – up-to-date, realistic, and flexible data generation on-demand.

How to integrate synthetic data generators into financial systems?

First and foremost, it's important to understand that not all synthetic data generators are created equal, so selecting the right synthetic data vendor to match the financial institution's needs is particularly important. If a synthetic data generator is inaccurate, the resulting synthetic datasets can lead your data science team astray. If the generator overfits, learning the training data too well, it could accidentally reproduce some of the original information from the training data. Open-source options are also available, but quality control there is fairly limited. Until a global standard for synthetic data arrives, it's important to proceed with caution when selecting vendors: opt for synthetic data companies that already have extensive experience with sensitive financial data and know how to integrate synthetic data successfully with existing infrastructures.
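One way to sanity-check a generator for the overfitting risk described above is a distance-to-closest-record comparison: measure how close each synthetic row sits to its nearest training row, and compare that against the distances seen between training rows themselves. Synthetic rows sitting suspiciously close to real ones are a red flag. The sketch below shows this check on numeric data; it is a rough heuristic under simplified assumptions, not a formal privacy guarantee.

```python
# Sketch: distance-to-closest-record (DCR) check for synthetic data overfitting.
# If synthetic rows sit much closer to training rows than training rows sit to
# each other, the generator may be memorizing. A heuristic, not a guarantee.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def dcr_check(train: np.ndarray, synthetic: np.ndarray) -> None:
    scaler = StandardScaler().fit(train)
    train_s, synth_s = scaler.transform(train), scaler.transform(synthetic)

    # Distance of each training row to its nearest *other* training row.
    nn_train = NearestNeighbors(n_neighbors=2).fit(train_s)
    train_dcr = nn_train.kneighbors(train_s)[0][:, 1]

    # Distance of each synthetic row to its nearest training row.
    nn_synth = NearestNeighbors(n_neighbors=1).fit(train_s)
    synth_dcr = nn_synth.kneighbors(synth_s)[0][:, 0]

    print(f"median DCR train->train: {np.median(train_dcr):.3f}")
    print(f"median DCR synth->train: {np.median(synth_dcr):.3f}")
    if np.median(synth_dcr) < 0.5 * np.median(train_dcr):
        print("Warning: synthetic rows sit unusually close to training rows.")

rng = np.random.default_rng(0)
train = rng.normal(size=(2_000, 6))
synthetic = rng.normal(size=(2_000, 6))   # stand-in for a generator's output
dcr_check(train, synthetic)
```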

Synthetic data engineering in banking

The future of financial data is synthetic

Our team at MOSTLY AI works very closely with large banks and financial organizations. We know that synthetic data will be the data transformation tool that changes the financial data landscape forever, enabling the flow and agility necessary for creating competitive digital services. While the direction is clearly towards synthetic data across the enterprise, we also know how difficult it is to introduce new technologies and disrupt the status quo in enterprises, even when everyone can see the benefits. One of the most important tasks for anyone looking to make a difference with synthetic data is to prioritize use cases in accordance with the needs and possibilities of the organization. Analytics use cases with the biggest impact can serve as flagship projects, establishing the foundations of synthetic data adoption. In most organizations, mortgage analytics, pricing, and risk prediction use cases generate the highest immediate monetary value, while synthetic test data can massively accelerate the improvement of customer experience and reduce compliance and cybersecurity risk. It's good practice to establish semi-independent labs for experimentation and prototyping: Erste Bank's George Lab is a prime example of how successful digital banking products can be born of such ventures. The right talent is also a crucial ingredient of success. According to Erste Bank's CPO, Maurizio Poletto:

Talented data engineers want to spend 100% of their time in data exploration and value creation from data. They don't want to spend 50% of their time on bureaucracy. If we can eliminate that, we are better able to attract talent. At the moment, we may lose some, or they are not even coming to the banking industry because they know it's a super-regulated industry, and they won't have the same freedom they would have in a different industry.

Once a state-of-the-art tech stack enabling agile data practices is in place to attract that talent, you can start building cross-functional teams and capabilities across the organization. The data management status quo needs to be disrupted, and privacy, security, and data agility champions will do the groundwork. Legacy data architectures that hold banks and financial institutions back from innovating and endanger customers' privacy need to be dealt with soon. The future of data-driven banking is bright, and that future is synthetic.

Synthetic data in banking ebook

Would you like to know more about using synthetic data in banking?

Synthetic data is quickly becoming a critical tool for organizations to unlock the value of sensitive customer data while keeping their customers' privacy protected and staying compliant with data protection regulations such as the GDPR and CCPA. It can be generated quickly and in abundance, and it has been shown to drastically improve machine learning performance. As a result, it is often used for advanced analytics and AI training, such as predictive algorithms, fraud detection, and pricing models.

According to Gartner, by 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated.

MOSTLY AI pioneered the creation of synthetic data for AI model development and software testing. With things moving so quickly in this space, here are three trends that we see happening in AI and synthetic data in 2022:

1. Bias in AI will get worse before it gets better.

Most of the machine learning and AI algorithms currently in production, interacting with customers and making decisions about people, have never been audited for fairness and discrimination, and their training data has never been augmented to fix embedded biases. It is only through massive scandals that companies are finding out, and learning the hard way, that they need to pay more attention to biased data and to use fair synthetic data instead.
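A minimal example of the kind of audit most production models have never had: compute the positive-decision rate per protected group from a table of model decisions (demographic parity). The column names and values below are illustrative; a real audit would also examine error rates and other fairness criteria before deciding whether to rebalance the training data, for example with fair synthetic data.

```python
# Sketch: a basic demographic-parity audit of model decisions.
# Column names and values are illustrative.
import pandas as pd

decisions = pd.DataFrame({
    "gender":   ["F", "M", "F", "M", "F", "M", "F", "M"],
    "approved": [0,    1,   0,   1,   1,   1,   0,   1],
})

# Approval rate per protected group.
rates = decisions.groupby("gender")["approved"].mean()
print(rates)

# Demographic parity difference: the gap in approval rates between groups.
print(f"parity gap: {rates.max() - rates.min():.2f}")
```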

2. Companies will run out of usable customer data.

Regulations all over the world are getting stricter every day; many countries now have personal data protection legislation in place. Using customer data is getting increasingly difficult for other reasons too: people are more privacy-conscious and increasingly likely to refuse consent to the use of their data for analytics purposes. Companies are, quite simply, running out of relevant and usable data assets, and they will come to understand that synthetic data is the way out of this dilemma.

3. Synthetic data will be standardized with globally recognized benchmarks for privacy and accuracy.

Not all synthetic data is created equal. To start with, there is a world of difference between what we call structured and unstructured synthetic data: unstructured data means images and text, for example, while structured data is mainly tabular in nature. There are lots of open-source and proprietary synthetic data providers out there for both kinds, and the quality of their generators varies widely. It's high time to establish a synthetic data standard to make sure users get consistently high-quality synthetic data. We are already working on structured synthetic data standards.
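Until a formal standard exists, a simple per-column comparison between real and synthetic distributions is a practical starting point for judging quality. The sketch below uses the Kolmogorov-Smirnov statistic per numeric column; the data, columns, and threshold are illustrative assumptions rather than an established benchmark.

```python
# Sketch: quick per-column fidelity check between a real and a synthetic table
# using the Kolmogorov-Smirnov statistic (0 = identical distributions).
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
real = pd.DataFrame({"amount": rng.lognormal(3, 1.00, 5_000),
                     "age":    rng.normal(45, 12, 5_000)})
synthetic = pd.DataFrame({"amount": rng.lognormal(3, 1.05, 5_000),
                          "age":    rng.normal(44, 12, 5_000)})

for col in real.columns:
    stat = ks_2samp(real[col], synthetic[col]).statistic
    flag = "ok" if stat < 0.05 else "check"   # illustrative threshold
    print(f"{col:8s} KS statistic: {stat:.3f} ({flag})")
```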

