What is data scrubbing?
In the era of big data and data-driven decision-making, protecting the privacy of individuals remains a top priority for businesses and organizations. Data anonymization is a widely used method for achieving this; it aims to remove personally identifiable information (PII) from datasets. One term that comes up frequently is "data scrubbing", also referred to as "PII scrubbing". It gives the impression that it’s possible to just “wash off” personal information from a dataset as if it were some kind of dirt. Unfortunately, this is far from the truth, and it’s time we stopped using the term data scrubbing.
What is Personally Identifiable Information (PII)?
Personally Identifiable Information (PII) refers to any data that can be used to identify an individual, either directly or indirectly. Examples of direct personal identifiers are names, addresses, phone numbers, email addresses, social security numbers, and IP addresses. It is easy to see how this information can be used to directly identify an individual.
But PII is not limited to direct identifiers; it also includes indirect personal identifiers. These identifiers, also known as quasi-identifiers, are pieces of information that do not directly identify an individual but, when combined with other data, may lead to the identification of a person. They also pose a risk to privacy, as they can potentially be used to re-identify individuals. Here are some examples of indirect personal identifiers:
- Date of birth: While not unique on its own, when combined with other quasi-identifiers, such as postal code or gender, it can narrow down the search for an individual's identity.
- Postal code: A postal code by itself is not enough to identify a person, but when combined with other quasi-identifiers, it can help locate an individual within a specific area.
- Occupation: Knowing someone's occupation does not reveal their identity directly, but when combined with other information, it can significantly increase the risk of re-identification.
- Gender: Gender information alone cannot uniquely identify an individual. However, when combined with other quasi-identifiers, it may contribute to identifying a person.
- Ethnicity: Similar to gender, ethnicity data can be a contributing factor in re-identification when combined with other quasi-identifiers.
- Education level: Education level information, while not unique to an individual, can potentially lead to identification when combined with other data points.
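The effect of combining quasi-identifiers can be checked with a few lines of code. The following is a minimal sketch, assuming a tiny made-up table with three columns (`dob`, `postal_code`, `gender`); the records and the helper name `unique_fraction` are hypothetical and only serve to illustrate the idea. It measures what share of rows is unique for a given set of columns.

```python
from collections import Counter

# Hypothetical toy records, invented for illustration only.
records = [
    {"dob": "1980-03-14", "postal_code": "1010", "gender": "F"},
    {"dob": "1980-03-14", "postal_code": "1020", "gender": "F"},
    {"dob": "1980-03-14", "postal_code": "1010", "gender": "M"},
    {"dob": "1975-07-02", "postal_code": "1010", "gender": "F"},
    {"dob": "1975-07-02", "postal_code": "1020", "gender": "M"},
    {"dob": "1975-07-02", "postal_code": "1020", "gender": "M"},
]


def unique_fraction(rows, columns):
    """Return the share of rows whose values in `columns` occur exactly once."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


# Each column on its own, then all three combined.
for cols in (["dob"], ["postal_code"], ["gender"],
             ["dob", "postal_code", "gender"]):
    print(cols, unique_fraction(records, cols))
```

In this toy table no single column singles anyone out, yet the three columns together isolate four of the six records.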
The point is that while a date of birth or a postal code by itself is most likely not an issue, it is the combination of quasi-identifiers that carries the privacy risk. And it gets even more complicated.
In today's data-driven world, companies and organizations collect a vast number of data points per person. Social media activity, online purchases, browsing history, and location data are just a few examples. These data points, when combined with quasi-identifiers, can significantly increase the risk of re-identification, even in supposedly “anonymized” datasets.
As data collection continues to expand and more advanced data analysis techniques emerge, the big challenge is that virtually any data point must be considered a quasi-identifier.
How traditional data scrubbing works
Traditional data scrubbing techniques focus on removing or obfuscating PII from datasets. Some common methods include:
- Data Deletion: Simply deleting PII from a dataset.
- Data Masking: Replacing PII with random characters or symbols to maintain the original data format.
- Data Generalization: Grouping similar data points into categories or ranges to reduce the specificity of the data.
- Data Perturbation: Adding random noise to the original data to make it less accurate, while preserving its overall structure and utility.
- Data Swapping: Exchanging PII values between records to maintain statistical properties while breaking the link between the data and the individual.
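The methods listed above can be sketched in a few lines each. The following is a simplified illustration, not a production anonymization routine: all function names (`delete_fields`, `mask`, `generalize_age`, `perturb`, `swap_column`) and the sample record fields are assumptions made for this example, and masking here uses a fixed `*` symbol rather than random characters.

```python
import random


def delete_fields(record, fields):
    """Data deletion: drop the listed fields from a record."""
    return {k: v for k, v in record.items() if k not in fields}


def mask(value, keep_last=0):
    """Data masking: replace letters and digits with '*', keep punctuation
    so the original format survives; optionally keep the last characters."""
    chars = ["*" if ch.isalnum() else ch for ch in value]
    if keep_last:
        chars[-keep_last:] = value[-keep_last:]
    return "".join(chars)


def generalize_age(age, width=10):
    """Data generalization: map an exact age to a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def perturb(value, scale, rng):
    """Data perturbation: add Gaussian noise with standard deviation `scale`."""
    return value + rng.gauss(0, scale)


def swap_column(records, field, rng):
    """Data swapping: shuffle one field across records, so the column keeps
    its overall distribution but no longer lines up with the other fields."""
    values = [r[field] for r in records]
    rng.shuffle(values)
    return [{**r, field: v} for r, v in zip(records, values)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
people = [
    {"name": "A", "phone": "555-123-4567", "age": 37, "salary": 52000},
    {"name": "B", "phone": "555-987-6543", "age": 52, "salary": 61000},
    {"name": "C", "phone": "555-555-0000", "age": 29, "salary": 48000},
]
scrubbed = [
    {**delete_fields(p, {"name"}),
     "phone": mask(p["phone"], keep_last=4),
     "age": generalize_age(p["age"]),
     "salary": round(perturb(p["salary"], 1000, rng))}
    for p in people
]
print(scrubbed)
```

Note that every step trades detail for protection: the age band and the noisy salary are less precise than the originals, which is the data quality cost discussed next.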
The intent is good. Anonymization techniques try to help protect individual privacy while ensuring compliance with data protection regulations such as GDPR, CCPA, and HIPAA.
However, the cons of traditional anonymization and data scrubbing cannot be ignored anymore:
- Data Quality: Data scrubbing typically reduces the accuracy and granularity of the data, which can impact the insights derived from it.
- Time and Resource Intensive: Data scrubbing can be a complex and labor-intensive process, especially when dealing with large and diverse datasets.
- Re-identification Risk: Advanced data analysis and linkage techniques can potentially re-identify “anonymized” data, compromising individual privacy.
Take, for example, the case of the "Netflix Prize" competition, launched in 2006. Netflix released an anonymized dataset containing about 100 million movie ratings from roughly 480,000 subscribers, with the goal of improving its movie recommendation algorithm. While the dataset was stripped of any direct identifiers, researchers were able to re-identify individuals by cross-referencing the anonymized data with publicly available information from the Internet Movie Database (IMDb). They were able to uncover the personal preferences and movie ratings of some Netflix users, demonstrating the risk of combining quasi-identifiers with publicly available data.
Another famous example is the re-identification of Massachusetts Governor William Weld's medical records in the 1990s. The state had released “anonymized” hospital visit data, but a researcher was able to re-identify the Governor by combining the data with publicly available voter registration records. The researcher used the Governor's date of birth, gender, and ZIP code to pinpoint his medical records, which shows how carefully the release and use of datasets containing quasi-identifiers must be considered.
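The mechanics of such a linkage attack are simple. Below is a minimal sketch, assuming two made-up tables: a "released" dataset without names and a "public" register with names, both sharing date of birth, gender, and ZIP code. All records, field names, and the `link` helper are hypothetical and do not reproduce the data or method of either case described above.

```python
# Hypothetical "anonymized" release: no names, but quasi-identifiers remain.
released = [
    {"dob": "1950-01-15", "gender": "M", "zip": "02138", "diagnosis": "condition A"},
    {"dob": "1982-09-30", "gender": "F", "zip": "02139", "diagnosis": "condition B"},
]

# Hypothetical public register containing names and the same attributes.
public = [
    {"name": "Alex Example", "dob": "1950-01-15", "gender": "M", "zip": "02138"},
    {"name": "Sam Sample", "dob": "1982-09-30", "gender": "F", "zip": "02139"},
    {"name": "Kim Sample", "dob": "1982-09-30", "gender": "F", "zip": "02139"},
]


def link(released_rows, public_rows, keys):
    """Match each released row to the public register on `keys` and return
    (name, row) pairs only where exactly one person fits."""
    index = {}
    for person in public_rows:
        index.setdefault(tuple(person[k] for k in keys), []).append(person)
    matches = []
    for row in released_rows:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:
            matches.append((candidates[0]["name"], row))
    return matches


for name, row in link(released, public, ["dob", "gender", "zip"]):
    print(name, "->", row["diagnosis"])
```

Here the first released row matches exactly one person in the register and is re-identified, while the second row fits two people and stays ambiguous. The more quasi-identifiers a release contains, the fewer rows remain ambiguous.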
Synthetic data generation vs data scrubbing
These cases illustrate the two inherent limitations of data scrubbing techniques: reduced data quality and the potential for re-identification. The idea of cleansing data to get rid of personal information is fundamentally flawed in an era of big data. As technology continues to advance, organizations such as banks, insurance providers, and telcos must remain vigilant in exploring new approaches to data privacy.
One such approach is AI-generated synthetic data: realistic but completely fictitious data that can be analyzed without risking the privacy of individuals. Synthetic data is generated by machine learning algorithms that learn patterns from real data and then use these patterns to create new data that is statistically similar but contains no identifiable personal information.
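The "learn patterns, then sample new records" idea can be illustrated with a deliberately tiny model. The sketch below is an assumption-laden toy, not how MOSTLY AI or any production generator works: it learns how often each value follows the value in the previous column and samples new rows from those frequencies. The data, column names, and the `fit`, `draw`, and `sample` helpers are all hypothetical, and a model this simple can reproduce real rows, so it offers none of the privacy safeguards a real generator would add.

```python
import random
from collections import Counter, defaultdict

# Hypothetical training data, invented for illustration only.
real_rows = [
    {"region": "north", "plan": "basic", "churned": "no"},
    {"region": "north", "plan": "premium", "churned": "no"},
    {"region": "south", "plan": "basic", "churned": "yes"},
    {"region": "south", "plan": "basic", "churned": "no"},
    {"region": "south", "plan": "premium", "churned": "no"},
]
columns = ["region", "plan", "churned"]


def fit(rows, cols):
    """Learn the frequencies of the first column and, for each later column,
    the frequencies of its values given the previous column's value."""
    first = Counter(row[cols[0]] for row in rows)
    transitions = []
    for prev, cur in zip(cols, cols[1:]):
        table = defaultdict(Counter)
        for row in rows:
            table[row[prev]][row[cur]] += 1
        transitions.append(table)
    return first, transitions


def draw(counter, rng):
    """Draw one value with probability proportional to its count."""
    return rng.choices(list(counter), weights=list(counter.values()), k=1)[0]


def sample(model, cols, n, rng):
    """Generate n new rows, column by column, from the learned frequencies."""
    first, transitions = model
    rows = []
    for _ in range(n):
        row = {cols[0]: draw(first, rng)}
        for prev, cur, table in zip(cols, cols[1:], transitions):
            row[cur] = draw(table[row[prev]], rng)
        rows.append(row)
    return rows


rng = random.Random(42)  # fixed seed so the sketch is repeatable
model = fit(real_rows, columns)
synthetic = sample(model, columns, 10, rng)
print(synthetic[:3])
```

Real generators replace the frequency tables with far richer models that capture relationships across all columns at once, but the workflow is the same: fit on real data, then sample as many new rows as needed.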
AI-generated synthetic data has the potential to revolutionize how organizations handle sensitive data. The use of synthetic data can help organizations overcome the limitations of traditional data scrubbing techniques, which often require significant time and resources to execute effectively. Moreover, AI-generated synthetic data can be used to create new datasets that are larger, more diverse, and more representative of the real-world population, making it an excellent resource for machine learning and other data-driven applications.
Synthetic data can be generated quickly and efficiently. In contrast, traditional data scrubbing techniques can be time-consuming and labor-intensive, requiring trained professionals to review and manually edit datasets. The use of synthetic data can also help reduce the risk of data breaches since it is not associated with any real individual's personal information.
Another advantage of synthetic data is that it can help address the issue of data bias, which can occur when a dataset does not accurately represent the real-world population. This can happen, for example, when a dataset only includes data from a specific geographic location or demographic group. By generating synthetic data that is statistically similar to the original data but includes a more diverse range of attributes, organizations can reduce bias and ensure that their analyses are more representative of the broader population.
If you need to generate high-quality synthetic data quickly and efficiently, MOSTLY AI is the pioneering leader in synthetic data generation, offering unparalleled accuracy. Additionally, MOSTLY AI takes data privacy and security seriously, so you can be confident that your data is safe and secure. Interested in taking it for a spin? Sign up now to generate up to 100K rows of high-quality synthetic data daily for FREE.