Why Synthetic Data?
Why Classic Anonymization Fails for Big Data
It Destroys Valuable Information
In an attempt to protect privacy, classic anonymization techniques (like data masking or obfuscation) destroy most of the valuable information in your datasets. This significantly reduces their utility for sophisticated AI and big data use cases.
It Makes It Easy to Re-identify Data Subjects
In the era of big data, classic anonymization techniques fail to protect against de-anonymization. Researchers have demonstrated over and over again how easy it is to re-identify data subjects in these supposedly anonymous datasets. For example, 80% of credit card owners can be re-identified by only 3 transactions. Thus, relying on these outdated techniques puts your business at regulatory, reputational, and financial risk.
of mobile phone owners are re-identified simply by 2 antenna signals, even when coarsened to the hour of the day
80% of credit card owners are re-identified by 3 transactions, even when only the merchant and the date of the transaction are revealed
of all people are re-identified, merely by their date-of-birth, their gender and their ZIP code of residence
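Findings like these come down to how unique a combination of seemingly harmless attributes is. Here is a minimal Python sketch of that measurement. The records are made-up toy data, and `unique_fraction` is a hypothetical helper name rather than part of any product or library.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Share of records whose combination of quasi-identifier values occurs exactly once."""
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


# Hypothetical toy records: names are masked, all other attributes are left intact.
records = [
    {"name": "***", "birth_date": "1984-03-12", "gender": "F", "zip": "10115"},
    {"name": "***", "birth_date": "1984-03-12", "gender": "F", "zip": "10117"},
    {"name": "***", "birth_date": "1990-07-01", "gender": "M", "zip": "10115"},
    {"name": "***", "birth_date": "1990-07-01", "gender": "M", "zip": "10115"},
]

# A single attribute singles out nobody in this toy set ...
gender_only = unique_fraction(records, ["gender"])
# ... but the combination of attributes makes half of the records unique.
combined = unique_fraction(records, ["birth_date", "gender", "zip"])
print(gender_only, combined)
```

Any record that is unique on such a combination can be linked to a named person by anyone who holds a second dataset with the same attributes. That is why masking the name column alone offers little protection.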
Personal Data Assets Are Locked Up
Keeping their customers' privacy safe and secure is of utmost importance to conscientious organizations. In addition, the fear of privacy breaches and of GDPR fines of up to €20 million or 4% of global annual turnover, whichever is higher, leads to privacy-sensitive data assets being strictly locked away.
But this severely hampers data-driven innovation and collaboration. The status quo in most industries is that it takes 6-8 months to get access to customer data, resulting in high costs due to case-by-case approvals, expensive project delays, and missed opportunities.
But how should an organization ever become data-driven and customer-centric if it can’t freely collaborate and innovate on top of its customer data?
Synthetic Data reconciles Data Innovation with Data Privacy
Synthetic Data is as-good-as-real
Advances in machine learning enable the generation of highly realistic and highly representative synthetic datasets that resemble the characteristics as well as the diversity of actual people. Synthetic data generated with Mostly GENERATE is capable of retaining ~99% of the value and information of your original datasets. This unprecedented accuracy allows synthetic data to be used as a replacement for actual, privacy-sensitive data in a multitude of AI and big data use cases.
Synthetic Data is fully anonymous
The pitfall of classic anonymization techniques is that they mask or obfuscate only parts of the data while leaving everything else intact. But in the era of big data, there is no non-sensitive attribute – and leaving information intact provides a target for adversaries to perform de-anonymization attacks.
Synthesizing data, on the other hand, is a fundamentally different approach to big data anonymization. Instead of changing an existing dataset, a deep neural network automatically learns all the structures and patterns in the actual data. Once this training is completed, the model leverages the obtained knowledge to generate new synthetic data from scratch. This artificially generated data is highly representative, yet completely anonymous. As it does not contain any one-to-one relationships to actual data subjects, the risk of re-identification is successfully eliminated.
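The learn-then-generate idea can be illustrated with a deliberately tiny Python sketch. It is an assumption-laden toy. A two-column Gaussian model stands in for the deep neural network, the `fit` and `generate` names and the sample values are hypothetical, and it is not how Mostly GENERATE works internally. It only shows the workflow: learn statistics from the actual data, then draw new rows from the model instead of editing existing ones.

```python
import math
import random
from statistics import mean, stdev


def fit(rows):
    """Learn summary statistics (the toy 'model') from two-column numeric data."""
    xs, ys = zip(*rows)
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def generate(model, n, rng):
    """Draw n new rows from the fitted model; no original row is copied or modified."""
    rho = model["rho"]
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        y = model["my"] + model["sy"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


# Hypothetical (age, yearly income) pairs standing in for actual customer data.
actual = [(23, 31000), (35, 48000), (41, 52000), (52, 61000), (29, 39000), (60, 58000)]

model = fit(actual)
synthetic = generate(model, 1000, random.Random(0))

# The synthetic rows should show roughly the same correlation as the actual ones.
print(round(model["rho"], 2), round(fit(synthetic)["rho"], 2))
```

A production-grade synthesizer has to capture far richer structure than two means and one correlation, and it needs additional safeguards before its output can be considered anonymous. The sketch makes no privacy claim of its own.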