
It’s 2020, and I’m reading a 10-year-old report by the Electronic Frontier Foundation about location privacy that is more relevant than ever. Anticipating how prevalent the bulk collection of location data would become, the authors discussed the possible threats to our privacy as well as solutions that would limit this unrestricted collection while still allowing us to reap the benefits of GPS-enabled devices.

Some may argue that companies and governments should stop collecting sensitive and personal information altogether. But that is unlikely to happen, and a lot of data is already in the system, changing hands and getting processed and analyzed over and over again. This can be quite unnerving, but most of us do enjoy the benefits of this information sharing, like receiving tips on shortcuts through the city during the morning rush hour. So how can we protect the individual’s rights and make responsible use of location data?

Until very recently, the main approach was to anonymize these data sets: hide rare features, add noise, aggregate exact locations into rough regions, or publish only summary statistics (for a great technical yet accessible overview, I can recommend this survey). Cryptography also offers tools to keep most of the sensitive information on-device and transmit only codes that compress the use-case-relevant information. However, these techniques involve a steep trade-off between privacy and the accuracy and utility of the modified data. And even after this preprocessing, if the data retains any of its utility, the risk of successfully re-identifying an individual remains extremely high.
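To make that trade-off concrete, here is a minimal, purely illustrative sketch of two classic location-anonymization steps: spatial generalization (snapping coordinates to a coarse grid) and noise addition. The grid size and noise scale are arbitrary choices of mine, not taken from any of the cited works.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# A few example GPS points (latitude, longitude) in Porto.
points = np.array([
    [41.1579, -8.6291],
    [41.1496, -8.6109],
    [41.1621, -8.6220],
])

# 1) Generalization: snap coordinates to a coarse grid (roughly 1 km here).
generalized = np.round(points, decimals=2)

# 2) Perturbation: add Laplace noise to each coordinate
#    (the scale of ~0.001 degrees is an arbitrary, illustrative choice).
noisy = points + rng.laplace(loc=0.0, scale=0.001, size=points.shape)

print(generalized)
print(noisy)
```

Both steps blur the exact positions, yet neither reliably prevents re-identification once enough points per person are available, which is exactly the problem described below.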

D.N.A. is probably the only thing that’s harder to anonymize than precise geolocation information.

Paul Ohm, law professor and privacy researcher at the Georgetown University Law Center. Source: NYT

Time and again, we see how subpar anonymization leads to high-profile privacy leaks, and this is not surprising, especially for mobility data: in a paper published in Scientific Reports, researchers showed that 95% of the population can be uniquely identified from just four time-location points in a data set with hourly records and spatial resolution given by the carrier’s antennas. In 2014, the publication of supposedly safe pseudonymized taxi trips allowed data scientists to find out where celebrities like Bradley Cooper or Jessica Alba were heading by querying the data based on publicly available photos. Last year, a series of articles in the New York Times highlighted this issue again: from a data set with anonymized user IDs, the journalists pinpointed the homes and movements of US government officials and easily re-identified and tracked even the president.

Mostly AI’s solution for privacy-preserving data publishing is to go synthetic. We develop AI-based methods for modeling complex data sets and then generate statistically representative artificial records. The synthetic data contains only made-up individuals and can be used for testing and development, analysis, AI training, or really any downstream task. Below, I will explain this process and showcase our work on a real-world mobility data set.

The Difficulties With Mobility Data

What is a mobility data set in the first place? In its simplest form, we might be talking about single locations only, where each record is a pair of latitude and longitude coordinates. In most situations, though, we have trips or trajectories, which are sequences of locations.

Fig. 1: A piece of the Porto Taxi data set

The trips are often tagged by a user ID and the records can include timestamps, device information, and various use-case specific attributes. 

The aim of our new Mobility Engine is to capture and reproduce the intricate relationship between these attributes. But there are numerous issues that make mobility data hard to model.

Our solution can overcome all these difficulties without the need to compromise on accuracy, privacy or utility. Let me demonstrate.

Generating Synthetic Trajectories

The Porto Taxi data set is a public collection of a year’s worth of trips by 442 cabs running in the city of Porto, Portugal. The records are trips: sequences of latitude/longitude coordinates recorded at 15-second intervals, with some additional metadata (such as driver ID or time of day) that we won’t consider here. There are short and long trips alike, and some of the trajectories are missing a few locations, so there can be rather big jumps in them.
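If you want to follow along, loading the trips could look roughly like this. The file path and column names (a POLYLINE column holding a JSON list of [longitude, latitude] pairs, and a MISSING_DATA flag for incomplete trips) are assumptions based on the public CSV export of the data set and should be checked against your own copy.

```python
import json
import pandas as pd

# Path and column names are assumptions based on the public export of the
# Porto Taxi data set; adjust them to your local copy.
df = pd.read_csv("porto_taxi_train.csv")

# Parse the JSON-encoded list of [longitude, latitude] points per trip.
df["trip"] = df["POLYLINE"].apply(json.loads)

# Drop empty trips and trips flagged as having missing GPS readings.
df = df[df["trip"].str.len() > 0]
if "MISSING_DATA" in df.columns:
    df = df[~df["MISSING_DATA"]]

# Each point is 15 seconds apart, so trip duration is easy to derive.
df["duration_s"] = (df["trip"].str.len() - 1) * 15
print(df[["duration_s"]].describe())
```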

Fig. 2: Two hundred randomly selected taxi rides in Porto

Given this data, we had our Mobility Engine sift through the trajectories multiple times and learn the parameters of a statistical process that could have generated such a data set. Essentially, our engine learns to answer questions such as where a trip that has unfolded in a particular way so far is most likely to continue next.

You can imagine that if you are able to answer a few million of these questions, then you have a good idea of what the traffic patterns look like. At the same time, you would not learn much about any single real individual’s mobility behavior. Similarly, the chance that our engine reproduces exact trips that occur in the real data set, which in turn could hurt someone’s privacy, is astronomically small.
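To give a flavor of what learning such a generative process means, here is a deliberately over-simplified sketch: a first-order Markov model over coarse grid cells that merely counts which cell tends to follow which, and then samples new trips from those counts. This is not our Mobility Engine, which captures far richer dependencies, but it illustrates the idea of answering millions of "where next?" questions.

```python
import random
from collections import defaultdict

def to_cell(lat, lon, precision=3):
    """Discretize a coordinate into a coarse grid cell (~100 m here)."""
    return (round(lat, precision), round(lon, precision))

def fit_transitions(trips):
    """Count cell-to-cell transitions over all training trips."""
    counts = defaultdict(lambda: defaultdict(int))
    starts = []
    for trip in trips:  # trip: list of (lat, lon) points
        cells = [to_cell(lat, lon) for lat, lon in trip]
        starts.append(cells[0])
        for a, b in zip(cells, cells[1:]):
            counts[a][b] += 1
    return starts, counts

def sample_trip(starts, counts, max_len=50):
    """Generate one synthetic trip by repeatedly asking 'where next?'."""
    cell = random.choice(starts)
    trip = [cell]
    for _ in range(max_len):
        next_cells = counts.get(cell)
        if not next_cells:
            break
        cells, weights = zip(*next_cells.items())
        cell = random.choices(cells, weights=weights, k=1)[0]
        trip.append(cell)
    return trip
```

A production engine replaces the counting with a deep sequence model conditioned on much more context, but the generative principle, sampling the next location given what came before, stays the same.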

For this post, we trained on 1.5 million real trajectories and then had the model generate synthetic trips. We produced 250,000 artificial trips for the following analysis, but with the same trained model, we could just as easily have built 250 million.

Fig. 3: A taxi trip generated by our Mobility Engine

First, for a quick visual, we plotted 200 real trips recorded by real taxi drivers (on the left, in blue) and 200 of the artificial trajectories that our model generated (on the right, in green). As you can see, the overall picture is rather convincing with the high-level patterns nicely preserved.

Fig. 4: Two hundred real (left) and two hundred synthetic trips (right)

Looking more closely, we see very similar noise in the real and synthetic trips on the street level.

Fig. 5: A street-level view of real (left) and synthetic trips (right)

In general, we keep three main things in mind when evaluating synthetic data sets: accuracy, utility, and privacy.

As for accuracy, we compare the real and synthetic data across several location- and trip-level metrics. First, we require the model to accurately reproduce the location densities, that is, the share of recordings that fall in a given spatial area, and the hot-spots at different granularities. There are plenty of open-source spatial analysis libraries that can help you work with location data, such as skmob, geopandas, or Uber’s H3, which we used to generate the hexagonal plots below. The green-yellow-red transition marks how the city center is visited more frequently than the outskirts, with a clear hot-spot in the red region.

Fig. 6: Visualizing hot-spots (red), frequently visited areas (yellow) and less often visited regions (green and translucent)
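The hexagonal densities can be computed with a few lines of h3 and pandas. The snippet below assumes h3-py version 3, where the cell-assignment function is called geo_to_h3 (version 4 renamed it to latlng_to_cell), and a DataFrame with lat and lon columns.

```python
import pandas as pd
import h3  # assumes h3-py v3; in v4 use h3.latlng_to_cell instead

def hex_density(points: pd.DataFrame, resolution: int = 8) -> pd.Series:
    """Share of recorded locations falling into each H3 hexagon."""
    cells = points.apply(
        lambda row: h3.geo_to_h3(row["lat"], row["lon"], resolution), axis=1
    )
    return cells.value_counts(normalize=True)

# real_density = hex_density(real_points)
# synth_density = hex_density(synthetic_points)
# Comparing the two Series (e.g. their correlation) quantifies how well the
# synthetic data reproduces the hot-spots at this resolution.
```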

From the sequences of latitude and longitude coordinates, we derive various features such as trip duration and distance traveled, origin-destination distance, and a number of jump length statistics.  

Fig. 7: The synthetic and real data follows the exact same distribution across various trip-level metrics

The plots above show two of these distributions for both the real and the synthetic trips. The fact that these distributions overlap almost perfectly shows that our engine is spot-on at reproducing the hidden relationships in the data. To capture a different aspect of behavior, we also consider geometric properties of the paths: the radius of gyration measures how far, on average, the trip is from its center of mass, and the straightness index is the ratio of the origin-destination distance to the full traveled distance. So, for a straight line, the index is exactly 1, while for more curvy trips it takes lower values, with a round trip corresponding to a straightness index of 0. We again see that the synthetic data follows the exact same trends as the real one, even mimicking the slight increase in the straightness distribution from 0.4 to 1. I should stress that we get this impressive performance without ever optimizing the model for these particular metrics, so we would expect similarly high accuracy for other, so far untested, features.

Fig. 8: From a geometric point of view, the real and synthetic trips still look the same
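As a sketch of how such features can be derived from the raw coordinate sequences, the helper below computes traveled distance, origin-destination distance, straightness index, and radius of gyration. The 15-second sampling interval comes from the data set description; the function names and the exact variant of the radius of gyration (here, the average distance from the trip’s center of mass) are my own choices.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000

def haversine(p, q):
    """Great-circle distance in meters between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (*p, *q))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

def trip_features(trip, interval_s=15):
    """trip: list of (lat, lon) points recorded every `interval_s` seconds."""
    jumps = [haversine(a, b) for a, b in zip(trip, trip[1:])]
    traveled = sum(jumps)
    od_distance = haversine(trip[0], trip[-1])
    # Straightness index: 1 for a straight line, close to 0 for a round trip.
    straightness = od_distance / traveled if traveled > 0 else 1.0
    # Radius of gyration: average distance of the points from the trip's
    # center of mass.
    center = np.mean(np.array(trip), axis=0)
    gyration = np.mean([haversine(p, center) for p in trip])
    return {
        "duration_s": (len(trip) - 1) * interval_s,
        "distance_m": traveled,
        "od_distance_m": od_distance,
        "straightness": straightness,
        "radius_of_gyration_m": gyration,
        "max_jump_m": max(jumps) if jumps else 0.0,
    }
```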

Regarding utility, one way we can test-drive our synthetic mobility data is to use it in practical machine learning scenarios. Here, we trained three models to predict the duration of a trip based on the origin and destination: a Support Vector Machine, a K-nearest-neighbor regressor, and an LGBM regressor.

Fig. 9: The synthetically trained models achieve highly competitive scores and show the same relative ranking as the models trained on real data.

We trained the synthetic models only on synthetic data and the real models only on the original trajectories. The scores in the plot come from testing all the models against a holdout from the original, real-life data set. As expected, the synthetically trained models performed slightly worse than the ones that had seen real data, but they still achieved highly competitive performance.
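The train-on-synthetic, test-on-real setup described above can be sketched as follows. The three model types match the ones listed, but the error metric (mean absolute error), the hyperparameters, and the placeholder arrays are illustrative assumptions rather than our exact evaluation code.

```python
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Placeholder arrays: in the real experiment, X_* would hold origin and
# destination coordinates per trip and y_* the trip duration in seconds.
rng = np.random.default_rng(0)
X_real, y_real = rng.normal(size=(1000, 4)), rng.normal(size=1000)
X_synth, y_synth = rng.normal(size=(1000, 4)), rng.normal(size=1000)
X_test, y_test = rng.normal(size=(300, 4)), rng.normal(size=300)

models = {
    "SVR": SVR(),
    "KNN": KNeighborsRegressor(n_neighbors=10),
    "LGBM": LGBMRegressor(n_estimators=200),
}

for name, model in models.items():
    real_mae = mean_absolute_error(y_test, model.fit(X_real, y_real).predict(X_test))
    synth_mae = mean_absolute_error(y_test, model.fit(X_synth, y_synth).predict(X_test))
    print(f"{name}: trained-on-real MAE {real_mae:.2f}, trained-on-synthetic MAE {synth_mae:.2f}")
```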

Moreover, imagine you are an industry-leading company that shared this safe synthetic mobility data in a competition or hackathon to solve this prediction task. If the teams came up with the solutions shown above and achieved the green error rates on the public, synthetic data, you could rightly infer that the winning solution would also perform best on your real but sensitive data set that the teams have never seen.

The privacy evaluation of the generated data is always a critical and complicated issue. Already during the training phase, we have controls in place to stop the learning process and thus prevent overfitting. This ensures that the model learns the general patterns in the data rather than memorizing exact trips or overly specific attributes of the training set. Second, we also compare how far our synthetic trips fall from the closest real trips in the training data. In the plot below, we show a single synthetic trip in red, with the purple trips being the three closest real trajectories, and the blue and green paths sampled from the 10th to 20th and 50th to 100th closest real trips.

Fig. 10: The red synthetic trip is surrounded by the most similar real-life trajectories

To quantify privacy, we look at the distribution of closest distances between real and synthetic trips. To obtain a baseline distribution, we repeat this calculation using a holdout of original but so far unused trips instead of the synthetic ones. If the real-vs-synthetic and real-vs-holdout distributions differ heavily, in particular if the synthetic data falls closer to the real samples than we would expect based on the holdout, that indicates an issue with the modeling. For example, if we simply added noise to real trajectories, this comparison would clearly flag the data set for a privacy leak. The data generated by our Mobility Engine, however, passed these tests with flying colors.
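A simplified version of this distance-based check could look like the sketch below: each trip is reduced to a small feature vector, and the distribution of nearest-neighbor distances from the real training trips to the synthetic set is compared against the same distances to a real holdout set. The trip summary and distance measure here are deliberately crude stand-ins for the richer comparison we actually run.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def trip_vector(trip):
    """Crude fixed-length summary of a trip: origin, destination and length."""
    (olat, olon), (dlat, dlon) = trip[0], trip[-1]
    return [olat, olon, dlat, dlon, len(trip)]

def nearest_distances(train_trips, other_trips):
    """For each training trip, distance to its closest trip in `other_trips`."""
    nn = NearestNeighbors(n_neighbors=1).fit([trip_vector(t) for t in other_trips])
    dists, _ = nn.kneighbors([trip_vector(t) for t in train_trips])
    return dists.ravel()

# d_synth = nearest_distances(train_trips, synthetic_trips)
# d_holdout = nearest_distances(train_trips, holdout_trips)
# If d_synth is systematically smaller than d_holdout, the synthetic trips sit
# suspiciously close to real training trips and the model may be memorizing.
```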

Conclusions

We should be loud about not compromising on privacy, no matter the benefits offered by sharing our personal information. Governments, operators, service providers, and others alike need to take privacy seriously and invest in technology that protects the individual’s rights during the whole life cycle of the data. We at Mostly AI believe that synthetic data is THE way forward for privacy-preserving data sharing. Our new Mobility Engine allows organizations to fully utilize sensitive location data by producing safe synthetic data at a previously unseen level of accuracy.

Acknowledgment

This research and development on synthetic mobility data is supported by a grant of the Vienna Business Agency (Wirtschaftsagentur Wien), a fund of the City of Vienna.

Why Privacy Matters

“Knowledge is power” should be a familiar quote for most of us. But did you know that in the earliest documented occurrence of this phrase, from the 7th century, the original sentence finishes with “…and it can command obedience”? If we apply this idea to personal information, it’s easy to understand why privacy matters. Nobody wants to be on the receiving end.

Today, privacy is recognized by the UN as one of the fundamental human rights and is protected through international and national legislation. Not complying with privacy regulations can be very costly, if not an outright business-threatening risk, for organizations working with their customers’ data.

Should We Trust Blindly?

Never before has it been as difficult as it is now to preserve individuals’ privacy. People have lost track of, and control over, what kinds of personal data are being collected about them, how much of it, why, where it is stored, and to whom it is sold on. From “1000 songs in our pockets” (Apple’s first iPod slogan) we have moved to an endless number of apps and trackers in our pockets, recording our daily activities. In our homes as well as in public spaces, devices with sensors and meters record every step and action we take.

As an individual, it’s wise to question whether the consent given to a data controller is compliant with current privacy regulations, because sometimes data controllers turn to the Dark Side: collecting unnecessary data, sharing data with third parties, failing to protect data appropriately, or using it in a discriminatory way against the data subjects.

The Industry’s Dirty Secret About Privacy

But even the data controllers who are not crossing the legal red lines are struggling more and more to protect their customers’ data, at rest as well as in motion. Their life would be much easier if customer data could simply be locked up in a safe. But data, if not the new oil of our economy, is certainly the lubricant that keeps the business machine running. This means that every company needs to expose its customers’ data for a range of purposes, from analytics and reporting to development, testing, and machine learning.

Often it is not bad intent but the simple inability of data controllers to comply with privacy regulations, because the instruments at their disposal cannot securely anonymize data without destroying all of its utility. I remember many discussions with data security officers at different companies who admitted that their current processes aren’t compliant, but as long as there is no technical fix, they can’t stop the machine that’s running.

What do those data security officers mean when they say that there’s no technical fix for privacy compliance? Although there are many different methods to protect data, like cryptography, tokenization, and anonymization, all of them have weak points when it comes to data protection.

Sometimes the weak point is the human: if cryptographic keys are compromised, privacy is lost. Sometimes, as in the case of classic anonymization techniques, the weak point is the technology. What more and more people realize is that classic anonymization (randomization, generalization, …) is both an optimization problem and an uphill battle. When original data is modified, it loses some of its information value and therefore its utility. The more protection, the less utility, and vice versa. The data anonymization engineer is continuously balancing the legal requirements to protect the privacy of the data subjects against the business requirements to extract value from the data.

But in the age of big data, with millions of records and thousands of attributes, this optimization is becoming almost impossible to achieve. The more data points are collected about individuals, the easier it gets to re-identify them even in large databases. For example, 80% of all credit card owners are re-identified by 3 transactions, even when only the merchant and the date of each transaction are revealed.

The anonymization trade-off has shifted to a paradoxical state in which only useless data is sufficiently protected data. So if you have been using anonymized, granular-level data for analytics or testing, the odds are high that a significant fraction of this data is actually re-identifiable and that the privacy of your customers isn’t adequately protected.

This dirty secret is causing sleepless nights in legal and security departments across various industries. What’s even scarier is that some organizations aren’t even aware of it. And the only thing worse than not protecting your customers’ privacy is acting under the false impression that you are, which leads organizations to let their guard down and freely share poorly anonymized information internally and externally.

Trouble is on the horizon. In 2019, GDPR fines totaled almost 500 million euros, and it wouldn’t surprise me if that amount doubled in 2020. But legal fines aren’t the biggest trouble awaiting data controllers.

What Happens When Privacy Is Breached?

Nowadays, data breaches happen more frequently, even when access to sensitive data is restricted. According to IBM, the odds of experiencing a data breach have risen by 31% over the last five years, to a 29% likelihood of suffering a breach within the next two years. Data snoopers are the bank robbers of the 21st century. The average total cost of a data breach runs into millions of euros, and regulatory fines are only part of it. The biggest cost factor of a data breach is the loss of reputation, which directly leads to the loss of business.

This Sunday, a story was published in the British media containing all the ingredients of a perfect privacy nightmare: a data breach of a governmental database holding personal information on about 28 million children. The breach happened through the access that the UK government had granted to an educational and training provider.

Just imagine what horrible things could be done with that data. Now go through your diabolical list and strike out “selling children’s private data to betting firms” – this task has already been accomplished. How successfully? According to UK media, betting companies were able to significantly increase the number of children passing their identity checks and used the stolen data to increase the number of young people who gamble online. The house always wins.

And there’s no sign that the winning streak of data adversaries will break anytime soon. Every year, the number of data breaches and the volume of exposed sensitive data keep rising. With a total of 550 million leaked records last year, the 28 million records of the recent breach in the UK will hardly break into the top five data breaches of 2019.

The types of breaches range from incompetence, like accidental exposure, employee error, improper disposal, or loss, to intentional theft, like hacking, insider theft, unauthorized access, and even physical data theft.

On the dark web, the data is offered to other criminals, or even to legal entities like the betting companies in the example above, as well as to social media companies that, among other things, use such data to check whether their own users’ passwords have been indirectly exposed.

What Are The Alternatives?

The list of things that can be done to reduce privacy risk is long. When data is in motion, for example during business process operations, data controllers will continue to use cryptography and tokenization as the preferred methods to protect data in operative applications. For data at rest used for data analytics, development & testing, machine learning, and open innovation, there are new, innovative ways to truly anonymize data and fix the privacy vs. utility trade-off once and for all.

AI-generated synthetic data is THE way forward to preserve the utility of data while protecting the privacy of every data subject. This innovative approach has only become possible thanks to recent progress in the field of artificial intelligence. Generative deep neural networks can be trained on the original structured data and then used to generate new synthetic data points. These synthetic datasets preserve all the correlations, structures, and time dependencies present in the original data. Customer-related events like financial transactions, online clicks, movements, and so on can be synthesized, and all the important insights contained in the original data can be preserved in the synthetic dataset. At the same time, there is no way to re-identify the original customers.

With synthetic data, the way to do machine learning with anonymous, “full privacy, full utility” data is wide open. The evaluation results obtained by doing machine learning on synthetic data are very similar to those obtained from the original data. Application testing with synthetic data can cover the edge cases that are normally present only in the original data. And performance testing of applications is no longer a problem, because synthetic test data, in contrast to original data, isn’t scarce: millions of synthetic records can be produced with the click of a button.

With AI-generated synthetic data being used more and more, data security and legal departments will finally be able to sleep easier. At the same time, data scientists and development & testing teams will be able to focus on more productive tasks and won’t get distracted by legal and security requirements. Trust is good, control is better, but no dependency on trust is best.

And what about data adversaries? What would happen if they manage to steal 28 million synthetic records of UK children like in the story above?

Nothing.

These data adversaries would be sitting on terabytes of artificially generated data: nothing more than a high-quality look-alike dataset. They’d be painfully disappointed to discover that re-identifying any of the 28 million UK kids is simply not possible. And you can bet on that.

Netflix had been working on recommender systems for years by then. They were recognized for their business innovation as well as their technical excellence, had managed to hire the smartest data engineers and machine learning experts alike, and certainly knew all the ins and outs of movies, genres, and popular actors. Still, they were eager to learn more.

The competition began on October 2nd, 2006. Within a mere six days, the first contestant succeeded in beating Netflix’s existing solution. Six days! From getting access to the data and taking a first look, to building a movie recommender algorithm from scratch, all the way to making more accurate predictions for over a million ratings than anyone before. Six days, from zero to world class.

It didn’t stop there. Within a year, over 40,000 teams from 186 countries had entered the competition, all trying to improve on Netflix’s algorithms. And contestants increasingly started collaborating, sharing their learnings and taking lessons from others, forming bigger teams and ensembling ever more powerful models. Hardly ever before had such a rich and big dataset on consumer behavior been openly available. Together with a clearly stated and measurable objective, it provided a challenging, yet safe and fair, sandbox for a worldwide community of intellectually driven engineers. All working together on advancing science while, as a more than welcome side effect, also helping Netflix improve its core algorithm.

The Case for #OpenBigData

Being open enables you to pick the brains of a wider group, to bring fresh perspectives to existing challenges, to build upon the creative minds of the many. And with data being the lingua franca of today’s business world, the common denominator across departments, corporations, and industries, being open is really about sharing data, about sharing granular-level data at scale!

Openly sharing customer data at scale in 2006 was a bold move. But it was no coincidence that it was Netflix who did it. Already in their early years, they successfully established a culture of excellence, curiosity, and courage, well documented in “one of Silicon Valley’s most important PowerPoint decks” [1]. Sharing data broadly takes courage, but it is even more so a sign of curiosity and a drive for excellence. No holding back, no hiding out, no making excuses. Netflix was never afraid of their competitors. They were afraid of ceasing to strive for the best.

Openly sharing customer data at scale in 2019 is an (unl)awful move. Over the past years, the explosion in data volumes met a poorly regulated market, with few sanctions imposed, thus allowing excessive misuse of personal data. The tide, though, has turned: both regulators and corporations are acknowledging privacy as a fundamental human right, one that is to be defended [2]. This is indeed a new era of privacy.

Unfortunately, this plays into the hands of modern-day corporate gatekeepers: those decision makers who have never really been fond of being transparent and being challenged, and who were thus reluctant to share “their” data in the first place. It turns out they have found a new ally in defending their corporate data silos: privacy.

The Case for #SyntheticData

This is the point where one needs to tell the lesser-known part of the Netflix Prize story: as successful as the competition was for the company overall, they also had to pay a price in court. In fact, they were forced to cancel their second machine learning challenge, which was planned for 2011 [3]. Netflix had misjudged the anonymization measures they had put in place. Even though they limited the data to movie ratings and their dates, merely linked to a scrambled user ID, this proved insufficient to prevent re-identification. It took only 16 days after the data was released for outsiders (with no superhuman hacking skills) to link these user IDs with freely available public data – with enough time left to write up a whole paper on de-anonymization [4]. Netflix had unintentionally exposed the full movie history of part of their customer base, with no chance of undoing that privacy infringement. A decade later, Facebook had to learn the same painful lesson. Once the data is out and you have failed to properly anonymize it, no matter how good your intentions might have been, you will have a hard time undoing your actions.

This risk of re-identification in large-scale data is by now well understood by privacy and security experts [5], yet it is still widely underestimated by the corporate world. That’s why these experts face a challenging role within organizations, as they need to educate their colleagues that most anonymization attempts for big data in fact fail to provide safety for their customers. And these experts are forced to say NO more often than YES to new initiatives and innovation projects in order to keep privacy safe and secure.

Re-identification studies have repeatedly shown how little data it takes: mobile phone owners are re-identified by just 2 antenna signals, even when timestamps are coarsened to the hour of the day; 80% of credit card owners are re-identified by 3 transactions, even when only the merchant and the date of the transaction are revealed; and people are re-identified merely by their date of birth, their gender, and their ZIP code of residence.

So, this is the big quest of our time: How can we be open while being private at the same time? How can we put big data to good use while still protecting each and every person’s right to privacy? How can we foster data-driven, people-centric, and innovative societies and organizations, all at the same time, without giving up an inch on safeguarding privacy?

We at Mostly AI set out to solve this challenge and developed a one-of-a-kind technical solution to this long-standing problem: an AI-based synthetic data generator. One that learns from actual behavioral data to generate statistically representative synthetic personas and their data. Synthetic data that can be broadly shared, internally as well as externally, without exposing any individual. It’s all the value of the original data, but without the privacy risk.

This is 2019. It’s time to protect privacy, as well as to embrace the power of open again. It’s time to #GoSynthetic!
