Meaningless privacy guarantees vs. true privacy with Yves-Alexandre de Montjoye

Alexandra Ebert: Hello and welcome to the Data Democratisation Podcast. I'm Alexandra Ebert, your host and MOSTLY AI's Chief Trust Officer. This is our 30th episode and I have another extraordinary guest for you, Yves-Alexandre de Montjoye. Yves-Alexandre is an Associate Professor at Imperial College London, where he also heads the Computational Privacy Group.

He's one of the most influential researchers when it comes to privacy and de-identification. His papers can be found in prestigious journals like Nature and others. Besides his extraordinary scientific work, Yves-Alexandre is also a sought-after advisor for policymakers. Currently, he serves as a special adviser on AI and Data Protection to the European Commission's Justice Commissioner Reynders. He is also a parliament-appointed expert to the Belgian Data Protection Authority.

In 2018 and 2019, he was also a Special Adviser for Margrethe Vestager. He co-authored the Competition policy for the digital era report for her. What will follow in this episode today is a fascinating discussion on the misconceptions and the shortcomings of traditional anonymization, the challenges of implementing differential privacy in practice, and Yves-Alexandre's take on the pros and cons of synthetic data. I'm sure you will enjoy this episode. Let's dive right in.

Yves-Alexandre, it's great to have you on the show, I was very much looking forward to discussing all things privacy and anonymization with you. Before we get started, could you briefly introduce yourself to our listeners, and maybe also briefly mention what makes you so passionate about privacy and about important research work that you do in that field?

Yves-Alexandre: Of course, happy to. I'm an Associate Professor at Imperial College in London, where I lead the Computational Privacy Group. What makes me passionate about privacy protection is I think privacy is so fundamental to who we are in our societies. Technically, I think there's a real need to be able to find this balance between using data for good and how we can use data to learn more about ourselves, to solve some of the toughest challenges that we're facing all while preserving privacy. I think that's for us and the research we do that we need to go with this. How can we get most of the good without the bad?

Alexandra: I would say it wouldn't be an overstatement to say it's one of the most important fields nowadays, with the increasing importance of data in our lives and of artificial intelligence, to really find ways to strike this balance right. Also, what you have already done repeatedly in the past: making the industry and regulators aware of the risks that come with poor privacy protection so that we can really improve in that area. You wanted to briefly mention that you're only speaking for yourself here?

Yves-Alexandre: Yes, yes, absolutely. I just need to mention that obviously everything I say in this podcast, all the views and opinions, are only mine and not those of any of the institutions I work for.

Alexandra: Agreed, agreed. That of course makes sense. Okay, let's jump into the first topic. I am curious to get your thoughts on the misconceptions around traditional anonymization. What would you say are the biggest misconceptions and also the limits of traditional anonymization techniques that might not be that well understood in the industry and by regulators?

Yves-Alexandre: There's a lot of misconceptions. I think the main one to me is really the fact that the techniques and the way we've been thinking about anonymization has not progressed much, while the ability, the amount of data that we have been recording has changed dramatically. I think that's really where a lot of the issues that we're seeing today are coming from is basically we're still applying techniques that were invented in the '90s when we barely had the internet in Europe.

We were dealing with Excel-like data, a handful of columns, and tens of thousands of records. We're still trying to apply these techniques to the world of big data, the massive amount of data collected by IoT devices, by our phones and our cars and our smart meters, and fundamentally, they just don't have the same-- I mean, both of them are data, but they're really not the same data. I think that's really what the main misconception is, and why I think a lot of the traditional techniques we've been using do not really work anymore today.

Alexandra: Yes, absolutely. I think you can say that particularly this type of behavioral data you mentioned, the mobility traces, the financial transaction behavior, and so on and so forth is just a whole different piece to tackle when it comes to anonymization. Maybe for those who are-- Oh sorry.

Yves-Alexandre: No, please go ahead.

Alexandra: Maybe for those who are a little new to this field of privacy and anonymization, can you give an example to illustrate why is particularly behavioral data hard to anonymize with these outdated methods?

Yves-Alexandre: Yes, that's exactly what I was going to talk about, I think. Basically, one of the examples I like to take is k-anonymity. k-anonymity is this idea that you look at a data set and you modify it using a range of techniques, ranging from suppression to generalization to some kind of randomization, to basically make sure that no one in this data set can be singled out beyond a group of at least k people.

The idea is basically that even if I were to know quite a lot of information about you, potentially even close to everything that is in the data set, and I use this information to search for you, I will end up with a group of at least k people. Then you try to take this notion and apply it to something like location data, in which people are moving around, and take not even very precise GPS-type location data; let's just say that it's really coarse location data.

Take London, take the zip code where you are on an hourly basis; it's by no means the most fine-grained location data we have today. Even then, what k-anonymity would mean when you apply it to this kind of data is that you literally need to modify the data set so that even if I were to know a certain number of places and times where someone was, this person will always be part of a group of k people.

It means that you literally need to modify the data set so that, in general, there are always at least k people who are at the exact same place at the exact same time as you in this data set. I think that really explains why-- Yes, k-anonymity when you're talking about a date of birth, a zip code, and a gender-- Yes, I mean, making sure that there are at least k people of the same gender living in the same zip code and having the same year of birth is usually doable.

When it comes to location data, when it comes to a week or a month of this data, it literally means that every single hour, you need to have a group of at least k people who systematically traveled together and were at the exact same place at the exact same time, which to me shows how difficult it would be to apply even one of the most basic notions of data de-identification to location data, to the large-scale behavioral data sets that we're really interested in using today.
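To make the k-anonymity argument above concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption: the helper `smallest_group`, the column names, and the five toy records are hypothetical and not taken from any real data set. It measures k as the size of the smallest group of records that share the same values on whatever an attacker is assumed to know.

```python
from collections import Counter

def smallest_group(records, known_columns):
    """Return k: the size of the smallest group of records that share the
    same values on every column the attacker is assumed to know."""
    groups = Counter(tuple(r[c] for c in known_columns) for r in records)
    return min(groups.values())

# Toy demographic table (hypothetical values).
people = [
    {"id": 1, "zip": "SW7", "birth_year": 1985, "gender": "F"},
    {"id": 2, "zip": "SW7", "birth_year": 1985, "gender": "F"},
    {"id": 3, "zip": "SW7", "birth_year": 1985, "gender": "F"},
    {"id": 4, "zip": "E1", "birth_year": 1990, "gender": "M"},
    {"id": 5, "zip": "E1", "birth_year": 1990, "gender": "M"},
]
k_demographic = smallest_group(people, ["zip", "birth_year", "gender"])

# Toy hourly traces for the same five people: one coarse zone per hour.
traces = [
    {"id": 1, "h08": "SW7", "h09": "SW7", "h10": "W1"},
    {"id": 2, "h08": "SW7", "h09": "SW7", "h10": "EC1"},
    {"id": 3, "h08": "SW7", "h09": "W1", "h10": "W1"},
    {"id": 4, "h08": "E1", "h09": "E1", "h10": "E1"},
    {"id": 5, "h08": "E1", "h09": "E1", "h10": "E1"},
]
hours = ["h08", "h09", "h10"]
# k when the attacker knows the first 1, 2, or 3 hourly locations.
k_by_hours = {n: smallest_group(traces, hours[:n]) for n in (1, 2, 3)}

print(k_demographic, k_by_hours)
```

In this toy table the demographic columns keep everyone in a group of at least two, while knowing just two hourly locations already isolates one person. Keeping k above one over a week or a month would require whole groups to move together every hour, which is the difficulty described above.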

Alexandra: Absolutely, I think the example you've just illustrated shows how impossible it is to achieve this k-anonymity and still have a useful data set at the same time, where, I don't know, 99.9% of the information has to be deleted because you don't have these matches of k people being at the exact same locations. In your impression, how well is this risk of traditional anonymization techniques, when they are applied to modern big data and behavioral data, understood in the industry and on the regulatory side?

Yves-Alexandre: I think things have been improving, to be honest. I think when we started doing this research it was quite mind-blowing the extent to which, at least in my opinion, some people in the field seemed to not really understand the revolution in terms of the amount of data being collected that was happening before our eyes. When we published some of the first re-identification attacks, we had people telling us that it was not really different and that actually data could still be de-identified.

I think now close to 10 years later, I think it's generally accepted that high dimensional datasets cannot really be technically de-identified, cannot be kept at individual level, and made robust to re-identification attacks in general. I think it took quite a long time but I think we are getting there. There is an increasing recognition of this fact both in the industry and certainly by regulators.

Alexandra: Definitely. I also see that there's a change and that there's in general more awareness and understanding of these facts but still, I sometimes struggle to wrap my head around how it could still be possible that for example in the mobility industry or telco industry, mobility data is considered to be anonymous even though it's just, I don't know hashed every 24 hours and closer to pseudonymized data.

There are still so many data practices out there where I'm really struggling to understand how this can still continue. One other thing that I'm curious to get your perspective on, in general, is differential privacy. Particularly when you look at privacy conferences in the past years, it was considered to be somewhat the new holy grail of privacy protection. What are your thoughts on differential privacy, particularly also in the context of your recent Nature Communications paper that you and two colleagues published? I think it was titled "On the difficulty of achieving differential privacy in practice: user-level guarantees in aggregate location data."

Yves-Alexandre: Like a lot of techniques, I think differential privacy is a significant improvement over what existed before, but as with every technique, it would be a mistake to consider it to be the silver bullet and the proof that we have forever solved the issue of privacy. The first issue is more philosophical, which is that differential privacy takes a specific perspective on what privacy means and comes up with a definition, and then a mathematical framework around this definition.

I would love for us to have managed to encompass this really complex human-evolving notion of privacy into one mathematical formula, but I do think that this isn't likely to be true. Differential privacy is not privacy. It's one specific really well-thought-out take on what privacy means and how to protect it. I'm very sure that there exist cases of something that is absolutely perfectly mathematically differentially private yet that we as a society are going to find to be a privacy violation.

Alexandra: Now, you made me curious. Sorry to interrupt here, but if you say differential privacy is just one perspective on how you can frame and see privacy. Do you have some examples for us where another perspective would be useful?

Yves-Alexandre: I think a lot of it comes from where differential privacy comes from in a sense. It very much takes the idea that you do not want to discourage participation in a survey or in a census. It very much focuses on this idea that effectively the outcome of the algorithm that's going to be run on the data does not really depend on the fact that you participated in this survey or not. Again, that is one specific perspective. Then, on the opposite side, there are things you do want to learn from this data. The typical example would be this famous example of smoking causing cancer.

This is something that is a statistical fact that you want to actually learn from the data and is not a privacy violation. Then you have quite a lot in the middle on things that we might find sensitive from a privacy perspective that would be according to differential privacy not sensitive. You could imagine anything that would be for example statistical facts about a group that would be small enough for example and that we would be, "This is becoming too small. This is becoming potentially sensitive."

Again, I think the notion of differential privacy is quite a strong one centered around the individuals. I think it's a very good, very solid notion. It's just I think we need to be clear that it does not necessarily encompass everything that we want to protect with privacy. That's more of the philosophical one. Then there is the technical one, and then there is more fundamentally the question of whether differential privacy is achievable in practice.

I think my personal opinion on this is that differential privacy is amazing. It's beautiful. It's a beautiful theory. As you would expect, it's really well thought out in terms of really quantifying information leakage and the way you can control information leakage by adding a certain amount of noise to basically every single piece of information you're giving out. It works beautifully in a simple context.

You have a lot of really well-crafted examples in which it works beautifully. Then, when you look at more complex examples, and especially, for example, the context of the paper, which is location data, time-dependent data, basically, you want to do multiple releases over time, every month, for example. I think these are more complex cases in which you see some of the limitations of differential privacy, at least as we know it today. Basically, the simple way of doing differential privacy is to think that you have this budget, you set this budget, and you spend this budget the way you want. You take each release, and each release costs you something in your budget.

The issue with time-dependent data, for example, is that at least the basic way to do this would be, "Okay, every data release, every time I tell you the piece of information you want to know every week, for example, it costs you something." The issue is that unless what it costs you is negligible, very quickly, as you keep giving me the information week after week after week, at some point you would basically run out of budget. Your budget is finite, and if you want to keep doing it every week, at some point there's going to be an issue.
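The budget argument can be sketched in a few lines of Python, assuming basic sequential composition, where the epsilons of successive releases on the same people simply add up. The function names and the budget values below are hypothetical and chosen only for illustration.

```python
import math

def cumulative_epsilon(epsilon_per_release, n_releases):
    """Total privacy loss after n releases under basic sequential
    composition: the per-release epsilons add up."""
    return epsilon_per_release * n_releases

def releases_within_budget(total_epsilon, epsilon_per_release):
    """How many releases fit into a fixed total budget."""
    if epsilon_per_release <= 0:
        raise ValueError("each release must cost a positive epsilon")
    return math.floor(total_epsilon / epsilon_per_release)

# Hypothetical numbers: a total budget of 2.0 and weekly releases at 0.25 each.
weeks_affordable = releases_within_budget(2.0, 0.25)
year_of_weekly_releases = cumulative_epsilon(0.25, 52)
print(weeks_affordable, year_of_weekly_releases)
```

With these assumed values the budget is spent after eight weeks, and a full year of weekly releases would add up to a total epsilon of 13. Tighter composition results exist, but the total still grows with the number of releases unless each one costs a negligible amount.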

In that specific example-- That's a difficult thing to say. You can say, "These are completely two different data releases." That's actually what Google was doing in the example we looked at. Basically, they assume that every data release is basically a completely new data set, and therefore basically you did not have to care for this notion of budget which is very central in differential privacy, or you take the other way.

That assumes that where I am today, and the way I'm behaving today, and what I'm doing this week has basically nothing to do with the way I'm going to behave the week after and that how I've been behaving in week one and week two tells you absolutely nothing of how I will behave in week three. Actually, we showed in another paper that this obviously, and just to confirm you know what we know is that, "Yes, we're human, we're creatures of habits, we talk to specific people, we live in a specific region, we like to go to specific shops." Very clearly, that is not true, right?

Alexandra: Absolutely.

Yves-Alexandre: The way we behave is actually really stable. In another paper, we looked at interaction data. We basically looked at how you use, for example, something like WhatsApp to talk to people at a specific time, you're sending messages back and forth, et cetera, et cetera. Even-- [crosstalk]

Alexandra: How quickly you respond or something like that.

Yves-Alexandre: Yes, exactly. Whether you're going to respond, how many people do you talk to? How often do you talk to these people? How is the conversation going? Is it bursty, as Barabási would call it; which is short, back and forth, and then you don't talk to one another for any time between a few hours to a few weeks? How many people you talk to. If you call, are you spending most of your time with one person or are you equally distributing your time across people?

Actually, we showed in a recent paper that the way you do this is actually super specific, super unique to you, but also stable over time, meaning that the way you exchange with people in a specific week is actually quite similar to how you're going to do it the week after and the week after. Actually, we showed that even just interaction data, just literally how you exchange these messages and calling people is stable enough that we can use this to identify you later on. I would need to check the numbers. I think in 40,000 people, more than 50% of the time.

Alexandra: That's amazing.

Yves-Alexandre: Even worse and even more fascinating to me is that even if you look at-- Much later in time, I think we looked 20 weeks. Even 20 weeks down the line, this information is still really accurate to identify you. The way you behave today and the way you're going to behave 20 weeks from now, is actually-

Alexandra: Going to be similar?

Yves-Alexandre: - going to be similar. Not even similar, similar enough that you can be, "This is Alexandra now in March. 20 weeks down the line, that's my guess. This person, I think, is Alexandra," and I would be right a good fraction of the time.
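The intuition behind this kind of re-identification can be illustrated with a toy Python sketch. It is not the method from the paper, which the conversation does not detail: the three features, the 50 simulated people, and the amount of week-to-week drift are all assumptions made up for illustration. Each person gets a small behavioral profile, a later profile is the same habits plus a little noise, and the attacker simply picks the closest known profile.

```python
import math
import random

def nearest_profile(probe, reference_profiles):
    """Return the id whose reference profile is closest (Euclidean) to `probe`."""
    return min(reference_profiles,
               key=lambda pid: math.dist(probe, reference_profiles[pid]))

rng = random.Random(42)  # fixed seed so the toy run is repeatable

# Week 1: one hypothetical 3-feature profile per person, e.g. number of
# contacts, share of time with the top contact, reply speed, scaled to [0, 1].
week_1 = {pid: [rng.random() for _ in range(3)] for pid in range(50)}

# Week 20: same people, same habits, plus a small drift (assumed, not measured).
week_20 = {pid: [x + rng.gauss(0, 0.01) for x in feats]
           for pid, feats in week_1.items()}

# The attacker matches each later profile to the closest week-1 profile.
hits = sum(nearest_profile(feats, week_1) == pid
           for pid, feats in week_20.items())
accuracy = hits / len(week_20)
print(f"re-identified {accuracy:.0%} of {len(week_20)} toy profiles")
```

The match rate here is driven entirely by the assumed drift being small relative to the differences between people. That is the property described in the conversation: behavior that is both specific to a person and stable over time works like a fingerprint.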

Alexandra: What does this mean for attack scenarios, that you not necessarily only have to look for linkage attacks or protect against linkage attacks, but basically have this whole behavioral fingerprint that stays so consistent over time, as you just explained, which I imagine would open up the world for pattern attacks, or profiling attacks? What does this mean in practice and how can you address it?

Yves-Alexandre: That shows to me how difficult it is in a sense to think of all the potential attacks. Coming back to what we discussed, that was one of the reasons of differential privacy is actually to not have to think about all the potential attacks and to have something that is robust, even against literally any attack. The issue is that, as we discussed, this is sometimes really, really hard to achieve with good utility in practice and so you need to make compromises.

The way I see it is we're improving, we've been improving quite a lot on the defense side. We've been improving quite a lot in understanding that basically, the old deidentification defenses are just, by no means good enough today. We had a really, really strong one, potentially they were too strong and not applicable in practice. Now, there's a lot of research in trying to basically adapt the defenses to keep them very strong, but more applicable in practice.

I think there's quite a lot of work that is going on in building robust, yet more useful, applicable in practice defenses. Then I think on the other side, you have the same with attacks. Attacks are becoming increasingly sophisticated and accurate in trying to learn the general patterns of a person, increasingly using machine learning techniques to try to assess the extent to which you've really properly anonymized this data. I really do think that you need both.

You need both a lot of work in trying to define what are attacks, what are potential attacks, what's foreseeable, what's coming in terms of what is becoming possible in re-identifying people or inferring information about people from anonymous data, and really, how do you use this to then basically evaluate the defenses? Then using the attack to evaluate the defense, build better defenses, build better attacks, et cetera, et cetera. I do think that differential privacy really moved the needle forward quite significantly in the robustness of the defenses.

Now, to make them useful, we need to bring them down a little bit. Attacks are going to be very useful to see: are we bringing them down reasonably, or are we, basically, effectively creating vulnerabilities?

Alexandra: That makes sense. Maybe coming back to the Nature paper and to give our listeners some context, in case they're not familiar with it. You analyzed I think the differentially private mobility data of I think it was 300 million Google Maps users and you mentioned beforehand that applying differential privacy or achieving differential privacy in practice is difficult.

At the same time, if you look at it from a high level, you see there's an epsilon of, I think it was, 0.66 or something like that, so rather a good value given the recommendations of epsilon values of, I don't know, below one in the ideal case or not higher than seven in other cases. You would think, "Hey, well, I have a mathematical privacy guarantee now, it's differentially private." What was the problem here? Why wasn't that information private? Can you walk us through your findings briefly?

Yves-Alexandre: Yes. We did not analyze this data. We only analyzed what was reported in the paper in terms of how they've protected the data. Basically, the issue is and this is why-- Again, that's where you see some of the limitations is basically what they explained afterward, is that actually what they were protecting using this mechanism was a single trip.

Basically, they were applying what is called event-level differential privacy, which is basically the fact that you or a person basically took a specific trip from place A to place B because these were mobility matrices, basically, how many people have been going from place A to place B on a specific week, for example. That was protected.

That's obviously very different from what we call user-level differential privacy, which is basically the idea that you being part of this data collection was protected. Basically, they protected a single trip. Our argument was to say, "That's not really what matters. What matters is not--" I think the other way around is effectively making the assumption that if I were to search for you in this data set, all I would know is one single trip you made. We think that's completely unrealistic, obviously, because if I'm searching for you, I'm likely to know a lot more information about you than just one trip.
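The difference between event-level and user-level protection can be sketched with the textbook Laplace mechanism in Python. This is an illustrative assumption, not the pipeline used for the data discussed here: the flow counts, the epsilon of 0.5, and the function names are hypothetical. Noise calibrated to a single trip protects a change of one in one count, and a user who contributes several distinct trips changes several counts, so the standard group-privacy bound multiplies the epsilon by the number of trips.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def release_flows(flows, epsilon, rng):
    """Add Laplace(1/epsilon) noise to each origin-destination count.
    One trip changes one count by 1, so this protects a single trip."""
    return {od: n + laplace_noise(1 / epsilon, rng) for od, n in flows.items()}

def worst_case_user_epsilon(event_epsilon, unique_trips):
    """A user with `unique_trips` distinct trips changes that many counts,
    so the same noise only guarantees event_epsilon * unique_trips."""
    return event_epsilon * unique_trips

rng = random.Random(0)
flows = {("A", "B"): 120, ("A", "C"): 45, ("B", "C"): 80}  # hypothetical weekly counts
noisy = release_flows(flows, epsilon=0.5, rng=rng)
print(noisy)
print(worst_case_user_epsilon(0.5, 1), worst_case_user_epsilon(0.5, 10))
```

The multiplication is a generic worst-case bound and does not attempt to reproduce the figures from the paper. It only shows why an epsilon reported per trip says little about a person who makes many trips.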

Basically, we showed that actually, the guarantees are much, much, much weaker from a differential privacy perspective, and the epsilon therefore higher, as you assume the attacker knows a lot more of the trips that a person has been making. Basically, with one trip, which is what they report in the paper, they have the accuracy of a very strong membership attack be 66%. What we showed is that it goes up very, very fast as you know more trips taken by a person over a week.
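The link between an epsilon and an attack accuracy can be written down directly. Under pure epsilon-differential privacy, and assuming the attacker starts from a 50/50 prior on whether a person is in the data, the best possible membership guess is right with probability at most e^epsilon / (1 + e^epsilon). A minimal Python sketch, with a hypothetical function name:

```python
import math

def max_membership_accuracy(epsilon):
    """Upper bound on membership-guess accuracy under epsilon-DP with a
    50/50 prior: e^eps / (1 + e^eps), written in its logistic form."""
    return 1 / (1 + math.exp(-epsilon))

for eps in (0.0, 0.66, 1.0, 3.0, 7.0):
    print(f"epsilon = {eps:4.2f} -> at most {max_membership_accuracy(eps):.1%}")
```

For an epsilon of 0.66 the bound is about 66%, which is roughly 16 percentage points above a coin flip and in line with the single-trip figure mentioned above. The bound approaches 100% quickly as epsilon grows, so a guarantee that looks strong for one trip weakens rapidly once the effective epsilon is larger.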

I think we looked at my data. I think on the specific week we looked at, I think I took 39 trips, 32 of them were actually unique trips, meaning that I was going from a specific place A to a specific place B. Basically, which--

Alexandra: Not your regular home-to-work movement.

Yves-Alexandre: Exactly, not the regular home-to-work and just not double counting. Basically, even if I went to work every day that week, that only counts for one trip. Actually, that week, I was making a lot of unique trips, 32 of them. That would actually give an attacker a 95.4% certainty that I was part of this dataset. Again, you can see how protecting one trip is very, very different than protecting me as a user being part of this data set. Importantly, even worse, in the sense is, this is for a specific week. If I'm not mistaken, I think it was a year of data that was considered. You have another factor of 52 on top of this if you assume to, again, taking the very strong differential privacy notion of an attacker that knows basically everything about you and is trying to figure out whether you were part of the data set or not.

It's obviously a very strong attacker but it basically shows I think, how difficult it is to apply differential privacy, for example, in this specific context.

Alexandra: Definitely. I think if I remember correctly, you then also multiplied this and came to an epsilon value for the user-level privacy of I think it was a two-digit number, or even higher, something like 50, or very, very high and therefore meaningless privacy guarantee in practice.

Yves-Alexandre: Exactly.

Alexandra: To sum it up, can we say that applying differential privacy in practice, one of the challenges is the assumptions that you have to take which sometimes could become a little bit unrealistic, or at least a little bit detached from the real world risks, or in a way that you still preserve some utility and therefore, you can then see scenarios like here where the differential privacy was only applied to the event-level and not the user-level where it really would have been needed.

Yves-Alexandre: It's difficult because there are a lot of different dimensions. I think the first thing and we should be very, very clear on this, what we're seeing in this paper is not that there's an obvious attack, and that it's trivial to re-identify someone in this dataset. [crosstalk] The attack in itself is unrealistic. I think we should be very clear. That's the attack that differential privacy is supposed to protect against. Because differential privacy very much comes from the idea that you need something that is so strong that it will protect against this very, very strong attacker so that we can give you really strong privacy guarantees.

What is slightly, I think, worrisome here is that, in this specific instance, there is a very, very strong guarantee that you are supposed to give users, which is what differential privacy is built on, this idea that I am not making assumptions on what an attacker can know because I'm taking the strongest possible attacker I could think of, and I'm telling you I protect against this, and therefore, by definition, I protect against all the weaker ones, more realistic ones. The issue is basically, we can do this.

Basically, what is done and what they did in this case, is basically they effectively assumed a little bit through the backdoor, a weaker attacker, by using event-level differential privacy. It's basically either they assume that you only know one trip of a person, or you assume-- That's not even exactly this. Or you assume there's a disconnect between the way you behave, et cetera, that would make an attack unrealistic or unlikely to happen.

The issue is that, in a sense, you can't have it both ways. You can't have this super strong formal privacy guarantee that was designed to avoid the back and forth and the risk that someone could come up with a better attack and then basically weaken it and still claim the formal guarantees. I don't know if this is clear, but that's really the struggle. I think the bar is really high and you don't really manage to meet the bar, and then you're lowering the bar, and still claiming that you have this very high bar. That's really the issue.

Again, not to say that the lower bar is not already pretty high. In this case, I think it was, because it's aggregate location data, it's 52 matrices, et cetera, et cetera. Again, specifically in the paper, they make claims that are claims of user-level differential privacy, as they say, "It at best improves the level of certainty of a random guess of an attacker to infer for users in the data set by approximately 16%," which is linked to user-level differential privacy and a really strong guarantee that your participation as a user is protected according to this level.

Alexandra: Basically, transparency and openness and the communication, what the guarantee actually means and protects, and how strong it is.

Yves-Alexandre: Yes, and I think also the recognition that the bar is so high that often in practice you need to simplify, you need to make assumptions. The big question is whether the bar is still high enough with these assumptions to prevent any reasonable attacks, or whether without realizing it by making some of the assumptions your epsilon looks very good, and yet attacks are becoming possible.

That links back to what we discussed before, which is, I think the strong need for considering attacks on top of formal privacy guarantees. Basically using the attack to evaluate the assumptions that you make to basically make formal privacy guarantees work in practice.

Alexandra: That makes sense. One other topic I'm dying to hear your thoughts on, and as you can imagine our listeners are also very interested in, is synthetic data. Maybe before we dive into synthetic data, since it's still not a clearly defined term, what do you understand as synthetic data? Can you also imagine some things where you would say this shouldn't be termed synthetic data in your opinion?

Yves-Alexandre: To me, it has to be fully synthetic. Otherwise, if you start calling synthetic data something that's half synthetic, et cetera, I think it's dangerous because you're opening up the door to potential attacks, I think, much more than fully synthetic. Just to step back, basically, all of the data has to go through a model, and then the model is spitting out synthetic data and that's the only synthetic data. At least that's what I would call synthetic data. [crosstalk]

Alexandra: Yes, we're fully on the same page with you here because we also see it like that.

Yves-Alexandre: If it's really partially synthetic, in general, I think I'd be a lot more worried from a privacy perspective.

Alexandra: Agreed. Now that we've cleared the definition, and saw that we are on the same page with AI-generated fully synthetic data, what are your thoughts in general on this technology and where do you see its role now, and also in the future when it comes to privacy protection and utilizing data in a privacy-friendly manner?

Yves-Alexandre: There's a lot of tools out there and in the same way that differential privacy is not a silver bullet, I think synthetic data is not a silver bullet, and secure multi-party computation and query-based systems are not a silver bullet, I think. To me, all of these are really, really useful tools in the toolbox. I think it's very much a question of what is the problem you're trying to address and which tool or which combination of tools is best suited to what you're trying to achieve?

I think quite often actually, there is going to be a combination of tools and not a single one. For synthetic data per se, I think the first use, and to me the biggest one, is really this: we're data scientists. We know that there's often quite a dramatic difference between what people are going to tell you is captured and exists in the data set and what you see when you actually get your hands on the data and start analyzing it. I tend to think synthetic data is really quite fantastic to be able to basically have data that has a very high level of privacy protection and that you can pretty much give to anyone to get a sense of what the data looks like. It will more or less look like the original data, it will feel like the original data. You will see a lot of the quirks and ideas of what is in the original data without a lot of the privacy concerns and you have individual level data.

Alexandra: Exactly, which is so important for explorability of data, where query-based systems or highly aggregated data are just not doing the trick.

Yves-Alexandre: Exactly. You can use this to get a sense of what's in the data, you can use this to write your piece of code, you can use this even to run some more advanced statistical analysis to see the extent to which your database is properly set up, et cetera, et cetera. I think there's quite a broad applicability of synthetic data.

That being said, and as you know, we discussed this before, I do think that it's also extremely important to realize that this is not the real data. It is fully synthetic data. It's basically, we took the original data, we trained a machine-learning model to learn basically what the real data looks like and then you generate data that looks like the real data. It is not the real data. For some characteristics, some summary statistics, it might be extremely close to the real data, but it's not the real data. For a lot of statistics, it might be very different from what is in the real data. We don't know.

I think what's really important here is to understand that it looks and it feels like the real data, it does not mean it is the real data, it does not mean that some portion or something that you're seeing in the synthetic data will also be true in the real data. [crosstalk]

Alexandra: That's because one of the big-- Sorry to interrupt you. Go on.

Yves-Alexandre: No, please.

Alexandra: That's, of course, one of the two main questions we always get when somebody first starts out with synthetic data and of course, here it's not possible to generalize because there's just not a standardized approach of generating synthetic data. What we see with our clients is that they already use it in production for AI training and when they compare a model trained on synthetic data to the original data, they get so close that they can really use it as a replacement.

I agree with you that in certain scenarios, it might make sense to, for example, first develop something on synthetic data and then check back on the real data, for example, in super high-risk cases, medical emergencies and so on, and so forth. I think that there are many cases where synthetic data can work as a replica or as a replacement for real data. Of course, we're also working on this standardization effort of synthetic data together, that definitely the industry will benefit if there are some standards for not only synthetic data privacy, but also accuracy that no matter whether you're using a vendor, whether you're using self-generated synthetic data or an open-source tool, you have some industry-wide accepted standards to really check back how close are you to the real data and how far you can go.

Yves-Alexandre: I think "check back" is the right term. It's this idea that sometimes it might be very close, but you always need this check-back, in a sense. I think checking back is absolutely essential. It basically means that if you train a statistical model on synthetic data and you see a very strong, statistically significant correlation between A and B that would be very meaningful for your research question, for example, then that's evidence that there might be a correlation, but you always need to check back again in the original data.
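[Editor's illustration: a minimal sketch of the check-back described above, not something shown in the episode. It compares a correlation between A and B measured on synthetic data against the same correlation in the original data. The function names, the simulated data, and the `tolerance` of 0.1 are all assumptions made for this example.]

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def check_back(synthetic, real, tolerance=0.1):
    """Compare the A-B correlation seen in synthetic data with the real data.

    `tolerance` is an arbitrary illustrative threshold, not a recommended value.
    """
    r_syn = pearson(*zip(*synthetic))
    r_real = pearson(*zip(*real))
    return {
        "synthetic": r_syn,
        "real": r_real,
        "confirmed": abs(r_syn - r_real) <= tolerance,
    }


def sample(n, noise_sd, rng):
    """Simulated (A, B) pairs where B = A + Gaussian noise."""
    rows = []
    for _ in range(n):
        a = rng.gauss(0, 1)
        rows.append((a, a + rng.gauss(0, noise_sd)))
    return rows


rng = random.Random(0)
real = sample(2000, 1.0, rng)       # stands in for the original data
faithful = sample(2000, 1.0, rng)   # a generator that preserved the relationship
artifact = sample(2000, 0.2, rng)   # a generator that overstated the relationship

faithful_result = check_back(faithful, real)
artifact_result = check_back(artifact, real)
print(faithful_result)
print(artifact_result)
```

In this simulation the overstated correlation looks convincing on the synthetic data alone; only the comparison against the original data reveals it as an artifact of the generator.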

For training a machine-learning model, there's even potentially good machine-learning evidence that it might actually help with the generalizability of your model, yet, before deploying the model, you need to check back. You need to make sure that indeed, what the model learns is not actually an artifact of the way you generated the synthetic data, that actually the model performs well, when you put it in practice with real data. We have to train [crosstalk]-

Alexandra: Sorry to interrupt you, but did you really say that in every case you need to check back? I'm completely on the same page with you when we say, "Okay, it's a high-risk scenario, and I'm developing a model that should, I don't know, tell me whether a patient is going to surgery or not. They absolutely want to make sure that the person has the type of cancer that needs to be surgically removed." If a model is trained to recommend that when I'm buying a red sweater, I will potentially also be interested in a red t-shirt, do I really need to check back on the original data here, if it's a low-risk scenario? I don't think so.

Yves-Alexandre: That's a good question. For high-risk, definitely and I think, like when you look at the Act proposed by the Commission, there's a strong element of continuous testing, basically the performances of your algorithm. It's hard for me to say, "Oh, for low-risk, then you should or you should not." I think it very much depends. The scientist in me would always think that we should just make sure and building these feedback loops is really important. Even for an algorithm that would not have much impact, you still develop this algorithm. You still develop it for a specific purpose.

Making sure that you have this feedback loop, even if it's just a recommendation system for a sweater, might still be valuable. Again, I think mostly, and when I say this, it's really because I get very worried sometimes, on some of the discussions on synthetic data, because I'm torn, at the same time, I think it's fantastic and it's a great way to be able to share individual-level data with limited privacy risk with a large number of people. At the same time, I see reports and like, "Oh, we've been testing new drugs, correlations or something using synthetic data." I'm like, "No, please. [crosstalk]"

Alexandra: Absolutely, but here you're giving me your high-risk scenario again.

Yves-Alexandre: This is the kind of thing. It's just like, "No." To test and everything, yes, but no. If you take a scientific journal, I think this one, I'm like, "You should not be allowed to publish in the scientific journal on synthetic data."

Alexandra: Interesting. I've talked to scientists--

Yves-Alexandre: If you think of the high risk, that's the one in which you're like, "No." Do everything, validate everything, you need this feedback loop. Results should be on the real data because otherwise, you will always have the question of, is specifically what you're observing real or an artifact of the way you generated synthetic data? An artifact of how you built, basically, the model that generated the synthetic data?

Alexandra: Again, thinking of our customers, this is a trust that builds up over time because, of course, initially, you always compare the results you get on synthetic data to the production data, to the real data. If you've seen over and over and over again that you're getting not the exact same, but super close to the exact same results, then at one point in time, you say, "Okay, I don't need to check every time. Every once in a while is definitely sufficient."

I just had to laugh when you brought up this example of scientific journals. I talked with another researcher a few months ago and he shared that one of the challenges in academic research nowadays is that so much happens on data which sometimes is only accessible to a privileged group of researchers, and it's really, really hard to replicate results if you can't get access to this data. I would even argue that there is a role for synthetic data to at least make the data the research was conducted on available in a synthetic way, so that other researchers could try to validate and see if they get close to it.

Also in journals, since you brought up the AI Act and validation and continuous testing of AI systems, particularly in the context of ethical AI, fairness, and so on and so forth, I would also see a role for synthetic data to just have validation test sets out there that are not derived from the original production data, because many times when I talk with fairness practitioners, they have some unbalanced data because the customer structure just doesn't reflect full human diversity or particularly doesn't have enough examples of members from minority groups. If you were to create a synthetic dataset that has not just representative, but over-representative examples of minority groups, you would have a better way to test and validate the systems for bias, as opposed to just using your existing dataset.

Of course, this is super early research when it comes to AI fairness and synthetic data. In general, I think there is a role for it to help with validation and seeing whether the AI system, for example, in that case, accelerates some bias or is treating people fairly.

Yves-Alexandre: I think to me, I really see the value in this data that you can basically broadly share for a range of purposes, but if I think of how you use the spectrum of tools that exist, from differential privacy, to synthetic data, to query-based systems, you can really see how they're complementary. They help you do something different. Differentially private aggregates are basically something that you can broadly advertise on your website, very general things like frequency tables, et cetera, that will fit probably like 80% or 90% of what people want to do with this data.
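[Editor's illustration: a minimal sketch of a differentially private frequency table of the kind mentioned above, using the standard Laplace mechanism. It is not taken from the episode. It assumes each person contributes exactly one record, that the list of categories is fixed in advance and public, and that neighbouring datasets differ by adding or removing one record, so each count has sensitivity 1. The age bands, the value of `epsilon`, and the function names are made up for the example.]

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_frequency_table(values, categories, epsilon, rng):
    """Release noisy counts for a fixed, publicly known list of categories.

    Assumes one record per person, so the table has L1 sensitivity 1 and
    Laplace noise with scale 1/epsilon gives epsilon-differential privacy.
    """
    counts = Counter(values)
    scale = 1.0 / epsilon
    return {c: counts.get(c, 0) + laplace_noise(scale, rng) for c in categories}


# Hypothetical age bands; "50-59" is released even though its true count is 0.
categories = ["20-29", "30-39", "40-49", "50-59"]
values = ["20-29"] * 40 + ["30-39"] * 25 + ["40-49"] * 10

table = dp_frequency_table(values, categories, epsilon=1.0, rng=random.Random(42))
for band in categories:
    print(band, round(table[band], 1))
```

A production release would draw its noise from a cryptographically secure source and track the privacy budget across all published tables; the seeded generator here is only for reproducibility of the sketch.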

Then others are going to want to dig deeper, look at a very specific question, at a different frequency table that you've not released, or run some specific piece of code or train a machine-learning model against the data. I think this is where you see the symbiotic relationship between synthetic data and query-based systems, in which basically you have the synthetic data and you can make this broadly available. People can use this to see what's in the dataset, to see if it can answer the research question, or they can use this to train a first machine-learning model and develop their code.

Then once they have a good sense of what they can do, what they want to do, what the statistical test is or what the training procedure is, they can then go and use query-based systems and open algorithm mechanisms to basically go to the real data and either validate that the model has been trained properly against the real data, or validate the finding that they have, the fact that these two things, once you control for another one, are really strongly correlated in the real data. Basically, they validate their findings against the real data using the query-based system. This is really what you have.

That's how they can really work together to cover the spectrum of applications.
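[Editor's illustration: a minimal sketch of the workflow just described, under stated assumptions. A model is developed on broadly shared synthetic data and then validated against real data that stays behind a query interface returning only an aggregate score. The class `QueryBasedSystem`, its `min_records` rule, and the simulated data are hypothetical and stand in for a far more carefully engineered system.]

```python
import random


class QueryBasedSystem:
    """Hypothetical stand-in: holds the real records, releases only aggregates."""

    def __init__(self, real_records, min_records=100):
        self._records = real_records
        self._min_records = min_records

    def accuracy(self, predict):
        """Run the analyst's model on the real data and return one number."""
        if len(self._records) < self._min_records:
            raise ValueError("too few records to release an aggregate")
        correct = sum(predict(x) == y for x, y in self._records)
        return correct / len(self._records)


def fit_threshold(records):
    """One-feature classifier: threshold at the midpoint of the class means."""
    pos = [x for x, y in records if y == 1]
    neg = [x for x, y in records if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: int(x > threshold)


def make_records(n, rng):
    """Simulated (feature, label) pairs; class 1 is shifted upward by 2."""
    records = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = rng.gauss(2.0 if y else 0.0, 1.0)
        records.append((x, y))
    return records


rng = random.Random(7)
synthetic = make_records(1000, rng)  # stands in for the shared synthetic data
real = make_records(1000, rng)       # never handed to the analyst directly

model = fit_threshold(synthetic)         # step 1: develop on synthetic data
system = QueryBasedSystem(real)
real_accuracy = system.accuracy(model)   # step 2: check back on the real data
print(round(real_accuracy, 3))
```

The design point is that the analyst only ever sees the synthetic records and the aggregate result; a real query-based system would add further protections, such as auditing queries or adding noise to the released score.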

Alexandra: Absolutely. You, unfortunately, have to leave soon. This actually is a segue to my second to last question. Since you mentioned that there are so many tools out there which are complementary, I'm super curious to get your thoughts and view on the state of the art of balancing AI and data innovation with privacy protection. Some say, "It's too heavily regulated, we can't do anything with our data." Others say, "Well, everything is already there." Where are you on this spectrum? How do you see the future of privacy protection and data utilization? Do we already have the tools to make data utilization in a privacy-friendly manner happen, or is there something that's still missing, a huge gap that needs to be filled?

Yves-Alexandre: That's quite a broad, tough question. Maybe I can focus on two thoughts. The first one is whether we have too much regulation of data, and I don't think so. Again, I'm very biased. I'm working for a data protection agency. I do think that GDPR is getting a bad rap mostly because people don't really understand and fully use it. I do think that sure, there are places where it can be improved, but in general, it's a good, comprehensive piece of legislation. It does find quite a good balance between preserving privacy, protecting data, and allowing data to be used.

I don't think we have too much and I do think that if you look at Recital 26 on anonymization, there is quite enough room to maneuver and find ways to use data, while preserving privacy.

Alexandra: Absolutely. Coming to the tools, do you feel that with this kind of complementary working together of synthetic data, differential privacy, query-based systems and some others, we already have everything there to make full use of data in compliance with GDPR? Or do you see a gap, where some use cases can't be realized currently?

Yves-Alexandre: I think we have a lot of the fundamental bits. We know which tool is good for which application, I think, at this point. I'm a researcher, there's always more research to be done-

Alexandra: Luckily.

Yves-Alexandre: - but we do have a good sense of the tools that exist out there and how we can use them. I think today, the main challenge is how do we communicate what these tools can do and how do we get these tools adopted in practice? I think it's much more of an engineering challenge to be able to understand what the use cases are and how we can use the tools to address each use case. That includes, in some cases, I think, the need for building up capabilities on the side of companies and governments wanting to share data, in terms of what are the tools out there and which tools can I use? I think at the moment, unfortunately, in my opinion, privacy is too much seen as a check-boxing compliance type exercise and not an area where we can innovate.

That's really unfortunate because I think a lot of good tools exist, but you need to invest to understand which tool is good for which application and how do you build a pipeline of these tools that will then be hopefully broadly reusable within the company and really capabilities you can use. I know this is a very generic sentence, but there's really a need to stop thinking of privacy as this thing that you need to fill in forms and check boxes and that limits you in what you can do and rather, "These are a lot of really, really good tools. How do we use these tools to create innovation?"

Alexandra: Absolutely. Personally, I'm happy that I live in the European Union, where it's recognized that privacy is an important value to uphold and that you're not sacrificing this just for data innovation, but, of course, it's also in my interest that we find a balance here because AI and data innovation is not just nice to have anymore, but something that's really important. Can I ask my very last question? I know we're-

Yves-Alexandre: Yes.

Alexandra: - a little bit at the end of the time, but I wouldn't want you to leave before I get your thoughts on that, because it's one thing I've been discussing quite a lot with regulators in Brussels in the past few weeks and months. On the one hand, we look at the ambitions Europe has when it comes to AI and data, with the European Commission striving to become the global leader in responsible AI, enabling widespread AI adoption amongst SMEs, enabling a data ecosystem where it can really create value and reuse data, and even encouraging that data is used for AI for good initiatives. We were wondering, can all of this even be achieved if we don't privilege or favor anonymization to make this possible? Because if you look at some member states, there are quite some high barriers to even get to anonymizing data and then using it in a privacy-preserving manner. What are your thoughts on that?

Yves-Alexandre: I think there's a need from a regulatory standpoint to look at GDPR, look at Recital 26 and potentially try to see the extent to which data protection agencies can give more guidance on how the tools can be used to achieve anonymization. I do think that the framework of GDPR, as I said, allows you to do quite a lot. I think now really, what we need the most is guidance, and kind of an update on the guidance. Now that we don't have this convenient notion of "this is the attacker's set of quasi-identifiers, it's k-anonymization with a k equal to eight, and therefore it's okay," and now that we have these modern, as I call them, privacy engineering tools, which combination of them is good and allows you to reach the standard set forth by GDPR for anonymization? I think it's mostly a question of data protection agencies giving more guidance on how these tools can be used to achieve the required standard for GDPR.

I do think it's feasible. Again, we discussed this, GDPR is getting a bad rap for preventing use of data and for the cookie banner that you see on websites, like hundreds of times a day, but specifically when it comes to anonymization, Recital 26 is actually brilliantly written. It takes this negative notion of identification. Regulators love to ask, "Oh, is this future-proof?"

Recital 26 is a good example of something that is future-proof. They did not try to be like, "Oh, well, if it contains your name or your IP address or your phone number or your email, or like the hash of your email using MD5, then it's not anonymous." We know this is a scenario that is going to evolve quite fast. Both on the attacks and the defenses, we do not want to write down in the law what is or what is not anonymous according to today's standards. Instead, we're going to take a negative, quite well thought-out definition: data is anonymous unless, given everything we know today, it is not anonymous anymore.

I think that's a really flexible and smart way to think about what constitutes anonymous data and to make sure that it's useful, that you can use the data, you can use it to train machine-learning models, and at the same time, that it doesn't become a loophole out of GDPR, where as soon as you took a dataset and pseudonymized it, you call it anonymous.
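[Editor's illustration: a minimal sketch of why the MD5-hashed email mentioned above is pseudonymous rather than anonymous. It is not from the episode. Anyone holding a list of candidate email addresses can hash each one and match it against the "anonymized" identifiers. The addresses and function names are made up for the example.]

```python
import hashlib


def md5_hex(email):
    """Hash a normalized email address the way a naive pseudonymizer might."""
    return hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()


def reidentify(hashed_ids, candidate_emails):
    """Dictionary attack: hash every candidate and look for matches."""
    lookup = {md5_hex(e): e for e in candidate_emails}
    return {h: lookup[h] for h in hashed_ids if h in lookup}


# A released dataset keyed by MD5(email), with hypothetical addresses.
released_ids = [md5_hex("alice@example.com"), md5_hex("carol@example.com")]

# The attacker's own list, e.g. a mailing list or a leaked address book.
attacker_candidates = ["alice@example.com", "bob@example.com"]

recovered = reidentify(released_ids, attacker_candidates)
print(recovered)
```

The attack needs no weakness in MD5 itself, so swapping in a stronger unsalted hash would not change the outcome, which is consistent with defining anonymity by what an attacker can do rather than by a list of fields.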

Alexandra: Anonymous. Definitely. I'm with you on that. I also think that GDPR is very well written and future-proof, and I think you named the perfect example here with Recital 26. I also agree that guidelines would be helpful for the industry. I think sometimes it's also not necessarily the text of GDPR that needs to change, but how it's interpreted by data protection authorities. They have quite a challenging task to manage. On the one hand, they need to make sure that privacy is protected, but now with the increasing importance of AI and data for the economy and society as a whole and also for security, democracy, if you talk to Paul Adams and so on and so forth, they just have to balance it with other rights and other needs of society. Therefore I think something like privileging anonymization is something I hope we will see in the future, because I still hear about quite a lot of barriers when I talk with customers there and about how data protection authorities interact, but I'm super curious to see the upcoming guidelines.

Yves-Alexandre: We discussed GDPR and all the good. I think if there is one thing that I would want to improve, or that I would think was not the right or the best decision, it was actually, in the structure of data protection authorities, to still have data protection authorities at a national level. I think this is something that is very linked to what we discussed. This is quite a complex, fast-moving field and it's very, very difficult for data protection agencies to have the capabilities and the resources to be able to follow all of the relevant topics and to have the time to give opinions and to regulate on a really, really broad range of topics.

I think the need for a unified agency at European level, properly resourced and with technical capabilities, was less foreseeable at the time when GDPR started, but I think now we see that, in my opinion, it's increasingly becoming necessary.

Alexandra: Agreed. Especially with the AI, data, and digital services acts and so on and so forth on the horizon. We really need this unified digital enforcement authority, which has the resources and also the expertise and also perspectives which are more balanced than just a privacy or just an antitrust or a competition perspective. I really hope that we will see something like that in the future.

Perfect. We already went way over our time slot. Thank you so much for taking the time. It was a wonderful and very thoughtful discussion. I always enjoy when we have the time to speak. Thank you so much for coming on the show, and I'm very much looking forward to continuing this conversation at one point in time.

Yves-Alexandre: Pleasure. Thanks for having me.
