How to implement data privacy? A conversation with Klaudius Kalcher from MOSTLY AI

Alexandra Ebert: Welcome to the 19th episode of The Data Democratization Podcast. I'm your host, Alexandra Ebert, MOSTLY AI's Chief Trust Officer. Today, I have a wonderful guest on the show, Klaudius Kalcher. Klaudius is one of our three co-founders here at MOSTLY AI and he's our Chief Data Scientist. I don't know many people who are more passionate and deeper into privacy protection, and also how it can be implemented in practice than Klaudius is. Today, we'll cover everything you ever wanted to know about synthetic data privacy.

For example, how data synthesis differs from legacy anonymization approaches like masking, then what actually counts as a privacy leakage, and what's only statistical inference? We'll also talk about how different privacy risks need to be addressed, and what differential privacy is. Lastly, we'll speak about privacy mechanisms that we've implemented in our synthetic data platform to keep your customer data safe from re-identification. If you're curious about the mathematical aspects of privacy and would like to get a behind-the-scenes look into privacy-preserving synthetic data generation, you are in the right place.

Don't worry, even though Klaudius has a PhD in medical physics and is an expert on statistics, you won't need a PhD to understand it, I promise, and I'll try my best. Besides talking about privacy, Klaudius also shares the story of why MOSTLY AI was founded, talks about why he's a fan of open data projects, and explains why he's convinced that data democratization is crucial for a data-driven future. Plenty of insights and takeaways in this episode, and you definitely don't want to miss out on this one. Let's dive in.

Klaudius, welcome to The Data Democratization Podcast. It's a true honor to finally have a founder on the show, because you're one of the three founders of MOSTLY AI. As you can guess, I have plenty of questions for you. First, what's your background and how did you end up becoming a data scientist? Second, where does your passion for privacy come from? Third, what motivated you to found MOSTLY AI? What's the story behind the company? Feel free to start wherever you want, but there are so many things I'm curious about.

Klaudius Kalcher: Thanks, Alexandra, for inviting me to this chat. Well, what's my background? Maybe I'll start there: my background is in computer science and math. I always had this interest. I started out my studies very much in computer science, but then quickly realized that what actually fascinates me in working with computers, with programming, is not the technical details of the programming, but really what we can use it for: what we can learn, and really understanding the story of data.

I shifted more and more towards the theory, mathematics and statistics, and then after finishing my studies in statistics, I wanted to go back into more practical topics and really reap the fruits of this understanding. The topic that I jumped on for my PhD thesis was then using statistics to understand the human brain, which is a very fascinating frontier of human understanding right now, since we don't fully know how this really works. That was, for me, really an excellent case [chuckles] where I could apply all the maths that I learned.

Alexandra: I can imagine. To interrupt you there, what was the most exciting thing you found out about the human brain?

Klaudius: I will say the most exciting thing I found out is that our window into understanding this system is very limited. You might know fMRI, functional magnetic resonance imaging, from pictures that you might see in the media or in the literature, with this high-resolution image of the brain and then some colorful blobs on it, where you see, oh, this region is activated here, that region is activated there. You have to realize that these functional images are on a much coarser scale, and you don't really see directly that here is a neuron that's activated right now.

It's more akin to trying to understand a computer by looking at it with a thermal camera and then you see, well, for video processing, this part gets hotter and for audio processing, that part [chuckles] seems to get warmer. That's where we are. We see something's going on roughly over there. Something's going on roughly over there and we're trying to connect the dots. We're still at the very start of understanding the human brain but it's very fascinating to see also the very specifics of where we technically are.

Alexandra: I can only imagine and now I'm happy that our IT support actually doesn't work that way that it tells me okay, if I have a problem with my computer, yes, it's getting hotter there, but not yet sure what to do about it. Sorry, I interrupted you when you were just in the process of telling your story, so please continue.

Klaudius: Yes, sure. One of the things that was really a booster to my research was already data sharing in this domain. In fMRI research, back in the day, there was a project called the Human Connectome, which was a collection of over 1,000 datasets from over 1,000 different individuals where the brains have been scanned, and then those could be analyzed together. You could download the full dataset online and start working and analyzing this kind of data. Which normally would be extremely expensive to do because an MRI scanner costs millions of dollars.

Scanning thousands of people, you will need tens of thousands of hours to do that. Data sharing allows this collaboration, and really fast learning on a scale that was not there before. It showed how powerful that idea actually is, in enabling all these different researchers from around the world to collaborate on these data that are usually valuable. We had researchers from the most varied backgrounds like US, Europe, India, China, whatever, all collaborating. It was not just that they had access to this data with a very low barrier of entry, since you could just download them, but also that there was a common data set that we could collaborate on.

It was not just, "Oh, I found this from my data," and someone else would say, "I found something else in my data." You had a shared data set and that made it much easier to get to a common understanding of what's going on there and thus, learn about what we were all interested to learn. Already back in this time, privacy was a consideration because if you imagine this Human Connectome, that's obviously a sensitive brain scan of these individuals. The practice there was to anonymize this by obviously, removing any name or identifiers.

Also in the imaging data, removing all the content of the MRI image in the region of the face so that you could not try to recreate the face from the MRI scan and then identify an individual there. However, if you think about it, if we assume that these data of the brain are very high resolution and give a lot of information about what's going on there, if our assumption is that all of our brains are different in very unique ways, then the question is, is that kind of anonymization also future proof?

Wouldn't it be possible for someone else, maybe in the future, who has more access to more information about brains and how you can compare these activities, to then use the actual information of what's going on in the brain to identify an individual? Maybe that's just pessimistic, but it really shows that you need to think through what exactly your concept of privacy is if you're sharing data.

Alexandra: Absolutely. I think it's the responsible approach to have these kinds of thoughts, because only if it seems to be private today, who knows what happens in the future?

Klaudius: Yes, you also asked where does this passion for privacy stem from? I would say that was something that occurred to me because I was already interested in privacy before that. Why did the interest emerge? It was because looking at data and analyzing data showed me how powerful the learnings, and really the insights from data can be. That means a couple of different things. First of all, if someone has data about you and other individuals, they can use these insights and this information asymmetry to get an advantage over you.

That would be in the direction of data misuse. That's one of the fundamental reasons for privacy, that you do not create an asymmetry of information where someone else could have this power over you by misusing data. [crosstalk] What they know about you to manipulate you in any way and harm you by using this information. You don't know that they have the information. That's why this data needs to be protected but on the other hand, a secondary important aspect for me is also the data democratization. What do I mean by this?

In the market, this is called like this, but I think everyone might have a slightly different understanding. For me, one of the most inspirational statisticians of the last couple of years or decades was the late Swedish statistician Hans Rosling. Some of you might know his visualizations in Ted Talks where he really brings data to life in very visually appealing shows. Where he shows how world development data, like life expectancy, or incomes compared around the world have changed over time and how we actually, by we, I mean, people who have learned about the world in broad context in school, and that's decades ago.

Our understanding of the world is still heavily influenced by what we learned back then. The world has changed, and really understanding this is transformational. You realize that the world's actually different from what you have still somewhere in the back of your head because that's the way you learned about the world in school. The data is there and it's just about access to the data. Some part of it is data literacy, that's what Hans Rosling's talks were really powerful in doing, like making it more accessible. The second part is also enhancing what data is actually available, increasing the scope of available data.

Alexandra: I guess that has something to do with how MOSTLY AI was founded. On the one hand, your passion for privacy, and then also understanding of the potential that data democratization could have.

Klaudius: Absolutely.

Alexandra: How did it actually happen, what was your role, and how did you come to found MOSTLY AI?

Klaudius: Yes, sure. MOSTLY AI was founded in 2017. If you try to put yourself in that context, that was about a year before GDPR in Europe came into force. It was already decided. The outline was there, but it wasn't enforced yet. It was also the time when the first generative deep learning networks showed some promise. Their results were still very sketchy. You had images like 16x16 pixels, or 32x32 pixels, really sketchy, but it showed that there is some promise there. We sat together and thought about this practical problem we all knew from our previous work in data science in different companies.

That data access is complicated from a legal perspective. Even if you have the legal goal to use some data, aren't there some technical measures that we can implement to improve the security and the safety of all of this? We sat there, thought about this problem that we all knew very well [laughs]. We've sat for months in positions where we were just preparing access to data. We knew the pain, and then we asked ourselves: is that something we can solve by the time that GDPR comes into force? Yes. Then we started working on it and figured it out.

Alexandra: Amazing. The pre-GDPR world, I can't even think back to that time.

Klaudius: [laughs].

Alexandra: Many things have changed since then. How did things change for you? What are the most important topics for you right now especially when it comes to synthetic data?

Klaudius: I would say the biggest changes with GDPR, but also with other legal changes and also high profile practical cases, the understanding of privacy has evolved quite a bit. Back in 2017, it was not clear to a wider public what the importance of privacy is. It was less clear what privacy actually means from a more technical point of view. Maybe to give you some context here. Many people today would talk about differential privacy as one of the main quantum leaps in understanding of privacy of the last decades.

Even though the concept was developed in 2006, it was only in 2016-17 that the big tech companies started using it. 2017 was really a point where the understanding of privacy concepts was at a rather early stage. What changed since then? I would say the awareness of privacy as a big topic has increased a lot. Today the conversations I'm having rarely revolve around why privacy is important in the first place or why I can't just remove a name from the data set and call it private.

The conversations today are on a more informed technical level I would say.

Alexandra: Good to hear.

Klaudius: That is one point, and second, of course with technologies like synthetic data out there, a lot of the conversation has shifted to, how can I use this for my specific problem that I have here?

Alexandra: Absolutely. Talking about use cases, do you have any favorite synthetic data use case because it's an enabling technology that can really facilitate so many different use cases? What's your personal favorite synthetic data use?

Klaudius: Of course. As I mentioned, for me, data democratization is very close to my heart. For me, some of the favorite use cases are around publishing and open data. Some of that is in the official statistics and government domain, to really make data available to the public. I love looking up some official statistics on topics that I'm discussing. If in the conversation, something comes up like societal policy topics, it's always good for me to look whether there's some actual hard data on that that I can use to understand this problem in more detail.

Some of this cannot be answered right now because the data's not available because some answers would need a very fine-grain access to data that's only now emerging. That's probably my favorite use case. Make it possible for everyone, every citizen to access open data repositories on a very detailed level to understand the inner workings of society [laughs].

Alexandra: Absolutely. Really access to granular data to really be able to understand humans better, understand society better and make better decisions for all of us. Thank you very much for these history lessons both on your personal history and MOSTLY AI's history. Now that we've covered the stories, let's come to the actual meat of today's episode, because we wanted to have an episode on the privacy of synthetic data. Let's start with a very easy question. Why do we need to protect privacy in the first place?

Klaudius: Sure. As I already mentioned, I think privacy is very much about information inequality and therefore also power inequality. That's for me a fundamental part of democracy. If you want to be free in a meaningful way, you need to be able to protect your personal information. Something that's very, very, very personal to you from the misuse of that information by others. That sounds very abstract. What does it mean in particular? Privacy has been very fundamental from the very beginning of statistics and data usage because to be able to get to informed discussions, you need high-quality data.

To get to high-quality data, you need an honest source of data. Let me put it that way. That could be in some situations that's easy, you have a direct measurement of, let's say weather patterns. As soon as humans get involved, you need to make sure that for these people there is no risk of sharing information about them. Otherwise, you will not get quality and ultimately honest answers to important questions especially if you want to discuss any sensitive topics.

If you want to discuss health issues, if you want to discuss financial issues, then for people to share such information, they need to be really secure in the knowledge that this information will not be used in any way against them. That's why privacy is so important to have a data-driven decision-making process in any way.

Alexandra: Absolutely, and to preserve it. This just reminds me of an example I came across in a book the other day where they wanted to figure out how to measure how many people cheated to put it nicely on their taxes. How they could actually construct the questionnaire and also the situation. How they asked for this information so that the likelihood of people actually opening up and not being afraid of any consequences for them was higher. They implemented some measures that gave them the opportunity to either answer truthfully or under certain conditions explicitly not to tell the truth.

These conditions were specified in a way that later on, with statistics, it was possible to get back to an estimate of how many people gave the answer that they potentially could do a little bit better when it comes to honestly paying their taxes and so on, so forth. I think that's a very, very good point why privacy is such an important topic. Definitely, also something where I'm personally not happy if somebody says, "Well, why should I care about my privacy? I don't have anything to hide." I think that's a topic that's important for all of us and it's not only about your individual privacy, but also the situation for society.
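The survey design Alexandra describes resembles what statisticians call randomized response. The following is a minimal sketch of one common variant, not the questionnaire from the book: it assumes each respondent answers honestly with probability `p_truth` and otherwise reports a fair coin flip. The function names, the 20% rate, and the population size are all invented for illustration.

```python
import random

def randomized_response(truth, rng, p_truth=0.5):
    """Answer honestly with probability p_truth, otherwise report a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5

def estimate_true_rate(answers, p_truth=0.5):
    """Invert the known noise: P(yes) = p_truth * rate + (1 - p_truth) * 0.5."""
    observed_yes = sum(answers) / len(answers)
    return (observed_yes - (1 - p_truth) * 0.5) / p_truth

rng = random.Random(0)  # fixed seed so the sketch is repeatable
true_rate = 0.2         # hypothetical share of respondents who cheated
population = [rng.random() < true_rate for _ in range(100_000)]
answers = [randomized_response(t, rng) for t in population]
estimate = estimate_true_rate(answers)
print(f"estimated rate: {estimate:.3f}")
```

No single "yes" proves anything about the person who gave it, because it may have come from the coin, yet the population-level rate can still be estimated.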

Now that we've answered this simple question, let's come to the hard questions. Why do you need synthetic data to protect privacy? Isn't it enough to just anonymize data with some legacy anonymization technique and delete, let's say the name, the social security number, and maybe a few other data points should be sufficient? Shouldn't it?

Klaudius: Well, [chuckles]. The obvious answer is no, it's not sufficient. I can tell you why. Many of these data sets, many sources of information, contain very specific details that can be used to re-identify an individual. For example, if you're looking at medical histories of individuals, then removing the name and social security number does not protect anyone with any kind of medical history and, in truth, that's basically everyone, because some of these will be unique or maybe the combination of some of these will be unique.

There was a very high-profile case of the governor of Massachusetts in the '90s, which was exposed by Latanya Sweeney, who found the full medical records of the governor of Massachusetts in a supposedly anonymous, shared dataset by just looking up some information that was readily available in the media. I think some [crosstalk].

Alexandra: I think it was the list of voters or something like that.

Klaudius: Yes, there was some of that. But if you know from the media that a person was in hospital on a particular day, and there's some rough outline of what it was, you can typically look this up: there are not a thousand people with a particular heart condition in a particular hospital on a particular day. Even if there were more than one, which is already rare, then if you had some other demographic data, like is that person male or female? Is that person in their 50s or 60s or 70s? With that, we can quickly make this record unique again, and then you know which record belongs to the person you were looking for.

Then you can start figuring out all the other medical details of the individual. Even if you don't have this one single high-profile event that identifies you, just the fact that you might have had a cold and went to the doctor in this particular week, last year. Then maybe some months before you were at the dentist in a particular week, that already could be a unique combination.

Alexandra: Absolutely. I'm also not 100% able to recall the Latanya Sweeney case from back then, but I thought that initially, when she started out, she took this publicized health care information, which was supposed to be anonymous. Then she found some publicly available list of voters in some regions. I think there was only one person living in the same district as the governor of Massachusetts. She showed that she could re-identify him, but then she, of course, continued to study this. I think in the end, she uniquely identified over 80% of people, just with zip code, birth date, and gender.

It's really very limited information that today is sufficient to revert this legacy anonymization.

Klaudius: Yes, it really goes to show how you don't need a very high-profile bit of information. Just a combination of seemingly normal features could already be enough. If you want to look into that in more detail, there's a cool project I think it's called Privacy Observatory, where you can enter some very harmlessly seeming personal data. Like rough age bracket, the country you live in, et cetera, and then you realize that you're pretty unique with just the combination of a few of those.
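To make the point about combinations concrete, here is a small sketch that measures what fraction of records become unique as more harmless-looking attributes are combined. The six records and the helper name `unique_fraction` are invented for illustration and have nothing to do with the datasets discussed above.

```python
from collections import Counter

# Invented toy records: no direct identifiers, only quasi-identifiers.
records = [
    {"zip": "10001", "birth_year": 1970, "gender": "F"},
    {"zip": "10001", "birth_year": 1970, "gender": "M"},
    {"zip": "10001", "birth_year": 1985, "gender": "F"},
    {"zip": "10002", "birth_year": 1970, "gender": "F"},
    {"zip": "10002", "birth_year": 1970, "gender": "F"},
    {"zip": "10002", "birth_year": 1985, "gender": "M"},
]

def unique_fraction(rows, quasi_identifiers):
    """Share of rows whose combination of the given attributes occurs exactly once."""
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)

for columns in (["zip"], ["zip", "birth_year"], ["zip", "birth_year", "gender"]):
    print(columns, round(unique_fraction(records, columns), 2))
```

In this toy table nobody is unique by zip code alone, but most rows are unique once all three attributes are combined, which is the pattern the examples in the conversation describe at population scale.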

Alexandra: Absolutely. I think one of my favorite examples is also this TED talk from Yves-Alexandre de Montjoye from Imperial College London, where he talks about location data. He has these nice visualizations where you see all the people traveling around, I think it was Belgium, where they did this research. Then I think with only four data points, they were able to uniquely identify over 90% of people. That really makes it apparent and visual how little data is needed because especially with these location data points, telecommunication companies have dozens or hundreds of those per individual.

Just a handful is really enough to identify one person. Really something that has to be taken seriously.

Klaudius: Exactly.

Alexandra: We have established that it's important to protect privacy, but why do we now need synthetic data?

Klaudius: In the legacy portfolio of anonymization techniques there are basically two approaches. One would be to keep the original records and modify them in some ways, and the other one would be to aggregate or merge the records into larger groups. I can discuss the limitations of those, one after the other. The first one, where you keep the original data and just modify them so that they're not recognizable anymore, has a clear limitation: if you do not know what makes an individual unique, or if you do not know what potential other data sources someone could have to re-identify the individuals, that will always be an open door for attacks.

If that is breached, then you have the real data. You have the real individual there, and you might know that some of the values might be changed. You might even get access to details about the anonymization process. If you had, for example, some noise addition, you could use that as well to make an inference. For example, if I found the record of a target individual in a so-called anonymized data set, then I would look up their income, and I would know that the income had noise added with a mean of zero and a variance of, I don't know, X.

Then I know that the real person's income must be within this general bound that might be already very sensitive information to have about that individual. That's obviously a huge risk of this type of anonymous data where you try to keep the original data and just modify them, but then the linkage is still open.
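The inference Klaudius describes can be sketched in a few lines. This assumes, purely for illustration, that the added noise is Gaussian with mean zero and a variance the attacker knows; the function name and the numbers are hypothetical.

```python
import math

def plausible_range(published_value, noise_variance, z=1.96):
    """Approximate 95% interval for the true value under zero-mean Gaussian noise."""
    sigma = math.sqrt(noise_variance)
    return published_value - z * sigma, published_value + z * sigma

# Hypothetical: the linked record shows an income of 52,000,
# and the attacker knows the noise variance was 1,000,000 (sigma = 1,000).
low, high = plausible_range(52_000, 1_000_000)
print(f"true income is very likely between {low:,.0f} and {high:,.0f}")
```

The smaller the noise, the tighter this range, so keeping the data useful and keeping the individual protected pull in opposite directions once the record has been linked.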

Alexandra: I wouldn't even say this type of supposedly anonymized data because it just is really not sufficient to use this approach anymore.

Klaudius: Yes, exactly. It's basically pseudonymized data.

Alexandra: Sorry, you wanted to continue with the second part.

Klaudius: The second part is about data aggregation. Here you remove any information about specific individuals and only publish information about groups. The problem with that is that depending on how your groups are defined, you can use these groups to reconstruct information about individuals. The simplest example would be if you have, for example, incomes of groups and you only publish the mean income of specific groups. Then you might have one data set or one group that is the same as another one, but including one new individual.

Let's say you have the average salary in a particular company, and then one new person joined or one person left. If you then have the average salary afterwards, you can calculate the exact income of that person just based on these group statistics. Some very practical attacks in this direction have been demonstrated by the American Census Bureau, who also publish aggregated statistics. From the aggregate statistics, if they're not adding any additional safety measures, you could reconstruct many individuals, really individual data points.
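The salary example is a differencing attack, and the arithmetic is short enough to write out. The function name and the figures below are made up for illustration.

```python
def salary_of_joiner(mean_before, n_before, mean_after, n_after):
    """Recover one person's salary from two published group averages."""
    if n_after != n_before + 1:
        raise ValueError("sketch assumes exactly one person joined")
    # Total after minus total before is exactly the newcomer's salary.
    return mean_after * n_after - mean_before * n_before

# Hypothetical: 10 employees averaging 50,000, then 11 employees averaging 52,000.
print(salary_of_joiner(50_000, 10, 52_000, 11))
```

Neither average reveals anything about an individual on its own; the leak only appears when two overlapping aggregates are compared.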

Alexandra: Yes and how they influence this. I think one of the big disadvantages besides the re-identifiability of these two approaches or these two different approaches to legacy anonymization is also that both of them are destructive, and really destroy lots of valuable information that otherwise would be super, super nice to have and insightful to have. Can you tell our listeners a little bit about the privacy utility trade-off?

Klaudius: Sure. In these classical approaches, because you have these huge doors for attack, to be safe, you need to destroy basically all the information that is in the data. Any information that's left could also be used to potentially launch one of those re-identification attacks. That means the privacy utility trade-off is really bad. You have to change, destroy most of what's there because that's the only way to reduce the attack surface that you have. That is where synthetic data comes in and changes this game quite significantly.

If you use synthetic data generation for privacy, then what you do is you learn an abstract representation of the population. Then you sample from there and get completely new individuals. There's no one individual that can be linked to a real-world individual. There are no groups that you can use to mix and match and produce some specific overlapping groups that will allow you to draw an inference, because those individuals in the synthetic set do not correspond to real individuals. There's no overlap that you can create to single out anyone in particular.

That also means that the privacy utility trade-off is completely different, because what you need to do then is reduce the individual specificity of this learned representation. It needs to be really only about the population statistics, but those can be learned in a high level of detail as long as it's something that's a general feature and not an individual characteristic. Then you can create synthetic individuals that have all of these population characteristics, all of these relationships with a very high level of detail, but without the privacy risks that you would have if you wanted this high level of granularity while only destroying some of the data.

Alexandra: Yes, absolutely. I think that's really one of the reasons why we've seen so many organizations now really focusing on synthetic data because it allows you to securely protect privacy and be safe from re-identification risks. At the same time, you have super valuable, super granular data that you can use for your analytics, AI training, testing purposes, and so on and so forth. I think that's also because, as you mentioned when we started out, the awareness of how important privacy is has changed from when MOSTLY AI was founded to what we see now in the market, especially when it comes to legacy anonymization.

I can remember the conversations that we had with global enterprises that really used legacy anonymization approaches, left too much information intact, and were not aware of the consequences that this could have for an organization. Then they became much more aware of this, started deleting basically all of the information they had, and then, of course, ran into a state of fuel crisis where not enough data was available for the tasks that needed this information. I think GDPR, of course, also played a role in this because we know that GDPR explicitly strengthens the requirements for anonymous data.

It really should be impossible, both for an organization and for third parties, to revert and re-identify this data, with current technical means but also with future technical means. I think this is also one of the big contributors to why privacy is taken seriously now and why so many organizations rely on synthetic data to really fulfill GDPR requirements and requirements of other privacy legislations. One thing that I think would be interesting to cover in our episode as well: we know that by default, synthetic data that is generated with our platform is fully anonymous.

We also have a bunch of legal and technical assessments attesting that our data is impossible to re-identify and thus exempt from GDPR, CCPA, and other privacy legislations. Is all synthetic data automatically private by design? What does someone who's creating a synthetic data generator, so a model to generate synthetic data, him or herself need to keep in mind to make sure that the data this generator produces is truly anonymous synthetic data.

Klaudius: Yes. That's an important question and it really goes back to this discussion. What level of abstraction is your model able to learn? That is the key to this. You need to train a model that learns the abstract pattern of the population, but does not memorize individuals. You need to make sure that your model cannot overfit to specific individuals, not for what I would call the main part of the distribution, but also not for outliers. You really have to make sure that your model is safe against any kind of memorization of input data.

We've seen open-source generators being attacked, and I can discuss some of these attack scenarios, in ways where you could use the synthetic data sets generated by them to make some inference about some real-world individuals. You already see from this that it's a bit of a different conversation than with classic anonymization, like linkage, where you find an individual. Here it's really about, can you extract with some probability, some information that could be related to some individual? What does it look like?

For example, if you have synthetic data that was generated by a model overfitted to a particular dataset, you could potentially train a predictive model for a target attribute. If that is overfit to the synthetic data set, you would still be very close to highly individualized predictions. To give you a more specific example, if I were to try to find out someone's income based on such a dataset, then I could train a model. Then I would get a prediction for the income of a person that is very similar to the characteristics I know about the person.

Maybe I know their age, I know the gender, I know where they live, et cetera, and I have this range of attributes that I know, and then I would ask the model, "What's your prediction for the income of a person that looks like this?" That's the profile that I know about the person. Then if that prediction is more similar to the actual individual that had this profile than it would have been if that person was not part of the training data, then that is information I should not be allowed to get out of this dataset. That's the risk if you have these overfit types of models.
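The comparison Klaudius outlines can be sketched with a toy attacker. Everything here is invented for illustration: the records, the nearest-neighbour "model" in `predict_income`, and the two synthetic datasets, one standing in for an overfit generator that memorized the target and one for a generator trained without the target.

```python
def predict_income(synthetic, age, zip_code):
    """Hypothetical attacker model: copy the income of the most similar synthetic record."""
    nearest = min(synthetic, key=lambda r: (r["zip"] != zip_code, abs(r["age"] - age)))
    return nearest["income"]

target = {"age": 47, "zip": "A", "income": 250_000}  # invented outlier

# Stand-in for output of a generator that memorized the target (overfit).
synthetic_with_target = [
    {"age": 47, "zip": "A", "income": 249_000},
    {"age": 45, "zip": "A", "income": 61_000},
    {"age": 52, "zip": "B", "income": 58_000},
]
# Stand-in for output of a generator trained without the target.
synthetic_without_target = [
    {"age": 46, "zip": "A", "income": 63_000},
    {"age": 45, "zip": "A", "income": 61_000},
    {"age": 52, "zip": "B", "income": 58_000},
]

error_with = abs(predict_income(synthetic_with_target, 47, "A") - target["income"])
error_without = abs(predict_income(synthetic_without_target, 47, "A") - target["income"])
leak = error_without - error_with  # how much the target's presence sharpened the guess
print(error_with, error_without, leak)
```

A large positive `leak` means the prediction is much closer to the real person when their record was in the training data, which is exactly the information that should not be obtainable.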

Alexandra: Yes, absolutely. To sum it up for our listeners, it's not enough to just synthesize data. You really have to have strong privacy mechanisms in place to be sure that the data is compliant with GDPR, CCPA, and other privacy legislation. Also that the privacy of your customers is protected. You mentioned a few privacy threats already. What are the actual privacy threats that synthetic data protects against?

Klaudius: We discussed some of the old school privacy threats. They're the high-risk ones, I would call them, where you find an individual in a dataset that has been de-identified, but you still re-identify the individual and know all about them. Or you link that individual's record to some other data that you have and then you have a lot of information about them that you shouldn't have. All of these scenarios, the re-identification, the linkage attacks, the singling out of individuals, are basically off the table when you talk about synthesization because the synthetic data will be completely new individuals.

Then the discussion shifts to what I would call the more manageable risks, and there are two in particular. One is attribute inference. That's the example I gave before: is there some way I can train some kind of model on the synthetic data that will give me more information about the target attribute, let's say income, of someone I want to query than the information I would have had if that person's data had not been used in the synthesization process? The second one is called membership inference, and the question here is, can I learn from looking at the synthetic data whether a particular person has been in the training data or not?

That could be relevant if you, for example, look at a company's data of their customers. If you can figure out whether a particular person has been in the training data, you can infer that this person is, or has been, a customer of that company, which could already be sensitive information, right? Those are the attack scenarios that we think about when discussing synthetic data. What's good about them is that you can quantify them. If you can quantify them, you can also take measures to reduce the risks to basically zero.

I mentioned, for example, that the target state would be that you cannot see, or cannot infer, whether a particular individual has been in the training data or not. You cannot learn anything about that individual that you wouldn't have learned if that person had not been in the training data for the synthetic data. If that's the case, then you can clearly say there is no leakage of information from individuals in the data into the synthetic data.

Alexandra: To make this easier to understand for our listeners, would it be a correct example if we said, for example, a person is 92 years old, and the model predicts the gender of this person and predicts that this person is female? This prediction should not change based on my actual 92-year-old grandma being in the dataset or not. It's just because the model picked up the pattern that females usually get older than males and therefore came to this conclusion, independent of my grandma being in the dataset or not.

Klaudius: Yes, exactly. Maybe to frame the attack scenario in more specific ways. In this scenario, I don't know the exact numbers, but let's say you have a 75% chance of being female and 25% of being male if you're 92 years old. Those are just rough numbers. If I had a predictive model that predicts the gender of a person based on their age or whatever other attributes are in the dataset, those are the numbers that I would get. That's perfectly fine, that's the characteristic of the population.

But suppose I could improve my prediction of whether the person is male or female by adding some other attributes that I know, which do not on a statistical level carry that information. Let's say I add the eye color and hair color of the person, and maybe the street they live in. If I then get, say, an 80% or 90% chance that she's female instead, this increase in probability could give you a window into this specific person, and that's what you want to protect against.
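
The comparison described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the attribute names, and the helper `p_female` are made up, and a simple conditional frequency over the data stands in for a predictive model that has overfitted.

```python
def p_female(records, **known):
    """Share of 'female' among records matching every known attribute."""
    matches = [r for r in records if all(r[k] == v for k, v in known.items())]
    if not matches:
        return None  # no matching record, so no estimate at all
    return sum(r["gender"] == "female" for r in matches) / len(matches)

# Hypothetical population of 92-year-olds: 75 women and 25 men.
population = (
    [{"age": 92, "gender": "female", "street": "Main St"} for _ in range(75)]
    + [{"age": 92, "gender": "male", "street": "Main St"} for _ in range(25)]
)
# One specific person with a rare attribute value (hypothetical).
grandma = {"age": 92, "gender": "female", "street": "Elm Lane"}

with_her = population + [grandma]
without_her = population

# Population-level pattern: barely moves whether she is included or not.
general_with = p_female(with_her, age=92)        # 76 of 101
general_without = p_female(without_her, age=92)  # 75 of 100

# Adding a rare attribute makes the answer depend entirely on her record.
specific_with = p_female(with_her, age=92, street="Elm Lane")
specific_without = p_female(without_her, age=92, street="Elm Lane")

print(general_with, general_without, specific_with, specific_without)
```

The first pair of estimates is the harmless statistical inference; the second pair, which changes completely depending on one record, is the kind of individual-specific signal a privacy evaluation would want to detect.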

Alexandra: I understand. Talking about statistical inference, what actually is privacy leakage and what isn't? I'm asking because we sometimes have these conversations and also see in the privacy research that there is some heterogeneity in how people understand this. Can you elaborate on this a little bit?

Klaudius: Sure. Sometimes people come from this understanding that if, in synthetic data, you see an individual that looks very similar to a real individual, that could be an indication of a privacy problem. Of course, it looks like there could be something, but whether there is a privacy leakage depends on whether this particular constellation of attributes is something that you would expect anyway in the data. A very interesting example here came from a synthetic dataset we did with the City of Vienna, where they did some privacy evaluation by checking whether individuals in the synthetic data looked in any way similar to real individuals.

Alexandra: What type of dataset was it?

Klaudius: That was demographic data, based on some census information, so you would have moderately granular information about where the person lives. Like which city block is it? Let's put it that way. Then birth date, the date since when they have been registered at that particular address, and gender. Occupation, I think, was also part of it, at least in broad categories. They looked at the synthetic individuals that were most similar to real individuals.

These were individuals where, I guess, your own estimate of what the attributes of that person should be would be just as accurate as what was in the synthetic data. For example, in a residential district, in one block of that district, there was a six-year-old person. That six-year-old person was also female, that person did also go to school, and that person also had an income of zero. That person also had the relationship status single, never married.

Of course, the likelihood of finding individuals with this combination of attributes is there just because that's what the population looks like, and you have to have these individuals in all the city blocks. Does it leak information about any actual individual? No, because you never know whether there's really a six-year-old there, or whether in the real world there are, I don't know, some five-year-old or seven-year-old children in those city blocks. Depending on the size of those blocks, you might have quite a few that look similar. That's where you see that similarity doesn't necessarily mean privacy leakage.

Similarity can only be related to privacy leakage if it is specific, in the sense of: if that particular person hadn't been in the training data, would you draw the same conclusions? In that case, a six-year-old who is never married and has an income of zero, that's the conclusion I would have drawn with exactly the same accuracy even if the real person that looks similar wasn't in the training data, because that's just the pattern. On the other hand, take a specific outlier, let's say a 101-year-old person in a particular city block. If I would have a higher likelihood of finding a person of the same gender, or whatever the other attributes are, than just from chance and the general distribution of the population, then that could be a leakage. That's why you need to specifically target privacy evaluations.

Alexandra: This is also why we have this rare category protection in place, so that really the extreme outliers, like the very few people that are even 110 years old, for example, would be excluded. Other than that, you would get a very granular distribution that's statistically similar to the actual distribution of, let's say, that city area. One thing that I wanted to clarify: would it also be correct to assume, because you mentioned this, that I could look at a synthetic dataset and get the impression that I actually know that six-year-old who is still single, doesn't have an income, and lives in this city block, simply because quite likely there are a few of them living there?

It's such high-level information, not very specific information, that the likelihood of knowing somebody who fits these characteristics is rather high, and therefore it wouldn't actually be suitable for identifying this person. Something like this could happen, but if you have properly generated synthetic data, where you generate not only the handful of attributes you described, like income, relationship status, and so on, but hundreds of these attributes, it wouldn't happen that you get somebody who matches in all the categories so that you would really be able to identify this person, exactly because we have all these privacy mechanisms in place.

Klaudius: Exactly. In this particular case, we also had a birth date and a date since when this individual was registered at this place, and those did not match exactly. It's another six-year-old with a different birth date. The pattern that was learned is that the date since when they were registered at that particular address was very close to the birth date. That's a pattern from the population: if a person is six years old, it's not very likely that they have moved in their life. You get these patterns, but not the highly granular specifics, when you look at more detailed information like the exact day.

Alexandra: Absolutely. I think it's such a beautiful use case because, just a little bit background for our listeners, the city of Vienna is really one of the most innovative cities when it comes to data. Also, just the ambition to be one of the leaders in the governmental space with AI. They have this open data initiative ongoing and they really looked into synthetic data to find a solution where they could safely make granular synthetic data available to researchers, startups, small and medium enterprises to really also, as you mentioned, democratize the access to data.

I think it's a super useful technology because before synthetic data, all of these requests they received had to be approved and also anonymized on a case-by-case basis, and of course, they weren't able to share much more than just a few data points per individual. Here, it's really possible to share a complete picture. Maybe not from the City of Vienna case but from another public sector organization: can you elaborate, Klaudius, on this one example of how information-rich synthetic data is, the one with the married couple and the child? It was, I think, insurance, or maybe you can--

Klaudius: Sure. That was in health insurance, where we had a dataset about a population of insurees, and we were able to generate a synthetic population of Austria that had some really highly detailed patterns being learned. For example, we had a synthetic individual who was female, from a particular region, and I think 26 years old or something, and even relationships with other individuals in the dataset had been created. This person had a husband who, from the age point of view and the date of marriage, also matched pretty much what you would expect: a similar age and a plausible date of marriage.

Then there was a child that was born in the same year as this date of marriage. All of that fit together pretty nicely without any of this information being similar to any real individual in particular, but the pattern would be instantly recognizable as a plausible pattern of a person's life.

Alexandra: Absolutely. I think what was so surprising for us was really that, I forgot the actual name of the community where the synthetic individuals, the married woman and the married husband came from. When we evaluated the data, we suddenly saw that the child was born in a different community. I think we found out that-- Maybe you can--

Klaudius: I can give you that. That was one very interesting aspect of this. We didn't even know about this before looking at this particular example in this dataset, but both of the parents in the family were born in a particular municipality in Austria. The child was born in a municipality that sounded similar but was actually different. I think it was Kohlberg in [unintelligible 00:44:57] and Kohlberg [unintelligible 00:44:58]. They sound similar, but we wondered where the change came from. Then, when looking up this particular municipality, it turned out that in the 2010s, two neighboring municipalities had been merged into one municipality with a new name.

In the synthetic generation, since this synthetic child was born after this date, it had the new municipality name as the place of birth, not the old one, even though the parents had the old one. The model even learned that in that particular region, there has been a change in the names of these municipalities.

Alexandra: I think that's an amazing example of how domain-agnostic generative AI is for generating synthetic data and really picking up all of these insights that are present in the data but can't be safely uncovered due to privacy reasons. I think that's also one of the big, big benefits we see with synthetic data compared to legacy anonymization, that you don't have to have this knowledge. That you're really able to synthesize a complete data set and get all of the information with all of the insights that you're not yet even aware of being present in the data set.

With legacy anonymization, by contrast, you have to think on a case-by-case basis: what is the absolute minimum information, what are the absolute minimum fields that I need to answer a specific question? Then you try to distort, mask, or obfuscate the data in those fields in a way that you hope keeps privacy safe. That is, on the one hand, not safe, and on the other hand, super, super slow, and really one of the reasons that these data access requests can take many months in large organizations and really slow them down.

Klaudius: I would also highlight this: imagine you're building a data application for this dataset and you're trying to build test cases. With edge cases like this, you will never think about such details, but the elegance here is that the model learns them automatically, so you don't have to think about all of these different specific edge cases. It might be publicly available knowledge that these municipalities have been merged, but as a person manually creating the test data, you'll never think of including all of this.

Alexandra: Absolutely. One other question I have for you, Klaudius, is you mentioned differential privacy beforehand and that this was also really a paradigm shift in how we think about privacy. That actually one individual should not influence the result of a query to a specific data set of information that you find available. What are your thoughts on differential privacy and also differentially private synthetic data?

Klaudius: I would say that differential privacy has really transformed the way that the privacy community thinks about the concept and how to measure privacy. Previous approaches to privacy like k-anonymity have had a very different outlook on privacy, in that they try to measure how big of a group any particular information relates to. Differential privacy has really changed that to a measure that I would say now has a broad consensus of being the meaningful measure. That is: what influence does one individual have on any particular algorithm or any outcome?

Alexandra: A quick question before we dive deep into differential privacy. For all of those listeners who are not as passionate about privacy and measuring privacy as you are, can you quickly explain k-anonymity so that it's easier to understand, and distinguish between this old paradigm of thinking of privacy protection, measuring? Then what's changed with differential privacy? What's k-anonymity?

Klaudius: Sure. K-anonymity means that you don't publish any number that does not relate to at least K individuals. For example, if you had a population count of a particular district, you only publish that number if at least K people lived in that district. Anything that relates to a smaller group would be censored.

Alexandra: For example, with health diagnosis, if I have a super rare disease or if only like three people in a dataset have the super rare disease, it wouldn't be published but if there are 10,000 people who suffered from the flu in a year, then the group is large enough.

Klaudius: Yes, exactly. If it was something rare then you wouldn't say flu but you would say some kind of respiratory illness, for example, so that the group would get larger by this broadening of categories.
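
A minimal Python sketch of the two steps just described, suppression of small groups and broadening of categories. It follows the simplified description of k-anonymity given in this conversation; the threshold `K`, the diagnosis counts, and the mapping `BROADER` are hypothetical.

```python
from collections import Counter

K = 5  # hypothetical minimum group size

# Hypothetical diagnoses: one common illness and two rare ones.
diagnoses = ["flu"] * 10 + ["rare lung disease"] * 3 + ["bronchitis"] * 4

def publish_counts(values, k):
    """Publish a count only if it relates to at least k individuals."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n >= k}

# Suppression: groups smaller than K are censored.
suppressed = publish_counts(diagnoses, K)

# Generalization: rare diagnoses are replaced by a broader category,
# so the group they fall into becomes large enough to publish.
BROADER = {
    "rare lung disease": "respiratory illness",
    "bronchitis": "respiratory illness",
}
generalized = publish_counts([BROADER.get(d, d) for d in diagnoses], K)

print(suppressed)
print(generalized)
```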

Alexandra: I think we understand. What's the flaw of k-anonymity?

Klaudius: The flaw is that k-anonymity actually still exposes individuals in the dataset to some kinds of attacks where you can reconstruct the original data. This has been shown, for example, by the US Census Bureau. With their k-anonymous publications, in some constellations just combining different groups, different definitions of the population, could lead to an attack where individual information could be singled out and reconstructed from the merging of different k-anonymous datasets.

Alexandra: I think there were also these examples where you have k-anonymity for, say, the salaries of people within a company, sorted by age groups, and then some people get older and move to the next age group. Then somehow you can reconstruct the salary of specific individuals.

Klaudius: Yes, exactly. In the extreme case, if a group changes only by one person and you can see the difference of the result between these two group statistics, you basically have the value that this one person contributed.
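
The extreme case can be shown with a few lines of Python. This is a sketch with invented names and salaries; both published groups are assumed to be large enough to pass a k-anonymity threshold on their own.

```python
# Hypothetical salaries; names and numbers are made up for illustration.
salaries = {
    "alice": 52000,
    "bob": 61000,
    "carol": 58000,
    "dave": 49000,
    "erin": 75000,
}

# Age group "30 to 39" as published in two consecutive years.
group_year_1 = ["alice", "bob", "carol", "dave", "erin"]
group_year_2 = ["alice", "bob", "carol", "dave"]  # erin moved to the next age group

# Only the aggregate per group is published in each year.
total_year_1 = sum(salaries[name] for name in group_year_1)
total_year_2 = sum(salaries[name] for name in group_year_2)

# The difference between the two aggregates is exactly one person's value.
leaked_salary = total_year_1 - total_year_2
print(leaked_salary)
```

Each aggregate on its own hides the individual, but the two together do not, which is the weakness being discussed.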

Alexandra: Okay, makes sense. Thanks for explaining, let's maybe come back to differential privacy. What is differential privacy?

Klaudius: Differential privacy takes a different view by measuring the maximum influence that any one individual could have on a particular outcome. The outcome without the individual needs to be more or less the same as the outcome with the individual. For example, whether you participate in a study or not, the results of the study should be more or less the same. This "more or less" is specified with a particular privacy bound that's related to a parameter that's typically called epsilon.

Roughly speaking, the likelihood of certain events can only change by a small factor when you include an individual in the data that was not there before, or remove them.
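
For readers who want the formal statement, the standard textbook definition of epsilon-differential privacy can be written as follows. The symbols are not from the conversation and are introduced here only for illustration: $M$ is a randomized algorithm, $D$ and $D'$ are any two datasets that differ in the data of a single individual, and $S$ is any set of possible outcomes.

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,]
```

The smaller $\varepsilon$ is, the closer the factor $e^{\varepsilon}$ is to 1, and the less any one individual can change the likelihood of any outcome.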

Alexandra: A practical example, we referenced my grandma already earlier in the episode. It shouldn't make a difference on the prediction, for example, of an algorithm whether my grandma was actually present in the data set or not, no matter whether her data was in there or not, you would get the same answer that people over 80 or over 90 are more likely to be female than male and it shouldn't shift only because my grandma who is over 90 and female was in the dataset.

Klaudius: Exactly. In the context of synthetic data, there are a couple of things that I think are noteworthy. There are some theoretical and some practical limitations of the differential privacy framework when you think about synthetic data. Maybe I will start with a theoretical one. Please bear with me here for some 20 seconds. I'm using some mathematical jargon just for those people who know this to understand what I'm talking about.

Alexandra: Yes, I will stop the clock. 20 seconds of jargon you will get but not more.

Klaudius: In probability theory, you have a probability space that consists of a sample space, an event space, and a probability function. Differential privacy only concerns itself with the probability function, so basically, what's the probability of certain events. However, the sample space itself could also leak some information, depending on how you construct it in the process of the data creation. That's something that's completely out of scope of differential privacy. Therefore, if you want to have privacy in that data, you need to think about how you construct the sample space as well.

Alexandra: Can you give us a more tangible example of what this means?

Klaudius: Yes. So much for the jargon. The more tangible example is categorical variables. Let's say a person's zip code: in a machine learning pipeline, this would be encoded in some numbers, and the set of possible values is the sample space. What it means is that whether a person lives in a particular zip code is encoded in one dimension of the data, which is basically zero if the person does not live there, and one if the person lives there.

The fact that there is this dimension already tells you that somehow, in constructing the probability space, you knew that this dimension is relevant. That it's a relevant aspect of the information in the dataset whether a person is living in this particular zip code or not, or having a particular job title or not. If the existence of that dimension is itself revealing, for example if your dataset included a category "President of the United States" as a job title, then this could leak information.

An attacker could infer that, oh, perhaps the President was part of your original data. Perhaps that's because they're one of your customers. This type of information leakage needs some separate protection measures. In our way of thinking, this is implemented in different ways depending on what kind of variables there are. For example, for categories, it's a rare category protection that is there on top of any learning.
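
To make the idea concrete, here is a minimal Python sketch of protecting rare categories before the encoding dimensions are created. It is an illustration only and not MOSTLY AI's actual implementation: the function names, the `min_count` threshold, and the `_RARE_` placeholder are all assumptions.

```python
from collections import Counter

def protect_rare_categories(values, min_count, placeholder="_RARE_"):
    """Replace categories seen fewer than min_count times with a placeholder."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else placeholder for v in values]

def one_hot_columns(values):
    """The dimensions a one-hot encoding would create: one per distinct category."""
    return sorted(set(values))

# Hypothetical job titles; a single record carries a highly identifying category.
job_titles = (
    ["teacher"] * 20
    + ["nurse"] * 15
    + ["President of the United States"]
)

# Without protection, the mere existence of the column reveals the category.
raw_columns = one_hot_columns(job_titles)

# With protection, the rare category never becomes a dimension of its own.
safe_columns = one_hot_columns(protect_rare_categories(job_titles, min_count=10))

print(raw_columns)
print(safe_columns)
```

The design point is that this step happens before any model is trained, so it covers the sample space that differential privacy leaves out of scope.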

Alexandra: Understood. You also mentioned that there are some other limitations to differential privacy. What else comes to your mind?

Klaudius: As I mentioned, all the guarantees in differential privacy are framed in the context of a parameter that's called epsilon. This epsilon basically determines the factor by which a risk or a probability can increase when a single individual is added to a dataset or removed. In practice: how much more likely is an algorithm to guess your grandma's gender correctly if she was in the training data as opposed to if she was not? This example also relates to what we discussed previously, the attribute inference risk.

Now let's look at what different epsilons could mean here. Typically, people would recommend having as small an epsilon as possible. In practice, you will get more lax recommendations, like an epsilon below one is okay.

Alexandra: Yes, I think that's what most researchers list. From what I've seen, academia recommends an epsilon of one or below as the safe way. Is this correct?

Klaudius: Yes, exactly. Even an epsilon of one means that the probability of guessing this attribute right in an attribute inference attack could increase by a factor of up to roughly three, because the risk increases exponentially with epsilon. In practice, we see even larger epsilons than this being used, like an epsilon of four or eight has been floated around.

Alexandra: Yes, I think there were these newspaper articles a few years back, when Apple and other big brands started using differential privacy, which they marketed as some kind of guarantee of privacy, which differential privacy is. Of course, with a high epsilon, these guarantees became more or less meaningless.

Klaudius: Exactly. Specifically, an epsilon of four means that the likelihood of guessing a particular attribute right in an attribute inference attack can increase roughly 50-fold.

Alexandra: That's significant.

Klaudius: That's quite large. An epsilon of eight could mean it's increased 3000-fold. That is not exactly the level of privacy guarantee that you want to have for your customers, right? What it means is, it's not that the privacy of the real data point is necessarily at that much of a risk, but the guarantee doesn't tell you how large the risk to a particular individual in a particular dataset actually is. All it says is that, with the algorithm that you've got here, theoretically there could be a dataset such that, if you apply this algorithm to it, there might be someone who has this level of increase of risk.
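
The numbers quoted here follow from the exponential relationship between epsilon and the worst-case factor, which is e raised to the power of epsilon. A short Python check, as a sketch using only the standard library:

```python
import math

# Worst-case multiplicative factor on a probability for a given epsilon.
factors = {epsilon: math.exp(epsilon) for epsilon in (1, 4, 8)}

for epsilon, factor in factors.items():
    print(f"epsilon = {epsilon}: factor of up to {factor:.1f}")
```

This gives roughly 2.7 for an epsilon of one, about 54.6 for four, and about 2981 for eight, which matches the "up to three", "50-fold", and "3000-fold" figures mentioned in the conversation.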

For us, this means that the guarantee is not exactly what the privacy practitioner wants as a reassurance. What's more relevant to them is what are the risks to the actual individuals in that particular data set. What's the risk if I apply this algorithm on this particular data set to anyone in it? Not necessarily to any potential dataset to which the algorithm could be applied. To measure this, you don't really have a theoretical framework within differential privacy to measure specifically this.

What would be helpful here, of course, from a research perspective, is to have ways of sharpening this privacy guarantee under specific constraints. For example, you would say, "I'm only considering a population where no person has a negative age, or has an age over 200, or whatever." With these specific constraints, you could ask whether there are stronger guarantees for the individuals that fall within those assumptions about the variables, and that can become pretty strong.

Alexandra: This is why we are, as a company, fans of the concept of differential privacy, but due to the limitations that we see with implementing differential privacy, we prefer to think in terms of differential privacy when looking at privacy aspects. We really try to make sharper guarantees that apply to a specific dataset and provide something that's useful in practice when working with this data.

Klaudius: What we always try to do, with a privacy framework that has differential privacy in mind, is to measure what the maximum influence is that any one individual could have on the training of the synthetic data generative model. If that influence is very low, and here we can set much, much lower thresholds than you would have if we used classical differential privacy guarantees, we can really say that the influence of any one individual should be practically zero.

Alexandra: Yes, makes sense. I think this really is a promising approach and really helps to produce something where practitioners can have certainty that this is privacy-safe and can be freely shared. As mentioned, with differential privacy, yes, you get the mathematical guarantee, but only having differential privacy implemented doesn't guarantee that privacy actually is well protected. As you've just elaborated, how high the epsilon is set has a very strong influence on what the guarantee actually means for privacy protection.

Klaudius: Exactly, yes.

Alexandra: Wonderful. Cool, thanks a lot for sharing all your insights on privacy aspects of synthetic data and also differential privacy. We've talked for quite a while. Let's come to the end of this episode. My last question to you for closing out the session is: where do you see synthetic data being applied in the future, and where is it not yet in use?

Klaudius: For me, I would envision in the future being able to collaborate more on open data sources. Many datasets are still very much restricted in access because of privacy reasons. Not because the data owner doesn't want to share the information in there on the general patterns, but because of privacy reasons. If we can open that up and make data more accessible, also moving back to what I said at the beginning about data accessibility and data democratization, and really have all those smart brains around the world have easy access to high-quality data and then build a data-driven future with it, that would be where I see synthetic data in the upcoming years and decades.

Alexandra: I can only support that; it's definitely also a future I want to live in and want to see. Once again, Klaudius, thank you so much for taking the time to come on the podcast. I always enjoy having conversations with you and the thoughtful way that you answer everything. I think it was truly insightful for me. I also hope that our listeners learned a lot about privacy. Of course, if they have any open questions, then just reach out to us at the podcast. Thanks again, Klaudius, it was a true pleasure.

Klaudius: Thank you, Alexandra.

Alexandra: Wow, what an insights-packed episode. As you know, at the end of each Data Democratization Podcast interview, we collect the most important takeaways for you, our listeners. Let's bring it all together. What were the main points that we all should remember when considering privacy from a data science perspective? First, that data sharing is important, especially for research and for social progress. Open data projects are a cost-effective way to access data and to provide a common understanding to researchers.

Of course, synthetic data is a crucial tool to enable this data sharing because it ensures privacy protection. Second, privacy risk can be understood as an asymmetry of information and privacy protection should be about protecting individuals from the exploitation of this asymmetry. Number three, how businesses understand data privacy evolved significantly since the advent of GDPR. Conversations are now more informed and have turned towards the solution to privacy problems. Number four, informed decisions need high-quality data.

To ensure high quality and also truthful data, people need to be certain that the data will not be misused. This is especially important when collecting data in sensitive areas like healthcare or finance. Number five, legacy data anonymization techniques are insufficient for protecting privacy. In the legacy portfolio of anonymization techniques, there are basically two approaches. Either, number one, you keep the original records and just modify them. The limitation here is that if you don't know what makes someone unique, there will always be an open door for an attack.

Even with added noise, sensitive information will be leaked that can lead to re-identification. The second approach is that you aggregate multiple records into larger groups. You remove information about individuals and publish only data about the group, or on group level. However, even with aggregation, you can reconstruct data about individuals. As you can see, neither of these two approaches is privacy-safe, and both researchers and the many infamous examples from the business world that we've seen, like the Netflix privacy breach case, have shown that both approaches can be de-anonymized.

What's more, due to the poor privacy-utility trade-off of legacy anonymization, you really destroy the utility of datasets, which makes them more or less useless for advanced analytics, AI, and many of the more sophisticated data tasks. Number six, coming to synthetic data. Synthetic data generation is a sophisticated approach to data anonymization that keeps the privacy of your data subjects perfectly protected. One of the big benefits of synthetic data, besides being way more accurate than traditionally anonymized data, is that the traditional privacy risks of those legacy anonymization approaches are not present in synthetic data.

Only more manageable ones remain, like attribute inference or membership inference. A good-quality synthetic data generator like MOSTLY AI's synthetic data platform takes care of those manageable privacy risks with all the additional privacy checks and features, like rare category protection, prevention of overfitting, and everything that's necessary to really generate high-quality but also fully anonymous synthetic data. Finally, number seven, let's summarize what we know about differential privacy.

To put it very simply, differential privacy is the idea that one person should not influence the result of a query on a specific dataset, no matter whether this person was actually present in the dataset or not. How large or small the influence of this specific individual is can be measured and can be communicated in a mathematical privacy guarantee. Watch out: guarantee in that case doesn't mean that it actually guarantees privacy protection. As Klaudius explained, the epsilon value is crucial to determine how large the impact of said individual is.

While academics recommend an epsilon value below one, what they found when looking at well-known businesses like Apple and others who use differential privacy in practice is that many of those have epsilon values of 4, 8, or even 11, which definitely can't be considered a strong or even sufficient privacy protection. Just because it says differentially private, don't believe that it's automatically protecting your privacy or that of your customers. Nonetheless, differential privacy is a very important concept that led to a paradigm shift in how privacy pros think about privacy.

While we at MOSTLY AI have not formally implemented differential privacy as a concept, it's at the core of how we think about privacy in relation to synthetic data generation. That's it. Those were our main takeaways from this wonderful episode with Klaudius. Of course, as always, if you have any questions or comments about the data science perspective on privacy and specifically about synthetic data privacy, just drop us an e-mail at podcast@mostly.ai. Until then, see you next time.
