Fair synthetic data and ethical algorithms – the fairness conversation with Paul Tiwald, Head of Data Science at MOSTLY AI

Jeffrey Dobin: Welcome to the ninth episode of the Data Democratization Podcast. I’m Jeffrey Dobin, privacy expert and lawyer from Duality, a privacy-enhancing technology company that helps others collaborate using homomorphic encryption and data science. I’m joined by my talented cohost, Alexandra Ebert, the Chief Trust Officer at MOSTLY AI, the category-leading synthetic data company. As you’ll see in just a moment, this episode is a little different but also a lot more special.

Alexandra Ebert: Today, someone from MOSTLY AI will join me on the podcast. Paul Tiwald leads our amazing data science team, and he was also employee number one, joining MOSTLY AI right after it was founded. Paul is originally a theoretical physicist with enormous scientific knowledge and data science experience. He was also the brains behind the idea of fair synthetic data, and this is what we’ll focus on today.

Jeffrey: I remember Paul speaking as a guest at an earlier meetup late last year about fair synthetic data. One of the topics that came up in that meetup and is getting a lot of attention right now in the media is bias and ethical AI. There’s a lot of attention on this for a number of reasons, as we’ll see in this episode. I can say Paul is one of my favorite Mostlies, we go way back, and I can’t wait to see how this episode unfolds. Alexandra and Paul, over to you.

Alexandra: Hi, Paul, it’s great to have you on the show today. You’re the first MOSTLY who joins me for the podcast. Today, I would love to talk about ethical AI and fair synthetic data with you, but before we jump into fairness, can you share with me and the listeners how you ended up at MOSTLY AI?

Paul: Yes, sure. Thanks, Alexandra, for having me on the podcast. It’s great, and I’m happy to share my story, the story of ethical AI, with all our listeners. How I came to MOSTLY AI, that’s a pretty straightforward story, I would say, because I was working with one of the founders, Roland, together at a company. As soon as MOSTLY AI was founded, I switched over to MOSTLY together with Roland, so I was the first employee, and it has been really a wonderful journey and great fun to see MOSTLY AI grow from starting off with four, so three founders plus me, and then now coming up to, I think, almost 40 people currently.

Alexandra: Yes, that’s really impressive, and I think you even hold the record for the most consecutive months as employee of the month, right from the beginning.

Paul: Yes, that was not too much of a challenge at that time.

Alexandra: You’re still doing great, so no worries about that. One thing that I really enjoy about being part of MOSTLY AI is that everybody within the team is just so passionate about our mission. Can you share with the listeners what makes you most passionate about working for MOSTLY AI?

Paul: Sure. I would say it’s at least two motivations that drive me the most. On the one hand, there is the privacy topic and the privacy reasoning. What we do intrinsically makes sense to me. What we do taps the incredible resource of data that you need in order to make data-driven decisions. On the other hand, we provide a tool or the means to do that in a privacy-preserving way because, like many other people, I have a mobile phone, and as such, I’m a customer of many apps and companies that produce those apps.

In the end, I would like to help them to offer me a better service and to improve their product, and derive findings from the data that I give them. That’s the one side, but on the other hand, I don’t want my privacy to be breached and exploited, and I want transparency. I want to know what I give to them and what they do with the data.

Our product fits in there nicely because it allows for tapping this incredible amount of information in a privacy-preserving way, offering transparency to customers, for example. That’s the one part, and the other part that also motivates me is the technology behind it. We use artificial intelligence, as our name implies, to create synthetic data. The artificial intelligence field is growing, and the progress is immense, and the community is very cool and very open. Just from the research perspective, it’s really fun to work in that, and the speed of the development is really motivating.

Alexandra: Yes, I can second that. It’s always amazing to me what you and the data science team are working on and all this progress that’s made in that regard—really cutting edge what you are doing there. Also, what you described as the first point, I think that’s the current zeitgeist of people really expecting great products, personalized services, but on the other hand, also having this increased awareness about privacy and how important privacy protection actually is for them. Therefore, I also think that synthetic data is a great tool to make both things possible in the economic world, also from a research perspective.

When we started out with MOSTLY AI, you mentioned it already, the focus was privacy-preserving synthetic data and really having this tool here and bringing this to production that allows organizations to reconcile data innovation and privacy, but then fairness entered the game, and you had, I think it was back then in Christmas vacation period 2019, the idea of fair synthetic data. How did you get this idea, and can you explain to those listeners who have never heard of fair synthetic data what the concept is about?

Paul: Sure. Yes, you’re right. It was Christmas break 2019. How we came up with this idea was essentially a book co-authored by Aaron Roth. His name is very present in the privacy domain. He co-authored with Cynthia Dwork at least one paper on differential privacy, so that’s what he’s heavily involved in. He co-authored the book The Ethical Algorithm. Before this Christmas break, I saw the announcement of the book on Twitter, and I ordered it. I was reading it, and that was the first time I came into contact with the topic of fairness. Chapter one of the book literally is privacy; chapter two is fairness.

The thinking was immediately triggered, “Okay, now that we have privacy, what about the other topics?” Obviously, there are more chapters in the book that we also might want to explore in the future. Fairness was chapter number two, and we started discussions internally if we could not only create private but also fair synthetic data. The general idea behind it is that we synthesize the data. We create them from scratch, and we create them with artificial intelligence, so it must be possible to form and shape the data to our needs or to the needs of our customers. When we do that, why not build fairness into the algorithm to have not only private but also fair synthetic data?

One important concept regarding privacy is that we solve the privacy issue at the root. When you use older or classic anonymization techniques like masking, obfuscation, and so on, you have to do that throughout the pipeline, and you have to keep track of all the changes and all the points or all the people and organizations you share the data with.

When you do synthetic data, you solve the problem at the root because there is this barrier between the original and the synthetic data. There is no direct link between data subjects in the original and synthetic data. Once you have synthesized the data, there is no way that any original data will be changed or shared with any other party. That’s what I mean by solving the issue at the root.

In any downstream task, you don’t have to care about privacy. You can just take the synthetic data, train your machine learning model on it, and your machine learning model will be private by design because it only received private synthetic data. The ultimate goal for fairness is the same. We would like to solve the fairness issue at the root. You synthesize the data, it is private, it is fair, and in the downstream task, you do not have to take care of fairness because it’s implicitly in there. Your machine learning model will be fair with respect to the fairness definition that you applied in the algorithm. That’s the concept behind it and the beauty behind it: to have the possibility to shape and model the data in a way that is needed.

Alexandra: Since you mentioned modeling the data in a way our customers need it, I would even put it one step higher and say modeling the data in the way you would like to see the world. In a more balanced and non-discriminatory way. Because we all know that society and humans are biased to a certain extent, and then, of course, all data sets reflect these biases, so I think it’s a super promising concept. For a business that hasn’t investigated the challenges of the bias in algorithms so far: what makes it so challenging to get bias out of an algorithm if you don’t already do it at the root level as you described by inputting fair data? What’s so challenging about removing bias?

Paul: That’s a good question. First of all, it’s hard to detect bias, and unfairness can easily come through the back door. You won’t expect it. There are naive solutions that just won’t work. The best example is when you say, if I want to make sure that we don’t discriminate against people of a certain ethnicity, for example, then we just drop the column ethnicity or the feature ethnicity from our data set. But unfortunately, that is not helping at all because all the biases that are in the data are heavily interlinked with potentially any other column that is in the data set. For example, being of a certain ethnicity increases the chance of you living in a certain area. You drop ethnicity, then you have to drop zip code, and the list goes on. It’s essentially impossible to disentangle bias.
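The proxy effect Paul describes can be illustrated with a small sketch. Everything here is an assumption made up for illustration: the toy records, the group labels "A" and "B", the zip codes, and the very simple "approve if the zip code's historic approval rate is at least 50%" rule. It is not taken from any real data set or from MOSTLY AI's tooling. The point is only that a rule that never sees the sensitive column can still treat the groups very differently.

```python
from collections import defaultdict

# Toy records: (group, zip_code, approved). All values are invented.
# Group membership is strongly tied to zip code, and historic approvals favor group "A".
records = (
    [("A", "1010", 1)] * 70 + [("A", "1010", 0)] * 10
    + [("A", "2020", 1)] * 10 + [("A", "2020", 0)] * 10
    + [("B", "1010", 1)] * 10 + [("B", "1010", 0)] * 10
    + [("B", "2020", 1)] * 20 + [("B", "2020", 0)] * 60
)


def fit_zip_only_model(rows):
    """Learn one decision per zip code; the group column is ignored entirely."""
    totals, approvals = defaultdict(int), defaultdict(int)
    for _group, zip_code, approved in rows:
        totals[zip_code] += 1
        approvals[zip_code] += approved
    # Approve a zip code if its historic approval rate is at least 50%.
    return {z: approvals[z] / totals[z] >= 0.5 for z in totals}


def approval_rate_by_group(rows, model):
    """Apply the group-blind model and report the approval rate per group."""
    totals, approvals = defaultdict(int), defaultdict(int)
    for group, zip_code, _approved in rows:
        totals[group] += 1
        approvals[group] += model[zip_code]
    return {g: approvals[g] / totals[g] for g in totals}


model = fit_zip_only_model(records)
rates = approval_rate_by_group(records, model)
print(model)  # decisions depend on zip code only
print(rates)  # yet the approval rates differ sharply between the groups
```

Because the sensitive column was dropped, the gap in `rates` could not even be measured without bringing the group labels back in, which is one reason for keeping that column available.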

Alexandra: I think there was also this paper or several papers showing that these proxy variables could still introduce bias. The recommendation was to leave the sensitive attributes in there because it makes it at least easier to counteract and correct.

Paul: Exactly, that is the general notion that having this column in place actually gives you a handle to mitigate unfairness. I would say that’s the biggest challenge. As you mentioned before, this historic bias that is in the data reflects that our society is as it is. We call these biases.

Unfortunately, they’re deeply rooted, and they come through many back doors. I think that’s the trickiest part. That’s the reason why we need algorithmic fairness and proper fairness definitions to at least approach the problem. To find solutions step by step.

Alexandra: Fairness definition is another good keyword. What is actually fair, and how can you define it in a mathematical way? Why do you need to define fairness?

Paul: That is a question that is really hard to answer. There is no unique fairness definition that everyone agrees to, to put it frankly. There is no one and only suitable fairness definition. That’s the most pressing question in the fairness research currently. There are plenty of mathematical definitions that can be used and that can be integrated into algorithms. Some of these definitions even contradict each other. So, if you have fairness definition A and fairness definition B, you can, for example, mathematically prove that you cannot fulfill both of them at the same time.

Alexandra: You have to decide.

Paul: Yes, you have to decide. In the beginning, decision-makers have to sit together and, for every individual problem, define or agree on a fairness definition that is appropriate for the use case and the problem they want to solve. There is this beautiful example of a two-year-old and a four-year-old fighting for six pieces of chocolate. The two-year-old would say: “We share 50/50, three pieces each,” but the four-year-old might say, “I’m two times as old as you, so I deserve four pieces, but you only get two. I’m also bigger, and my body will take twice as much.” Yes, both are valid points of view.

Alexandra: The second one reminds me a little bit of my brother, who was always triple hungry and tried to get as much food as possible.

Paul: Obviously, a bit oversimplified, but it illustrates nicely how different views change the definition of fairness, what is fair, and how it should be implemented.

Alexandra: That’s also a challenge we have on a societal level. Different cultures might have different definitions of fairness. It’s both a challenge for ethical and fair AI, but also a chance because it forces us as a society to have this discussion on a detailed level as we might never have had it before. To really think about what outcome we would consider fair and what wouldn’t be fair, to then be able to implement it in the algorithm.

Paul: You’re absolutely right. Having this discussion might bring different cultures together in the end. Now, this is very high level and ideological. It may take a long time. Sitting together discussing those things, and then from a scientific point of view, putting this into the language of math is going to be an interesting and nice challenge, and nice to see what the outcome will be.

Alexandra: Absolutely, we’re all excited to see how this discussion will continue, and of course, we want to shape it and provide input. When we published our fairness series on the blog, it was a few months after you started the research of fair synthetic data. You showed that you could pick a fairness definition and then create synthetic data that satisfies that fairness definition. I think the interest in fair synthetic data and its role for ethical AI can also be seen by all the magazines like Forbes and IEEE Spectrum, and I think Andrew Ng and, now recently, the ICLR AI conference picked up this topic. The interest in fairness and ethical AI is growing and growing.

One question that we sometimes get from reporters: back then, you worked with the US census data set and showed that you could create statistical parity and an outcome where the income of female data subjects is equal or nearly equal to the income of males. Why can’t you simply take a data set and give every female person in this data set a $20,000 raise per year to fix the problem? Why do you need this complicated process of fair synthetic data?

Paul: As we mentioned before, the beauty in fair synthetic data is that we can synthesize the data and shape it the way we need it. When you naively increase the fraction of female high earners, then you destroy the quality of your synthetic data. You introduce new biases because you cannot, or at least a human cannot, monitor all the other features and estimate the impact that such a transformation will have on them. Essentially you might introduce new biases. You definitely deteriorate the accuracy of your synthetic data. This will harm and degrade the performance of machine learning models downstream.

Alexandra: To give our listeners a practical example, imagine I would give every female a $20K raise per year without changing the type of car that they drive, their shopping behavior, how much they spend, and so on. It wouldn’t make sense, and it would destroy the utility of the data.

Paul: Exactly. Also, regional aspects. The people in the countryside typically earn less, and then suddenly, there would be very high-income females living in the countryside. That’s just not plausible. When you do it in an algorithmic way, you make sure that you keep those balances, and you keep those correlations intact while balancing the aspect of high earners between males and females.
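The point about the naive raise can be made concrete with a small sketch. The toy population, the income figures (in thousands per year), the high-earner threshold of 50, and the helper names are all hypothetical and chosen only for illustration; this is not the census data or the method discussed in the episode. The sketch measures the statistical parity gap, defined here as the share of male high earners minus the share of female high earners, before and after a flat raise for every woman.

```python
HIGH_EARNER_THRESHOLD = 50  # thousands per year; a hypothetical cut-off


def rows(gender, region, income, count):
    """Helper that builds `count` identical toy records."""
    return [(gender, region, income)] * count


# Invented population: rural incomes are lower, and fewer women are high earners.
people = (
    rows("m", "urban", 60, 30) + rows("m", "urban", 40, 20)
    + rows("m", "rural", 60, 10) + rows("m", "rural", 40, 40)
    + rows("f", "urban", 60, 15) + rows("f", "urban", 40, 35)
    + rows("f", "rural", 55, 5) + rows("f", "rural", 35, 45)
)


def high_earner_rate(population, gender, region=None):
    """Share of high earners for one gender, optionally within one region."""
    selected = [
        p for p in population
        if p[0] == gender and (region is None or p[1] == region)
    ]
    return sum(p[2] > HIGH_EARNER_THRESHOLD for p in selected) / len(selected)


def parity_gap(population):
    """Statistical parity difference: male rate minus female rate."""
    return high_earner_rate(population, "m") - high_earner_rate(population, "f")


# Naive "fix": a flat 20K raise for every woman, with nothing else adjusted.
naive = [(g, r, inc + 20 if g == "f" else inc) for g, r, inc in people]

print("gap before:", parity_gap(people))
print("gap after naive raise:", parity_gap(naive))
print("rural men, high earners:", high_earner_rate(naive, "m", "rural"))
print("rural women, high earners:", high_earner_rate(naive, "f", "rural"))
```

In this toy setup the flat raise does not close the gap but flips it, and every rural woman ends up a high earner while most rural men do not. That is the implausible regional pattern described above, and a new bias in the opposite direction.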

Alexandra: Yes, so you’re basically optimizing for two things. One thing, as always with our technology, accuracy, but also, in this case, the fairness that is satisfied so that you have super useful but still fair data.

Paul: What you just mentioned is exactly what we do from an algorithmic approach. We train generative neural networks, and one objective is to have the highest accuracy possible with the privacy constraints built in there. But then we have a second objective, which is the fairness loss. Both are optimized together in a joint way.
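A joint objective of this kind can be sketched in a few lines. The episode does not spell out the actual loss functions, so everything below is an assumption for illustration: the accuracy term is a standard binary cross-entropy, the fairness term is a simple squared gap between the average predicted probability per group, and `fairness_weight` is a hypothetical knob that trades one against the other. It is not MOSTLY AI's implementation.

```python
import math


def accuracy_loss(probs, targets):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    eps = 1e-12  # guards against log(0)
    return -sum(
        t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
        for p, t in zip(probs, targets)
    ) / len(probs)


def parity_penalty(probs, groups):
    """Squared gap between the highest and lowest group-average prediction."""
    by_group = {}
    for p, g in zip(probs, groups):
        by_group.setdefault(g, []).append(p)
    means = [sum(values) / len(values) for values in by_group.values()]
    return (max(means) - min(means)) ** 2


def joint_loss(probs, targets, groups, fairness_weight):
    """Accuracy and fairness combined into one number to minimize."""
    return accuracy_loss(probs, targets) + fairness_weight * parity_penalty(probs, groups)


# Toy batch in which the historic outcome is perfectly aligned with the group.
groups = ["f", "f", "m", "m"]
targets = [0, 0, 1, 1]
biased = [0.1, 0.1, 0.9, 0.9]    # accurate, but reproduces the group gap
balanced = [0.5, 0.5, 0.5, 0.5]  # no group gap, but less accurate

for weight in (0.0, 1.0):
    print(weight,
          joint_loss(biased, targets, groups, weight),
          joint_loss(balanced, targets, groups, weight))
```

With the fairness weight at zero the biased predictions score better; with the weight at one the balanced predictions do. In an actual training loop both terms would be minimized together by gradient descent, and the optimizer would look for outputs that stay accurate on everything the fairness term does not constrain.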

Alexandra: That’s really a super interesting concept, and we see growing demand from the market, not only in the private sector but also in the public sector. On the one hand, sharing data that can be used by SMEs, by startups, but on the other hand, ensuring that their perspective of fairness or the society’s perspective of fairness is satisfied—really looking forward to the adoption of this and how this will evolve.

All the talk about ethical AI nowadays with the new proposed AI regulation on a European level also highlights and emphasizes how important it is to have fair and debiased datasets. Do you have any practical tips for businesses that want to move towards a more ethical AI and data strategy? Any practical tips for data scientists? What should they do in their day-to-day practice to develop more ethical algorithms?

Paul: That’s a pretty tough question. I think the most important step that can currently be taken is to create awareness in the data science teams, also be aware personally, and read about the topic. There are tools coming up that help you analyze your data and see if there are biases in there. There is still a lot of manual work involved, but you get the feeling. As I said, being aware and informing yourself is essential. The field is still new, there are a lot of things going on, and a lot of change is involved. That would be my number one piece of advice.

Alexandra: Yes, that’s a super important point. What I also find promising is all those organizations that have installed ethical AI committees within the company that not only consist of data scientists but people from different departments. Ethical AI committees sometimes also include other stakeholders, the customers, and so on, have them involved in the product development or algorithm development process right from the beginning. I think ethics is just such an important topic that not one single person should be responsible for it. It’s super important to have a broad and diverse group of people engage in this discussion and be aware of it.

Paul: Exactly, it’s similar to fairness. Fairness as a sub-topic of ethics says different people have different points of view, have different experiences, and every input helps. Every pair of eyes and ears helps to be aware and to be alert to contribute, balance, and catch unfairness and biases in the data.

Alexandra: We should definitely ask our listeners to educate themselves more about fairness, start researching this topic more after they’ve listened to this episode because the more people can join the discussion, the better our chances as a society to really use AI in a way that benefits us as humanity, but also is fair according to definitions and understandings of the biggest possible group of people so that it’s really a more inclusive discussion. I think this would be nice to see.

Paul: Yes, definitely. I can fully subscribe to that.

Alexandra: Wonderful. Then thanks a lot, Paul, for your time, looking forward to seeing this episode go live. Talk to you soon.

Paul: Yes, thanks. Goodbye, everyone.

Jeff: It’s fascinating to catch a behind-the-scenes glimpse of the work that goes into creating fair synthetic data, and there is a lot to unpack here. Let’s review the most important takeaways from today’s conversation.

Alexandra: Sure, Jeff, I’ll start. Once our team at MOSTLY AI tackled the issue of privacy and could synthesize privacy-preserving synthetic data that made training machine learning models private by design, we decided to create the same for fairness: to make machine learning fair by design. Fair synthetic data can make machine learning models implicitly fair with respect to the fairness definition you applied to the algorithm.

Jeff: Why is it challenging to eliminate bias? Removing features like ethnicity doesn’t eliminate the bias. Due to the presence of proxy variables, algorithms can still become skewed. Plus, if you remove all sensitive attributes, it destroys data utility. It’s similar to the challenge around privacy. You could remove personally identifiable information from a dataset, but that doesn’t make it private. Non-direct identifiers are still there, and that’s leaking privacy.

Alexandra: In fact, problematic features like gender and ethnicity should be kept in the data set. According to the latest fairness research, they help in identifying and mitigating the bias. We need algorithmic fairness, and that brings up a very important question. How do you define fairness?

Jeff: As Paul mentioned earlier, there isn’t one standard definition of fairness. It can mean different things to different people and different organizations. That means that we need to define fairness for every specific problem and use case separately.

Alexandra: Once you have a definition that works for that specific use case, you need to synthesize data to satisfy this fairness definition. The beauty of this approach is that it keeps correlations intact and removes bias simultaneously, so you have super useful data that is fair by design.

Jeff: In other words, you can synthesize your data to optimize both for accuracy and for fairness. The first step of this process is to create awareness, and that’s part of what we’re doing right here with this podcast. Whether you are a data scientist or a business decision-maker, you need to be aware that bias can be introduced in your AI systems and that you must proactively address it head-on.

Alexandra: Awareness is indeed one of the most important ingredients. I’ll just add diversity and inclusion. It’s important to have these discussions about fairness and AI ethics with a broader group of people from different backgrounds. One approach we see in practice, which looks quite promising in organizations, is installing permanent ethical AI committees consisting of a diverse group of people. These AI committee members are involved in digital product and algorithmic development projects right from the beginning.

Jeff: I really enjoyed this conversation, and I hope all of our listeners did too. I encourage you each to reach out to us with your opinions or your questions about fairness. This is a hot topic and something we’re all passionate about talking about. Specifically, if you want to get any questions answered by Paul or the team, please send over a voice message to podcast@mostly.ai, and we’ll get you an answer from one of the best fairness folks on the planet. See you next time.

Alexandra: See you!
