Episode 40

Synthetic data beyond privacy: data augmentation powered by AI

Hosted by
Alexandra Ebert
In the 40th episode of the Data Democratization Podcast, Alexandra Ebert, MOSTLY AI's Chief Trust Officer, talks to Mario Scriminaci, Chief Product Officer of MOSTLY AI, about the future of synthetic data technology and how synthetic data has already moved beyond the data privacy use case to advanced data augmentation possibilities, like rebalancing, data simulations, and data imputation. Tune in to learn more about:
  • Synthetic data for privacy and compliance 
  • Generative AI for tabular data  
  • Controlling Generative AI outputs 
  • Rebalancing for AI fairness 
  • Smart imputation 
  • The value of synthetic data 
  • Creative AI generation 
  • The boundaries and future of generative AI 
  • Synthetic data simulations
  • Why a high level of data granularity is important

Transcript

[00:00:09] Alexandra Ebert: Hello and welcome to the Data Democratization Podcast. This is already our 40th episode. To celebrate, I invited a special guest for you: MOSTLY AI's Chief Product Officer, Mario Scriminaci. Today, Mario and I are going to talk about the future of synthetic data. If you're a regular listener of this podcast, you are well aware that synthetic data is one of the most promising and powerful privacy protection technologies out there. But what does the future of synthetic data hold? What lies beyond privacy?

If you're curious about that, this is the episode for you. Some of you might even remember the key takeaway of last year's Gartner AI & Big Data Summit, where Gartner stated that, increasingly, the most valuable data will not be the data that we collected, but the data that we are going to create. This is what synthetic data allows you to do. In this episode, you will learn more about AI-powered data augmentation with synthetic data, smart imputation, and how synthetic data can help you increase the diversity that you have in your data.

For example, through rebalancing, or by using different generation modes when creating synthetic data. But of course, there's much, much more in the future of synthetic data. I couldn't resist asking Mario for his predictions on which direction the synthetic data industry is moving in. There will be plenty of things to take away from this episode. With that said, I would suggest we dive right in.

[music]

Welcome to the Data Democratization Podcast, Mario. It's such a pleasure not only to have another MOSTLY AI colleague but particularly you, our Chief Product Officer, on the show. Before we kick off today's discussion, as always the first question is: can you introduce yourself to our listeners, and also share a little bit about what makes you so passionate about the work you do at MOSTLY AI?

[00:02:09] Mario Scriminaci: Thank you, Alexandra, for having me. Finally, I'm here since the-

[00:02:13] Alexandra: We made it.

[00:02:13] Mario: We made it finally. Let me introduce myself for everyone. I'm Mario Scriminaci. I'm an Italian guy who lives in Milan, Italy, even if I was born and raised in Sicily. I've always worked in computer science, and today I'm a technical product manager. My background is in data science. I started as a computer scientist back then, in academia at first, but then I moved to industry because I really believed it was the best way of applying my machine learning and AI skills to real problems with real people.

From the beginning, I collaborated with a lot of B2B organizations, trying to use AI and machine learning to improve the lives of their customers. I focused mainly on recommender systems and personalization systems for a long time. My goal was to know as much as possible about end users, trying to understand what they like and what they're doing with the platform, and then customizing their experience, not only in the selection of the right content at the right moment but personalizing the whole user experience.

That, of course, was always a trade-off between knowing as much as possible about the users and, at the same time, protecting their privacy. I was struggling and trying to find the balance between the two aspects, but there was no balance. Now that I'm actually on the other side, helping organizations thrive with data through synthetic data, of course, I know the answer. Back then it was difficult to identify what was the best way of doing it. Two years ago, actually a little bit more than two years ago, I started investigating privacy-enhancing technologies and I stumbled upon synthetic data.

Generative AI synthetic data turned out to be the best solution for me. I actually started using MOSTLY AI, and while I was using MOSTLY AI, Tobi, the CEO, called me and, without knowing it, it was the perfect thing at the perfect time to join the MOSTLY AI family. I really believe that generative AI synthetic data can improve the way any organization out there uses data while preserving people's privacy. You can be sure that you extract as much value as possible from data while preserving the information and keeping your customers safe.

[00:04:40] Alexandra: We can consider ourselves lucky that you had to go through these frustrations of getting access to granular data and facing the privacy challenge, because otherwise we wouldn't have had you, and what an awesome colleague you've proved to be over the past two years. Today I invited you because I wanted to discuss with you a very exciting new field in the space of synthetic data. As you know, our listeners are well aware of synthetic data and the essential part it plays in privacy protection and making data accessible. Today we want to talk more about this new era of generative AI for tabular synthetic data. As a very quick, high-level introduction: how can our listeners think of this, and why is this so exciting?

[00:05:23] Mario: If you think about it, for years we had in our hands a model that was capable of understanding a customer base so well that it was able to recreate that customer base by creating synthetic subjects. The reality is that this was only a thin layer of the possibilities of having a generative model. Think about all the recent developments in generative AI, thanks to ChatGPT, Stable Diffusion, and everything that is around the possibility of having a model which is so strong.

You can actually create new value starting from the information that you build into the model. This is also true for our generative AI model. I think generative AI synthetic data can thrive in the sense of creating new value for our customers, with different possibilities from data augmentation to data diversity. I'm sure we are going to discuss those topics in a moment.

[00:06:25] Alexandra: Absolutely, we will do that. Since you just mentioned it's similar to ChatGPT, I assume some of our listeners are now wondering: is MOSTLY AI now providing some chat functionality like the media darling ChatGPT has become over the past few months? What exactly is the difference between our approach and ChatGPT, and maybe also some of ChatGPT's shortcomings that are already widely discussed in the media?

[00:06:51] Mario: If you think about it, ChatGPT is a very clever interface to a very powerful model; it's the capability of actually exposing a generative AI model in text form. Now of course, with the evolution that we had thanks to GPT-4 and multi-modality, it is even possible to interrogate the system not only with text but even with images and so on and so forth. The reality is that, of course, we are not ChatGPT, we are not exposing the same interface, but we are actually exposing a different value for the end customers.

One of the main problems with ChatGPT being based on a very large language model, GPT-4 in this case, is that the knowledge that is in there is undisclosed to you. You don't know what sources have been used. You don't even know when the model is providing truthful information or something that is actually false. There have been a lot of comments and, in general, worries around the industry about many cases where ChatGPT was actually providing a false statement, or a false story, or a false indication of what reality is at the end of the day.

What we want to avoid is having this black box that is unknown in terms of what the source of the data is, the truth of the data, the ground truth basically, and the outcome. Of course, there is a lot of potential in using an interface like that, but human supervision will always be needed. Anyone that is using ChatGPT right now can understand what I'm speaking about. The supervision must be there and is, of course, really important. Now think about applying all this generative AI to your customer data.

You would like to generate something that you can reuse for training your machine learning algorithms, for creating dashboards, for data analysis, or even to use the data for, I don't know, providing some explainability to your model and so on and so forth. It is fundamental to be sure that whatever the generative AI is outputting is coming from real sources, from real data, and that you can control it and know about it.

The difference from generative AI based on very large language models, or even generative AI built on big neural networks, is that you will actually be the one to decide what goes in, and you will be able to control what the output is. MOSTLY AI is actually providing you the possibility of having a generative AI tool in your organization that is built on your own data. You'll be definitely sure about the outcomes.

[00:09:47] Alexandra: It's basically then also targeted towards humans and not so much towards machines, to help them better understand what they have in their data and make more of the data that they already have?

[00:10:00] Mario: Definitely. If you think about it, right now, every time you want to discover a new user behavior or a new possibility for your product, you want to look at the raw data in the end. High-level aggregations are helpful, but at the end of the day it is seeing a user session, seeing what they are actually buying, seeing the raw data, that helps you understand what's really happening. Thanks to generative AI synthetic data, and thanks to MOSTLY AI, you can actually see the raw data without actually looking into the homes of the people who are using your service.

[00:10:36] Alexandra: That's definitely a benefit. You already mentioned it in the introduction of these generative AI features: what specifically can our listeners think of when we talk about rebalancing data, which is one of the new features? Why is this important? Which problem does it solve?

[00:10:55] Mario: One of the main problems while doing data analysis or data discovery is that you cannot change the real data. The data is there. You can try, of course, to create cohorts of users to understand their behaviors and to try to imagine how they will behave in the future. It's still very limited because you really cannot change the reality. But once you have a generative AI model that actually understands the correlations between all the user information attributes and, of course, the behaviors they have,

and what business rules connect all the different variables, you can actually ask the generative AI to change the distribution of the customer base. Imagine, for example, that you want to create a what-if scenario where you want to analyze how your customer base would look if you changed, for example, your target age groups. You might want to verify: if I target older people or younger people, how would my customer base look in terms of, for example, economics, in terms of churn risk, and so on and so forth?

Thanks to rebalancing, you can now do that. Because once the generative AI learns the business rules based on the real people, the real data, you can actually ask it to create a new population that will respect the distribution that you want. You can shape your synthetic customer base by asking the generative AI to rebalance, for example, age groups or gender, and create more diverse data at the end of the day.

[00:12:37] Alexandra: Makes sense. Putting myself in the shoes of our listeners, maybe for some this sounds like having an AI-generated crystal ball in your pocket. How can you actually come up with new data? Is it creating data out of thin air and then analyzing this to figure out if it's profitable to enter a new customer segment? How does this all work in more detail?

[00:13:01] Mario: Of course, it is not. No generative AI model will ever create anything from thin air. Even the most famous generative models out there are trained on real data. They are basically reprocessing the information that you put in, in what seems from a user perspective kind of a magic way, but at the end of the day it's recreating and restructuring information they already have. The same goes for the data augmentation and data diversity features of our platform.

The model actually learns all the correlations based on the information you have, so you can take, for example, a minority and oversample that minority of your dataset, but it will always be impossible to generate data for, for example, cohorts of users that were never seen in your data. It's possible, of course, to create a big sample from a little data, but it will be impossible to build from nothing something that the model has never seen.

[00:14:02] Alexandra: Understood. Basically, for this what-if scenario that you described earlier, to understand how my customer churn would look if I had, let's say, more 30 to 40-year-old customers, and currently I only have, let's say, 20% of them. You wouldn't come up with completely new additional customers, but you would learn from the 20% that you already have and create, let's say, 60% if this is the what-if scenario that you want to explore.

[00:14:27] Mario: Of course, but you will actually exploit information that is even present in other cohorts of users. If you are going to go, for example, from 20% to 60% of users in a specific age group, you'll actually exploit the information that you have about the other age groups, for example about country distribution, or, I don't know, other variables like income classes and so on and so forth, to create representative subjects. It'll be 60%, but it will be representative of your customer base with a specific age group in mind.
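To make the rebalancing idea concrete, here is a minimal sketch in Python of what shifting an age-group distribution for a what-if analysis can look like. It simply resamples rows of a toy dataset to hit a target mix; a generative approach like the one described above would instead synthesize new records that follow the target distribution while preserving the learned correlations. All column names and numbers are made up for illustration.

```python
# Toy sketch of what "rebalancing" means for a what-if analysis.
# This simply resamples real rows to hit a target age-group share;
# a generative approach would instead synthesize brand-new records
# that follow the target distribution while keeping the learned
# correlations with the other attributes. Column names are made up.
import pandas as pd

df = pd.DataFrame({
    "age_group": ["20-30"] * 20 + ["30-40"] * 20 + ["40-60"] * 60,
    "churned":   [1] * 8 + [0] * 12 + [1] * 4 + [0] * 16 + [1] * 12 + [0] * 48,
})

print(df["age_group"].value_counts(normalize=True))   # original mix

# What-if scenario: 60% of customers in the 30-40 group.
target = {"20-30": 0.2, "30-40": 0.6, "40-60": 0.2}
n = len(df)
rebalanced = pd.concat(
    [df[df["age_group"] == g].sample(int(share * n), replace=True, random_state=0)
     for g, share in target.items()],
    ignore_index=True,
)

print(rebalanced["age_group"].value_counts(normalize=True))  # target mix
print("churn rate, original:  ", df["churned"].mean())
print("churn rate, rebalanced:", rebalanced["churned"].mean())
```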

[00:15:03] Alexandra: Okay, I understand.

[00:15:03] Mario: It will be able to derive something that the human eye would not be able to do or to understand at a glance, but of course having a model that understands all the correlations between all the attributes makes it possible.

[00:15:17] Alexandra: This is then also the answer to why it wouldn't be sufficient to just assume in your analysis that you had more of the 20 to 30-year-olds; it's really helpful to have it as plain granular data. Because it's obviously not only the age that this cohort has in common, but there are many different cohorts within this age cohort of, I don't know, people who are outdoor lovers, people who love fine arts, fine dining, something like that. All of this will be learned from the entire population of your customer base and therefore will also change many other aspects which might be relevant for your what-if scenario?

[00:15:52] Mario: Exactly, and that's the beauty. Being able to actually exploit the whole correlation information that you have in your customer base to do these what-if scenarios with just the click of a button.

[00:16:05] Alexandra: Then you also mentioned another topic in the context of rebalancing which you know I'm very passionate about, which is AI fairness and diversity in data. Can you dive a little bit deeper into how rebalancing can help us in the context of fairness and diversity of data?

[00:16:21] Mario: Of course. I'm really passionate about understanding how generative AI synthetic data can help take fairness into consideration. Of course, there is now a huge need for any organization out there to be sure that whatever outcome they are producing from any machine learning task or analysis is actually fair for the end user.

I believe, of course, that there are things that generative AI synthetic data can do and things that I believe we should not do, or that are not in the realm of possibilities. One thing that generative AI synthetic data can definitely do, thanks to rebalancing and data augmentation, is provide more samples of the minority classes.

Definitely, that is one of the biggest obstacles when you are actually analyzing data. Most of the time, you don't have enough samples to reason on the data, to actually understand what the characteristics are, what the elements are that you have to understand and weight, and which will actually be important to whatever model, by analyzing the feature weighting and so on and so forth. That is essential.

Any data scientist analyzing a model for fairness needs to actually understand how the customer base is behaving and what the parameters in the information are. On the other hand, of course, we have to be sure that we keep those biases in the data, if there are any. Otherwise, the data scientist will not be able to understand them and eventually correct them in the final model. There have been situations where, thanks to rebalancing,

and not only rebalancing but also breaking the correlation between, for example, gender and income classes, which is a very famous and unfortunate bias that we have in many datasets, we were able to train a downstream task that was actually fair with respect to classifying, for example, credit scores. There have also been situations where this actually didn't improve the fairness metric that we selected for the downstream task.

Because it was not possible for the downstream task to actually understand the bias and correct for it. Again, rebalancing can help you to have more samples of the minority classes, whatever the dimension is. It could be ethnicity, it could be gender, it could be income, it could be geolocation, it could be anything, and it helps you to understand the data better and explore those biases, and to be sure that you take them into consideration while creating your downstream task.
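As a rough illustration of the kind of fairness check a data scientist might run on a downstream task, the sketch below trains a simple classifier on deliberately biased toy data and computes the demographic parity difference between groups. Comparing this number before and after rebalancing or augmenting the minority class is one way to see whether the intervention helped; the data, column names, and metric choice are illustrative assumptions, not a prescribed workflow.

```python
# Minimal sketch: measure demographic parity of a downstream classifier.
# The idea is to compare this metric before and after rebalancing the
# training data (e.g. oversampling a minority group with synthetic
# records). Data and column names are purely illustrative.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
gender = rng.choice(["f", "m"], size=n, p=[0.3, 0.7])           # imbalanced group
income = rng.normal(50, 15, size=n) + (gender == "m") * 10      # biased correlation
approved = (income + rng.normal(0, 5, size=n) > 55).astype(int) # historical decisions

X = pd.DataFrame({"income": income, "is_male": (gender == "m").astype(int)})
model = LogisticRegression().fit(X, approved)
pred = model.predict(X)

rate_f = pred[gender == "f"].mean()
rate_m = pred[gender == "m"].mean()
print("approval rate (f):", rate_f)
print("approval rate (m):", rate_m)
print("demographic parity difference:", abs(rate_m - rate_f))
```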

[00:19:04] Alexandra: Sure, so definitely interesting, even though you mentioned it's of course not the silver bullet solution to AI fairness, since this is such a complex topic that you need to take into account through every step of the AI development and deployment life cycle. One other aspect when we talk about generative AI for tabular synthetic data is actually the concept of smart imputation. For those of us who are not data scientists and have never heard of imputation, what is this actually, and why is the need to impute data such a big pain for many data scientists in their day-to-day work?

[00:19:41] Mario: When I was younger and working as a data engineer, managing missing values and null values was one of the biggest pains for me. There were situations where, of course, the dataset by its nature had missing values. One of the reasons was, for example, that specific attributes were only collected after a specific point in time, and there was no information about that specific attribute for the older part of the dataset's population.

[00:20:11] Alexandra: Sorry, just for our listeners: so, for example, if a new feature was introduced which collected additional data, and this wasn't introduced-

[00:20:18] Mario: Exactly.

[00:20:18] Alexandra: -like one year back, so you hadn't.

[00:20:21] Mario: Imagine that you even change the signup form of your service, and you now start collecting, I don't know, the age of the customers or whatever information that initially was not there. You will have half of your customer base with that information and the other half of the customer base without it. The same goes for when the customer, the subject, doesn't want to provide that information to you.

It could be possible that for whatever reason I don't want to expose my age, I don't want to tell you my gender, I don't want to tell you where I live. That is, of course, their freedom, but at the same time it limits the possibility of doing analysis on your data. Cohort analysis, data exploration, and so on and so forth become tricky, because there will be a part of the population that is unknown to you in that specific dimension that you're analyzing.

Imagine that for 20% of the people you don't know their age. Imputation comes in handy because it's a set of techniques that help to fill those values in the dataset. There are different techniques, from the less advanced ones, like using the average value for numerical attributes. For example, imagine that all the people without an age are going to get the average age there, so we are going to have-

[00:21:48] Alexandra: Have lots of people who are 45 or something like that.

[00:21:52] Mario: Exactly. A very skewed distribution around the average. Or there are more advanced techniques like interpolation, or even machine learning models that can help you to fill those gaps. The problem in any case is that you will need to focus on this data engineering task even before doing your data analysis, and whatever solution you use, it's not certain that it will actually be representative of the real population. That changes thanks to the new feature that we added, which we call smart imputation.

You can actually not only create a synthetic dataset that is privacy-safe, in the sense that, of course, you will not be able to reconnect it to any of the original users, but it will actually be imputed as well. You can specify the columns where you know there are missing values that you want to fix, and the model will use the correlations with all the other attributes to create a synthetic dataset that will not have missing values there. Whatever analysis you do, starting from the synthetic data, you will not have the issue. All the attributions that you want to do will be related to the right age groups.

For example, imagine that age is still the variable that we want to fix. That would be super cool, and it actually is super cool. We demonstrated it by creating a dataset where we artificially removed this information, not just using a random distribution but actually removing it for specific cohorts of users, and the model was still able to reconstruct the original distribution, thanks to the correlations available in the other variables.
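For readers who want to see the imputation problem in code, here is a small scikit-learn sketch contrasting naive mean imputation, which piles every missing value at the average, with a model-based imputer that exploits the correlation with another column. This is generic tooling to illustrate the idea, not MOSTLY AI's smart imputation itself, and the toy data is invented.

```python
# Sketch: naive mean imputation vs. model-based imputation.
# Mean imputation piles every missing customer at the average age;
# an iterative, model-based imputer uses the correlations with the
# other columns instead. Generic scikit-learn, not MOSTLY AI's
# smart imputation; the data is synthetic for illustration.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

rng = np.random.default_rng(1)
n = 1000
age = rng.integers(18, 80, size=n).astype(float)
income = age * 800 + rng.normal(0, 5000, size=n)      # income correlated with age
X = np.column_stack([age, income])

mask = rng.random(n) < 0.2                             # 20% of ages missing
X_missing = X.copy()
X_missing[mask, 0] = np.nan

X_mean = SimpleImputer(strategy="mean").fit_transform(X_missing)
X_model = IterativeImputer(random_state=0).fit_transform(X_missing)

print("true std of age:           ", age.std().round(1))
print("std after mean imputation: ", X_mean[:, 0].std().round(1))   # shrinks
print("std after model imputation:", X_model[:, 0].std().round(1))  # closer to truth
```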

[00:23:31] Alexandra: Very interesting. Basically, this means now that the smart imputation feature can, with the click of a few buttons, quickly solve a problem that in the past took you, as a data scientist, I'm not sure how long. Is this hours, is this many minutes of very annoying work? What's the time saving that can be achieved with something like that?

[00:23:54] Mario: Definitely. One of the most common statistics is that the time actually spent on data processing is more than 50% of any ML project that you are doing. 50% of a machine learning project, without even talking about using the right model, tuning the model, and understanding the results, is a huge amount of time. This is why features like rebalancing and smart imputation can really speed up any machine learning project that you have in your organization.

[00:24:31] Alexandra: Makes sense. I think it's also a nice example of how AI can actually support humans in their day-to-day job, because obviously you still need the human expertise to know which null values should be imputed. As you mentioned with the example of birth dates and death dates in one of our earlier conversations: the death date having a null value most often is a positive thing, because it means your customers are still alive, whereas a null value in the birth dates is something you might want to fill in to have a more meaningful analysis. Quite interesting.

[00:25:05] Mario: This is the beauty of generative AI in general. It's the capability of actually injecting the domain expertise into the model to make better synthetic data, in our case, but in any case better results. If you think about it, right now people are using ChatGPT or any other generative AI out there by injecting domain expertise and asking the system to create results for them. The same goes with generative AI synthetic data. You can inject the expertise you have about your customer base and make the synthetic data better than the original data.

[00:25:43] Alexandra: Absolutely. This also reminds me of the Gartner Data & Analytics Conference I attended in the US, I think it was August 2022, where they talked about synthetic data soon being more valuable. Since it's data that you create and that you pair with your human expertise, it can become more valuable than the real data that was initially collected.

I think this is what we are now entering into and where we can see some of the very nice opportunities that open up with that. One thing that we haven't yet talked about is actually a feature I find very, very interesting, which is the generation mode. For everybody who has never heard about generation modes, what is that and why do we need it?

[00:26:30] Mario: The generation mode is one of the newest features that we added to the product and is actually one of my favorite ones. It's the capability of the generative AI to generate data that is more creative, more conservative, or simply representative. If you think about it, so far the goal for MOSTLY AI was to create synthetic data that was as representative as possible of the original data, in terms of catching all the distributions, for example the variable distributions and the correlations between attributes and so on and so forth, while at the same time, of course, respecting the privacy.

With the generation mode you can actually ask the generative AI, once the model has been trained, to generate data more in the outlier regions. You'll have more diversity in the data because you'll actually sample more diverse cases. The important thing is that the concept of outliers, or diverse cases, is actually already learned by the generative AI, because it has actually learned this multi-dimensional space and where those outliers stand.

For example, if you only have people in the classical 30 to 50 age range, everything that is lower or higher could ideally be considered an outlier. But then imagine that you have multiple variables: understanding what is an outlier, what is creative in the dataset, starts to become difficult, because it's the combination of age and country, the combination of age, country, and maybe gender information. Everything really depends on the original distributions.

Unless you analyze it row by row, it will be very difficult to understand what is creative and what is conservative. Our model does that, it already knows that, so you can simply configure the generation to be more creative, and you'll have more people in the outlier regions. Why is this cool? For example, you can bulletproof your machine learning model by presenting cases that may well happen but that you don't have in the original dataset. You can of course do the same the other way around, I mean, with more conservative data.
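One simple mental model for the conservative versus creative setting is a temperature applied to a learned distribution: lowering it concentrates samples in the most common regions, while raising it pulls more samples from rare, outlier regions. The toy sketch below illustrates only that analogy on a single categorical variable; it is not how MOSTLY AI implements the generation mode.

```python
# Toy picture of "conservative" vs. "creative" generation as a
# temperature on a learned categorical distribution. This is only an
# analogy for the setting described here, not MOSTLY AI's internals.
import numpy as np

learned_probs = np.array([0.02, 0.08, 0.45, 0.40, 0.05])  # rare values at the edges
categories = np.array(["<20", "20-30", "30-40", "40-50", ">50"])

def sample(probs, temperature, n=10000, seed=0):
    """Sharpen (T<1, conservative) or flatten (T>1, creative) the distribution."""
    p = probs ** (1.0 / temperature)
    p /= p.sum()
    rng = np.random.default_rng(seed)
    return rng.choice(categories, size=n, p=p)

for label, t in [("conservative", 0.5), ("representative", 1.0), ("creative", 2.0)]:
    draws = sample(learned_probs, t)
    share_rare = np.isin(draws, ["<20", ">50"]).mean()
    print(f"{label:14s} (T={t}): share of rare age groups = {share_rare:.2%}")
```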

[00:28:47] Alexandra: Sorry to interrupt you before we come to more conservative data. I definitely want to ask a few follow-up questions on the creative generation mode. Basically, it then helps with the robustness of your model, since you can feed in examples that weren't part of the original training data. Again, putting myself in the shoes of our listeners: you just said that synthetic data, a privacy protection and anonymization technology, can be used to create outliers. Isn't this a big privacy red flag, and how can you actually make sure that this doesn't lead to any privacy infringements?

[00:29:23] Mario: Of course, the privacy mechanisms always sit at a higher level than whatever we are doing in data generation. Our privacy mechanism ensures that our model will not learn any specific information that could help you identify individuals. This doesn't mean that you cannot have outlier regions in your space; it means that there will be more minorities. Think about minorities: being part of a minority is not per se a privacy violation, as long as you are not the only one in that minority group. This is what the creative generation mode is doing: it's taking the minorities of your dataset, those outliers, which are not single individuals, of course, but a region of the space of the possible values of the variables, and taking more samples from there.

Actually, it will not be a privacy problem, because it will still be impossible to trace anything back to single individuals. Generating more outliers in this case will actually have a benefit in terms of privacy, because there will be even less possibility to identify people, since you will have more samples than in the original dataset. Again, anything that is related to privacy is actually built in during training time. It would be impossible for the model, in any case, to generate someone that was present in the original dataset.

[00:30:55] Alexandra: Makes sense. Anonymization and privacy preservation always come first, and then you can start to play with it. How creative can you actually go? You mentioned that it would take the minorities and then start to create additional examples. Would you then also see some combinations that weren't part of the existing data? How far can it go with the creativity? Could you get, I don't know, a 90-year-old college student or something like that?

[00:31:22] Mario: Of course, it can happen. That is actually the goal, that you can generate something that was not in the original dataset at the beginning, and it really depends on the boundaries of your variables. Could you have, for example, a 90-year-old college student? Yes, most likely that would be possible. Could you have a five-year-old with a PhD?

It's unlikely, because there is a strict boundary in the distribution of the age of, for example, PhD students. Again, this kind of reasoning about the distributions is easy for us for these extreme classes, but it's very difficult to reason about all the combinations and correlations. The model can do that. It will be the model that understands where the boundaries are and stays close to the boundaries, not creating crazy values that don't make sense to see in the dataset.

[00:32:27] Alexandra: Understood. Again, it's generative AI supporting us and helping us to, for example, make a model more robust by feeding in examples that weren't part of the training dataset but are realistic enough that they could happen in the real world. It's not the goal to go overboard and have some very, very strange values like, I don't know, customers who were born in 1820, two-year-old PhDs, or 150-year-old college students or something like that.

[00:32:56] Mario: In the future, we are going to extend those capabilities, and actually it's coming to the product soon: we will have a crazy type of generation mode that will be even more creative in that regard. Then, of course, there might be crazy things coming out, but yes, it will still be in the hands of the data owner, the one who is configuring the synthetization, to be sure that you are using the right level for your purpose.

[00:33:26] Alexandra: Crazy data generation. Nobody should be saying data is boring after this. Very, very interesting, very much looking forward to this. Maybe also, since our listeners always enjoy stories and something from the real world, can you think back to some interactions with customers and why they are interested in some of these features? Is there any story that you can share with us?

[00:33:51] Mario: Actually, I had a conversation with a customer today, so it's a very fresh story, with-.

[00:33:56] Alexandra: Very nice. Hot off the press.

[00:33:59] Mario: Hot off the press. They were trying to understand how to exploit synthetic data, again, to make sure the machine learning models they were creating could be tested in different scenarios. One of the problems was what the limits are of using real data, anonymized data, or even classical rule-based synthetic data, data they were actually scripting and using for bulletproofing those machine-learning models. The problem is that, of course, for them, creating data by hand actually introduces bias into the process, because every human, when thinking about extreme scenarios, will start from their own experience.

They will understand what is feasible, what is realistic, and what is not, and they will create data based on what they've seen. We as humans are full of biases in the way we judge the world and the way we see it; it's how our brains are built. They were really struggling to understand what the best way is to combine synthetic data and augment the real data beyond the script-based synthetic data they were creating. Luckily, they are actually starting with the latest version of the product soon. We are going to help them by generating that creative synthetic data and integrating it into their pipeline for evaluating machine learning models.

[00:35:38] Alexandra: That's definitely a nice story to share. I just remembered, I still need to ask you, of course, about the other side of the spectrum, which is the conservative generation mode. For whom is the conservative generation mode the right choice? For those who don't want too much excitement in their data?

[00:36:02] Mario: I would like to say for the boring people, but of course that's not true.

[00:36:06] Alexandra: [laughs] No, that's not the case.

[00:36:08] Mario: The reality is that having data that is actually sampled from the majority of the cases can help you to stabilize specific pipelines, for example. Imagine that you have a decision system, or whatever you want to bulletproof with long-running tests, or you want to be sure that, for example, your testing data, while doing stress testing and so on and so forth, focuses on the real cases. You want to be sure that there are no crazy combinations, or even the noise that might be in real data. If you think about it, while we are doing data collection, noise is part of the process. In those cases, we are actually collecting information that might be wrong.

By creating synthetic data with the conservative generation mode, you can be sure that you are going to remove any possible noise, and you will have 100% business rule adherence and so on and so forth in the output dataset. It's not just, of course, about removing any type of noise that the generation process can introduce; it's mainly about removing the noise that could be in the original data too. It's not about privacy, of course, because those mechanisms are built in and so on and so forth, but it's about removing noise from the original data. Think about any process that can benefit from that.
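To picture what "100% business rule adherence" can mean in practice, here is a small sketch of the kind of validation one might run on a generated dataset before feeding it into a test pipeline. The rules and column names are hypothetical examples, not rules enforced by any particular product.

```python
# Sketch: validate business-rule adherence of a (synthetic) dataset
# before feeding it into a test pipeline. Rules and columns are
# hypothetical examples, not rules enforced by any specific product.
import pandas as pd

rules = {
    "age in plausible range":      lambda df: df["age"].between(18, 110),
    "death date after birth date": lambda df: df["death_date"].isna()
                                               | (df["death_date"] > df["birth_date"]),
    "phd implies adult":           lambda df: (df["education"] != "phd") | (df["age"] >= 22),
}

def rule_adherence(df: pd.DataFrame) -> pd.Series:
    """Share of rows satisfying each rule; 1.0 everywhere means full adherence."""
    return pd.Series({name: check(df).mean() for name, check in rules.items()})

synthetic = pd.DataFrame({
    "age": [34, 67, 25],
    "birth_date": pd.to_datetime(["1990-01-01", "1956-05-12", "1998-09-30"]),
    "death_date": pd.to_datetime([None, "2021-03-02", None]),
    "education": ["msc", "phd", "bsc"],
})

print(rule_adherence(synthetic))
```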

[00:37:29] Alexandra: It's more on the testing side and building systems, less on the analytics and learning new aspects about your customers. Is there also an analytics use case for that?

[00:37:40] Mario: Definitely, no. I think it's more for testing and building models than for extreme testing scenarios and so on and so forth, or for data analytics. It's definitely for testing purposes, to be sure that whatever pipeline you are building is free of noise, and that you are not introducing any kind of noise into your final model or into your tests.

[00:38:09] Alexandra: Understood. I think we already gave a very good introduction to generative AI for structured synthetic data, both on the data diversity side, with rebalancing and the generation modes that you just explained, as well as on the data augmentation side, with smart imputation. The question that remains for me, and that I want to discuss with you for our audience, is: is this just the beginning of generative AI for synthetic data? What can we expect in the future? Is there anything you're particularly excited about in this space, or what's still possible on the horizon of generative AI?

[00:38:45] Mario: There are a lot of possibilities in the future. I'm super excited. There are, of course, two streams of evolution that I see in front of us. One is actually exploiting everything around generative AI and the potential of models that understand a customer base, like I was saying at the beginning of the podcast. Here we have a lot of possibilities for creating what-if scenarios, simulations of user behavior for the future, and so on and so forth. Exploiting this part of the model's capabilities is what excites me in terms of possibilities from a business perspective.

On the other hand, there is data democratization, which is still our main goal. We want any organization to be able to thrive with synthetic data. I think our vision and mission as a company doesn't stop at dataset creation. We have to actually get closer to the value of synthetic data; we have to be closer to the consumption of synthetic data itself. Anything that will speed up any process, from creating your own dashboards to connecting your tools directly to MOSTLY AI via API.

Making sure that you can exploit and use synthetic data in your tools while you're doing machine learning development and so on and so forth is part of what we have to do. Because right now, one of the things we are seeing is that when you are doing a machine learning project or a data analysis project, connecting to the data can be one of the main challenges: because of privacy, because of infrastructure, because of an infinite number of problems and barriers that you have in front of you. By cutting the risk and cutting the barriers we are not only helping our customers to use the data but to actually get closer to their objectives.

[00:40:41] Alexandra: Understood. Again, a follow-up question for the sake of our listeners. Synthetic data removes the privacy challenge of sharing granular real-world data. Why do you need to be even faster in accessing data, in getting access to the synthetic version of your production data? What is not working right now, or what would help to make it even faster?

[00:41:04] Mario: This is a problem that is actually being addressed by different initiatives, from data mesh to data fabric, that are happening right now in organizations. Unfortunately, data right now lives in silos, and it lives in silos for many reasons. It lives in silos because of privacy, it lives in silos because of older architectures, it lives in silos because different teams are collecting different types of information.

The goal, for example, of modern data architectures is to make sure that the data gets closer to the data consumers. By creating privacy-safe synthetic data we can cut this time, and by actually letting people collaborate through our platform and access data on our platform, we will reduce this time even further. It will be even better. People will be proud to share the synthetic data they created with others, and people will start collaborating with synthetic data. That, I think, will be the enabler for any use case that you might have.

[00:42:08] Alexandra: Makes sense. That's what we also saw with some other customers who are already starting out with synthetic data: kind of internal marketplaces, just so that the visibility of the available synthetic data assets goes up. Only having one synthetic dataset in one part of your organization for one use case obviously doesn't unleash the same amount of innovation as when you democratize access to synthetic data.

I agree with you, definitely an exciting road ahead, both for us and obviously all of our customers who are democratizing access to data. Coming back to the first point that you mentioned, which was what-if scenarios and simulations: this is something that's already being done in the analytics space. What is the exciting part that generative AI and synthetic data could bring to this in the future?

[00:42:56] Mario: I think it's the same difference as between, for example, classical anonymization techniques and synthetic data, or between data aggregation that removes information through generalization and granular-level synthetic data. When you are doing simulations, what-if scenarios, with classical analytics techniques like predictive analytics and so on and so forth, you typically reason at an aggregated level. You try to understand, based on a desired outcome, how your customer base could behave, or the risk of a user churning, and so on and so forth.

With those techniques, it's impossible to actually have granular-level data that you can reason on and inspect row by row. This is the difference with simulations and what-if scenarios through generative AI. You can actually access row-level data that is privacy-secure and at the same time representative of your customer base and of how the customer base will behave. Then, of course, on top of that, you can do aggregations, you can train machine learning models, you can do whatever you want with it. It will be granular-level data that anyone in your organization will understand and will be able to use.

[00:44:15] Alexandra: This makes sense. Wonderful. Mario, thank you so much for being with us today. Maybe as a last question to you: are there any final remarks that you want to share with our listeners? Any recommendations on how to proceed on their synthetic data journey, or what to test out now in the free version, where we have all of these features that we talked about?

[00:44:36] Mario: Being a product guy, the only recommendation I can make for any listener of the podcast is to try the latest version of MOSTLY AI, in our free version, with the features that we put in, and give us feedback. Give us feedback about how you use synthetic data, how you would actually like to change or program the synthetic data by augmenting it or creating more diversity, and so on and so forth. Let us know. I'm here for that. I'm super curious about your use cases and to learn from every one of you.

[00:45:08] Alexandra: I was kind of expecting this answer. If you want to meet Mario, and of course all of us at MOSTLY AI, we'd be super happy: try it out and provide feedback. Thank you very much for being with us today, Mario.

[00:45:19] Mario: Thank you Alexandra for having me.

[00:45:28] Alexandra: Wow, there's really a lot ahead in the future of synthetic data. I hope you enjoyed this glimpse into what's next and what's already there, and what we are already tapping into when it comes to synthetic data beyond privacy. As always, we are super interested to hear your comments, questions, or remarks. You can either reach us on LinkedIn or write us a short email to podcast@mostly.ai. Until then, I'm very much looking forward to having you with us again in a few weeks.

Ready to try synthetic data generation?

The best way to learn about synthetic data is to experiment with synthetic data generation. Try it for free or get in touch with our sales team for a demo.