💡 Download the complete guide to AI-generated synthetic data!
Go to the ebook

And is it really possible to securely anonymize the location data that is currently being shared to combat the spread of COVID-19?

To answer these and more questions, SOSA’s Global Cyber Center (GGC) invited our CEO Michael Platzer to join them on their Cyber Insights podcast for an interview. For those of you, who don’t know SOSA: it’s a leading global innovation platform that helps corporates and governments alike to build and scale their open innovation efforts. What follows is a transcript of the podcast episode.

William: Wonderful, now Micheal, when you think about the broad array of cybersecurity trends that are unfolding today – ranging from new threats to new regulations – what is really top of mind for you in 2020?‍
 

Michael: Thanks for having me! We are MOSTLY AI and we are a deep-tech startup founded here in Europe while preparing for GDPR. Very early on, we had this realization that synthetic data will offer a fundamentally new approach to data anonymization. The idea is quite simple. Rather than aggregating, masking or obfuscating existing data, you would allow the machine to generate new data or fake data. But we rather prefer to say “AI-generated synthetic data”. And the benefit is, that you can retain all the statistical information of the original data, but you break the 1:1 relationship to the original individuals. So you cannot re-identify anymore – and thus it’s not personal data anymore, it’s not subject to privacy regulations anymore. So you are really free to innovate and to collaborate on this data – but without putting your customers’ privacy at risk. It’s really a fundamental game-changer that requires quite a heavy lifting on the AI-engineering side. But we are proud to have an excellent team here and to really see that the need for our product is growing fast.

William: Very interesting! Now, we know that location data is among our most accessible PII – we kind of give it out all the time via our mobile device. In the wake of the coronavirus, we are seeing calls to use our location data to track the spread of this pandemic. Is it possible to really effectively anonymize and secure our location data? Or can this data just be reverse engineered? Could using synthetic data help?

Michael: Yes definitely, and we are also engaging with decision-makers at this moment in this crisis. Location data is incredibly difficult to anonymize. There have been enough studies that show how easy it is to re-identify location traces. So what organizations end up with is only sharing highly aggregated count statistics. For example, how many people are at which time at which location. But you lose the dimension at the individual level. And this is so important if you want to figure out what type of socio-demographic segments are adapting to these new social distancing measures, and for how long they do that. And is it 100% of the population that’s adapting, are social contacts reducing by 60% or is it maybe a tiny fragment of segments that is still spreading the virus? To get to this kind of level to intelligence you need to work at a granular level. So not on an aggregated level, but on a granular level. Synthetic data allows you to retain the information on a granular level but break the tie to us individually. We just, coincidentally, in February wrote a blogpost on synthetic location traces – so before the corona crisis started – because we were researching this for the last year. It’s on our company blog and I can only invite people to read it. Super exciting new opportunities now to anonymize location traces!

William: That is exciting – and it sounds as if it could be very helpful, especially given what we are all going through! Now, Micheal, there is an expanding list of techniques to protect data today; from encryption schemes, tokenization, anonymization, etc. Should CISOs look at the landscape as a “grocery shelf” with ingredients to be selected and combined or should they search for one technique to rule them all?

Michael: Well, I don’t believe that there is a one-size-fits-all solution out there. And those different solutions really serve different purposes. It’s important to understand that encryption allows you to safely share data with people that you trust – or you think that you trust. Whether that’s people or machines, at the end, there is someone sitting who is decrypting the data and then has access to the full data. And you hope that you can trust the person. Now, synthetic data allows you to share data with people where you don’t necessarily need to rely on trust, because you have controlled for the risk of a privacy leak. It’s still super valuable, highly relevant information. It contains your business secrets, it contains all the structure and correlations that are available to run your analytics, to train your machine learning algorithms. But you have zeroed out your privacy risk! In that sense, synthetic data and encryption serve two different purposes. So every CISO needs to see what their particular challenge and problem is that needs to be overcome.

William: Well Michael, we’re coming up on our time here. Are there any concluding remarks or anything you would like to add before we hang up?

Michael: Well, we just closed our financing round so we’re set for further growth both in Europe as well as the US. We’re excited about the growing demand for data anonymization solutions, also for our solution. Happy to collaborate with innovative companies, who take privacy seriously. And of course, I wish everyone best of health and that we get – also as a global community – just stronger out of the current crisis.

We proudly present the free version of MOSTLY AI's synthetic data platform and warmly welcome you to try it out.

What does MOSTLY AI's synthetic data platform do?

MOSTLY AI enables you to automatically transform your privacy-sensitive big data assets into highly realistic and accurate synthetic datasets. The benefit is, that synthetic data is fully anonymous and thus exempt from data protection regulations. This results in as-good-as-real data, that is free to use, share or monetize. So with MOSTLY AI's synthetic data platform, you are finally able to freely innovate with one of your most valuable resources – all while avoiding the financial, regulatory or reputational risks of a privacy violation!

How does our synthetic data platform work?

MOSTLY AI's synthetic data platform leverages state-of-the-art generative deep neural networks with in-built privacy mechanisms to automatically learn the patterns, structure and variation from an existing dataset. Once the training is completed, it allows you to simulate an unlimited number of highly realistic and representative – but completely anonymous – synthetic customers. Thereby, you are able to retain all the valuable information in your data assets, while at the same time rendering the re-identification of any individual impossible.

How does the free version compare?

The free version of MOSTLY AI's synthetic data platform comes with the same powerful core technology as the enterprise version: our fully automated Synthetic Data Engine. This will enable you to generate synthetic data based on a provided actual dataset and to experience the magic of generative AI in action. Try it out today and persuade yourself of its quality, its flexibility and its ease-of-use.

Automatically generated quality assurance reports allow you to easily compare
the quality of your synthetic data (blue) with the original (grey).

But keep in mind, that it is a free version and that some usage restrictions apply. While the enterprise version of MOSTLY AI's synthetic data platform supports multiple tables and extremely large datasets with millions of rows and hundreds of columns, we limit the input for the free version to a single table with 50.000 rows and 50 columns. Furthermore, you won’t have the same flexibility as with our enterprise version to configure the training or the generation process. And as we operate this demo on a low-cost cloud infrastructure, the compute time will be significantly longer than for production setups.

Lastly, since it is a demo version, you must not use it to upload any personal or sensitive data. But no worries – if you don’t have a suitable dataset at hand, just go to Kaggle and choose a publicly available dataset you like. Or, if you are looking for even more convenience, simply select one of the datasets already provided on the demo site to start your first synthesis run in a matter of seconds!

With that being said, we warmly welcome you to generate synthetic data and to start your own personal journey with this groundbreaking technology. We are very much looking forward to your valued feedback and are here to answer your questions, should any arise.

magnifiercross