July 27, 2023
4m 29s

Tips and Tricks for Synthetic Data Generation - Replacing Real Data with Synthetic in a Data Catalog

In this video, we discuss tips and tricks for synthetic data generation, namely replacing real data with synthetic data in a data catalog. Synthetic data is an elegant way of showing privacy-safe sample data in a data catalog such as Alation.
If you're looking to anonymize your data catalog, then this video is for you! By learning how to add synthetic data to your data catalog, you'll be able to display perfectly private, yet realistic-looking sample data.

Transcript

[00:00:00] Hello. In today's tips and tricks video, we're going to look at how to replace real data in a data catalog with synthetic data for the sampling.

[00:00:11] As an example, I'm showing you the Alation data catalog here, and we're looking at the table page for an airports table with a description that comes from the original data set, OurAirports. (https://ourairports.com/data/)

[00:00:25] Then you have the possibility in Alation to show sample data but the point is that now we want to show synthetic data. How could we go about that?

[00:00:39] Well, we can connect the Alation data catalog to the original database, with the schema airport and the table airport. We can do metadata extraction there, which extracts the structure of the table and puts it in the data catalog.

[00:00:56] Then we can create a clone of the database, which I have called aviation_syn here (syn for synthetic), with the same schema name and the same table name. We can put our synthetic data there.

[00:01:11] I've used MOSTLY AI to read from the original database and then synthesize the data and put it into the cloned database. Now, of course, this is just an example. Airports don't really - They are obviously public, in the public space.

[00:01:30] There's no reason to protect the privacy of airports, but I've just used this as an example, pretending that all the related attributes, such as the name, the ID, the ICAO code and all these things, are effectively things that need to be protected from a privacy point of view.

[00:01:50] I still want to generate some rich data for users to look at in the data catalog, to get an idea of whether the data is going to be useful to them or not.

[00:02:02] If you know Alation, then you would know that the only way to do that in Alation proper is to mask the column and put the word 'sensitive', but then the user doesn't see what the data is that is hiding behind that sensitive flag,

[00:02:17] and so it's much more elegant to have synthetic data instead in the data catalog.

[00:02:24] I'm not going to spend too much time talking about how I generated this because there's lots of other videos around. Just to, perhaps, point out that I used character sequence as the encoding type for most of these, and I used text for the name. Okay?

[00:02:44] Then I generated the synthetic data and put it into this fake database so to speak. Then I did the following trick in Alation. This is something that, obviously, you will appreciate mostly if you work with Alation.

[00:02:58] In the settings for the database, especially in the case of Postgres, the database name is hidden in here, in the connection string to the database. If I click on the little pencil symbol here, you will see that right now it's set to postgres, but a few minutes ago, when I ran the sample, I actually set it to aviation_syn. That's the trick.

[00:03:26] I basically tricked Alation into grabbing the data for the table from a separate place by changing the database name in the connection string. That's the trick here.
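
The swap described above can be sketched in Python. This is only an illustration, assuming the connection string has the form postgresql://host:port/database; the host name, port and the helper name swap_database are hypothetical and do not come from Alation or from the video.

```python
from urllib.parse import urlsplit, urlunsplit


def swap_database(uri: str, new_db: str) -> str:
    """Return the same connection URI with only the database name replaced.

    Assumes the database name is the path component of the URI,
    as in postgresql://host:port/database.
    """
    parts = urlsplit(uri)
    # Host, port and any query options stay untouched; only the path changes.
    return urlunsplit(parts._replace(path="/" + new_db))


# Hypothetical host and port, for illustration only.
original_uri = "postgresql://db.example.com:5432/postgres"
sampling_uri = swap_database(original_uri, "aviation_syn")
print(sampling_uri)
```

Because the schema name and the table name are identical in both databases, nothing else in the catalog entry has to change for the sample to resolve against the synthetic copy.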

[00:03:40] By doing that, I have added the synthetic data as the sample data for a database that otherwise is, let's say, original in every aspect. If a user had privileges to then connect to the database using Alation's built-in Compose tool, then, of course, they would have access to the original data rather than the synthetic data.

[00:04:04] If you want them to access the synthetic data, you can play the same trick. In the Compose connection string, you can put the database of the synthetic data. Then, of course, as another layer of security, you can use database credentials and authorizations, et cetera, to prevent users from going to see the original data. That's it for today.
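
As a rough sketch of that extra layer on the database side, the Python below only builds PostgreSQL GRANT and REVOKE statements as strings; it does not connect to anything. The role name catalog_reader is hypothetical, and using postgres as the original database name is an assumption based on the connection string shown in the video.

```python
def quote_ident(name: str) -> str:
    """Quote an SQL identifier by doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def access_statements(original_db: str, synthetic_db: str,
                      schema: str, role: str) -> list[str]:
    """Build statements that steer a role to the synthetic database only."""
    orig, syn = quote_ident(original_db), quote_ident(synthetic_db)
    sch, rol = quote_ident(schema), quote_ident(role)
    return [
        # PostgreSQL grants CONNECT to PUBLIC by default, so revoking it
        # from the role alone would not block access to the original data.
        f"REVOKE CONNECT ON DATABASE {orig} FROM PUBLIC;",
        f"REVOKE CONNECT ON DATABASE {orig} FROM {rol};",
        f"GRANT CONNECT ON DATABASE {syn} TO {rol};",
        # The two schema-level grants must be run while connected
        # to the synthetic database.
        f"GRANT USAGE ON SCHEMA {sch} TO {rol};",
        f"GRANT SELECT ON ALL TABLES IN SCHEMA {sch} TO {rol};",
    ]


# Hypothetical role name; database and schema names follow the example.
for statement in access_statements("postgres", "aviation_syn",
                                   "airport", "catalog_reader"):
    print(statement)
```

Revoking CONNECT from PUBLIC affects every role that relies on the default, so any account that still needs the original database would have to be granted CONNECT explicitly.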
