💡 Download the complete guide to AI-generated synthetic data!
Go to the ebook
July 25, 2023
6m 54s

Tips and Tricks for Synthetic Data Generation - How to Empty the Target Schema

Transcript

In this video, we'll show you how to make sure that your target schema is empty before you start generating synthetic data with schemas. #syntheticdata, #syntheticdatageneration, #tipsandtricksforsyntheticdata, #emptyschema #psql

Transcript

[00:00:00] In today's tips and tricks video, I want to talk about how to make sure that a target schema that you want to use for creating synthetic data is actually empty, and if not, how to make it empty.

[00:00:13] Let's quickly review why we need two schemas. We need one schema to read from the database, and then we need another one to write to the database. I've already set up both connectors here, and let's take a quick look.

[00:00:28] One is called players. That's where the original data is from our baseball players. We're using the usual connection parameters here. Then we have a schema called baseball.

[00:00:41] This one uses a capital letter, which actually is not my favorite because it will require me using quotation marks later on, but for now, I will live with that. Then there is the output schema here, which is called baseball_syn. I always use syn to indicate that something is synthetic, and that schema is called baseball-underscore-syn, all lowercase.

[00:01:07] Everything else is the same. We have one schema to read from, and then we have another schema to write to. Remember that the destination schema needs to be empty. This is where we are now going to check what's going on.

[00:01:25] I'm using a PSQL client, which is the command line client. It's the most crude or raw technique perhaps of doing this. It's not got a fancy UI, but if you have a fancy UI, it should be very easy to follow these same steps because it will probably be even easier than this.

[00:01:47] The PSQL client here uses the same credentials and configuration parameters that you just saw in the UI there. It just allows me to then use either these backslash commands, which one just has to remember, or you can also use SQL, so I'll show you both.

[00:02:10] dn will show you all the schemas. That's the baseball schema that we were talking about, and that's the baseball synthetic schema that we were talking about.

[00:02:20] Then you can use the spec d, which means display Baseball-underscore-syn-dot-star to see if there's any tables in that schema. It turns out that yes, there are two tables in here because someone has run a synthetic data job before. There are two tables there that need to be gotten rid of.

[00:02:43] The other way to do the same thing with SQL is to say, select table name from information schema.tables where table schema equals baseball. Well, in this case, I'm looking at the origin. Obviously, that's something that we want, but then we can say players-underscore-syn, and we should get the same result that we got up there. Let's see. I'm not, oh yes, baseball_syn, of course.

[00:03:27] These two tables are already there, and I need to get rid of them, so there's two ways to do that. The obvious way perhaps might be to drop them one by one, but I'm somewhat lazy, so I'm going to do it in a more radical way, which also ensures that the schema is really, really empty. I can do that by saying drop schema, baseball_syn, and then cascade, which means that it will delete everything else.

[00:04:04] All children objects are specifically tables, and if I do that, then not only are the tables gone, but also the schema is gone. I need to recreate it. Now, it's recreated, now if I do the same query that I did before, it should come up empty. Or if I do the same d query, it should also come up empty. That's how I can make sure that the schema is empty, and then I can go and start my job.

[00:04:45] Let's see how I can do that. I go into catalogs, say create catalog from the players input schema. Next, select all the tables that I'm interested in, and they are these two. Mark subject tables. Only players is a subject table. The other one is a linked table, order them, nothing much to order and then confirm my choices.

[00:05:12] Perfect. Now, if you've seen similar demos, you know that they need to be linked with a primary key - foreign key relationship. In this case, that has already happened.

[00:05:25] Because it's a SQL database that relationship has already been established. You can see that the player's ID is indeed referencing the player's table. Nothing for me to do.

[00:05:40] Then, under here, I can double-check that the dates are handled in the correct way, and actually turns out that the database table for the column type for birth date and death date is string in the database. Perhaps that's something that I should fix in another tips and tricks video.

[00:06:04] For now, I'm just manually overriding that and say, "Hey, I really want you to treat the birth date and the death date as date time." For the rest, I perhaps don't need the name at all because in another video I explained how the name doesn't really make sense to use in the synthetic data set.

[00:06:25] There we go so we can start the job and it will complain now about not having set a destination. I can fix that and say, here I want to put into players_syn into that schema. That's where I want to write my synthetic data and launch the job.

[00:06:47] Then we can come back and take a look at the wonderful synthetic data later but for now, that's it.

Ready to try synthetic data generation?

The best way to learn about synthetic data is to experiment with synthetic data generation. Try it for free or get in touch with our sales team for a demo.
magnifiercross