[00:00:00] Hi, and welcome to this tutorial on performing multi-table data synthesis. In this tutorial, you will learn how to create synthetic data from a relational data structure. In many of the tutorials we've done so far, we've looked at synthesizing a single data table, but in many use cases you will have data spread across many different tables, for example in a relational database.
[00:00:20] MOSTLY AI makes it easy to take this relational data and synthesize it all at once, maintaining all the relationships between the tables. To see this in action, we'll be working with a trimmed-down version of the Berka dataset containing financial transactions from Czech banks.
[00:00:39] This dataset contains eight tables, one of which is a reference table, which we won't synthesize since it doesn't contain any sensitive information. The other seven tables all contain privacy-sensitive information. We'll synthesize all of those, including the relationships between them.
[00:00:55] Now, MOSTLY AI offers you two different ways to perform multi-table data synthesis, and you will explore both of those in this tutorial. The first approach is through an ad-hoc job: manually uploading the data tables, usually as CSV or Parquet files, and then manually defining the relationships between them.
[00:01:13] The second option is to connect MOSTLY AI to your relational database and import the relational data directly from the database. This will also import the relationships and save you some time on defining all those relationships manually. The nice thing about the second approach is also that MOSTLY AI can deliver your synthesized data directly back into your relational database.
[00:01:34] While we'll walk through both approaches in this tutorial, the main focus will be on the second approach, in which we connect directly to a relational database,
[00:01:43] as this is the most common approach for working with multi-table data that we see in industry.
[00:01:47] We'll start by first helping you set up the database infrastructure and connecting MOSTLY AI to that database. We'll then synthesize the data, and once we have all the relational data synthesized, we will poke around the dataset a little bit to inspect its referential integrity and to check the statistical properties of the synthesized data.
[00:02:08] All right, let's jump in. So let's start by getting you set up with the database. I'll be working in Google Cloud for this tutorial, but you can of course use a different cloud provider if you don't have a local database set up.
[00:02:23] We'll create a PostgreSQL instance, give it a name, let's say Berka-multi, and set a password. I'll make some choices here to keep this cost-effective. The most important thing to remember here is to configure the network settings so that MOSTLY AI, and also the Colab notebook if you're working in Colab, can access this PostgreSQL instance.
[00:02:53] Once that's set, we can create the instance. This will take a few minutes. I'll be right back.
[00:02:59] All right, the instance is now up and running, so we can access it here. Let's copy the public IP address and go back to our notebook. You can copy the IP address here into the DB host variable and set the DB user to postgres. Then here I'm using the dotenv library to pass the password securely as an environment variable.
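For reference, here is a minimal sketch of what that connection setup might look like in the notebook; the variable names DB_HOST, DB_PORT, DB_USER, and DB_PASSWORD are illustrative and may differ from the actual tutorial code, and the host placeholder needs to be replaced with your instance's public IP.

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads DB_PASSWORD from a local .env file

DB_HOST = "<public-ip-of-your-instance>"  # the public IP copied from the Cloud SQL page
DB_PORT = 5432
DB_USER = "postgres"
DB_PASSWORD = os.environ["DB_PASSWORD"]   # keep the password out of the notebook itself
```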
[00:03:20] We'll run that cell, and the next step is to create two databases on this instance: one to hold the original data and a second that will serve as the destination where MOSTLY AI will drop the synthesized data when it's done. The next cell creates two helper functions, one to create a database and another to connect to it. We'll prepare that
[00:03:41] and then we can go ahead and create our source and destination databases.
[00:03:46] We'll call the source database berka_original and the destination database berka_synthetic.
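A possible shape for those two helper functions, sketched here with SQLAlchemy and the psycopg2 driver; the function names create_database and connect_to_database are assumptions, and the actual notebook code may differ.

```python
from sqlalchemy import create_engine, text

def create_database(db_name):
    # connect to the default "postgres" database with autocommit enabled,
    # because CREATE DATABASE cannot run inside a transaction
    admin_engine = create_engine(
        f"postgresql+psycopg2://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/postgres",
        isolation_level="AUTOCOMMIT",
    )
    with admin_engine.connect() as conn:
        conn.execute(text(f'CREATE DATABASE "{db_name}"'))

def connect_to_database(db_name):
    # return an engine bound to the given database
    return create_engine(
        f"postgresql+psycopg2://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{db_name}"
    )

create_database("berka_original")
create_database("berka_synthetic")
```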
[00:03:52] All right, now we can load the dataset from the repo into the source database. Let's first check whether we're running in Google CoLab or locally.
[00:04:00] I'm running this locally, and if you are too, then you have to make sure that your repo path is set correctly. In the next cell, we take the CSV files that are in the repo and use the pandas to_sql function to turn them into SQL tables.
[00:04:14] Then we'll add those to the source database. We can see that the tables are being created. Now, we're all done with no issues encountered.
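Roughly, that loading cell could look like the following; the repo path variable, the data folder layout, and the exact table list are assumptions based on the trimmed-down Berka dataset used here.

```python
import pandas as pd

# connect_to_database comes from the earlier sketch;
# `repo` points at your local clone of the tutorial repository
source_engine = connect_to_database("berka_original")

tables = ["account", "card", "client", "disposition", "loan", "order", "transaction"]
for table in tables:
    df = pd.read_csv(f"{repo}/data/{table}.csv")
    # write each DataFrame into the source database as a SQL table
    df.to_sql(table, source_engine, if_exists="replace", index=False)
    print(f"created table: {table}")
```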
[00:04:25] Now, the next cell is very important because in this cell we are defining the relationships between the tables.
[00:04:31] The first block here defines all the primary keys, that is, the unique ID column within each table. The second block defines the foreign-key relationships between the tables, so we can run this cell.
[00:04:42] Now our relational database contains the information it needs about the relationship between the tables. Amazing.
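As a sketch, those two blocks could be plain ALTER TABLE statements like the ones below; the key column names follow the standard Berka schema, so double-check them against the repo's version of the data.

```python
from sqlalchemy import text  # source_engine was created in the previous sketch

primary_keys = {
    "account": "account_id",
    "client": "client_id",
    "disposition": "disp_id",
    "card": "card_id",
    "loan": "loan_id",
    "order": "order_id",
    "transaction": "trans_id",
}

foreign_keys = [
    # (child table, child column, parent table, parent column)
    ("disposition", "account_id", "account", "account_id"),
    ("disposition", "client_id", "client", "client_id"),
    ("card", "disp_id", "disposition", "disp_id"),
    ("loan", "account_id", "account", "account_id"),
    ("order", "account_id", "account", "account_id"),
    ("transaction", "account_id", "account", "account_id"),
]

with source_engine.begin() as conn:
    # primary keys first, so the foreign keys have something to reference
    for table, pk in primary_keys.items():
        conn.execute(text(f'ALTER TABLE "{table}" ADD PRIMARY KEY ("{pk}")'))
    for child, fk_col, parent, pk_col in foreign_keys:
        conn.execute(text(
            f'ALTER TABLE "{child}" ADD FOREIGN KEY ("{fk_col}") '
            f'REFERENCES "{parent}" ("{pk_col}")'
        ))
```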
[00:04:50] You now have a PostgreSQL instance up and running with two databases, one for the source data and one for the destination. You have your source data in the database with all the relationships defined.
[00:05:02] The next step is to take this original relational dataset and to synthesize it using MOSTLY AI. Let's go ahead and do that.
[00:05:11] Go ahead and navigate to your MOSTLY AI account. We'll need to do a little bit of setup here in order to prepare MOSTLY AI to retrieve the data from the relational database and to deliver it to the second database instance correctly. In order to do that,
[00:05:28] we'll need to define some connectors and the data catalog.
[00:05:31] We'll start with defining the connectors. You can go to the connectors tab and create a new connector. You'll click on PostgreSQL in this case, give the connector a name, for example berka_original, and select whether it's a source or destination type.
[00:05:50] Then provide all the information that MOSTLY AI will need in order to connect to this instance. We've already done so here, but I can show you what that would look like. We've got the connector name, the hostname, the port, the username, the password, and the database and schema.
[00:06:10] You should then do the same for the synthetic destination database, again providing the name and all the other information.
[00:06:18] Once that's done, navigate to the catalogs tab and create a new catalog. You have to select the source database connector here. We'll start with berka_original and click on next.
[00:06:31] Now we see that we have no data here, so let's add some tables. We'll start with the account table and include all the child tables. These are all the linked tables.
[00:06:43] If we go back to the top of our notebook, we'll see that account actually has quite a few tables linked to it: transaction, loan, order, and disposition, and disposition in turn has card linked to it. All five of these tables should be pulled in.
[00:07:01] All right, so we now have the account table and the five tables that are linked to account. There's a cool little feature here. The relationship diagram shows you how these tables that you have imported so far are linked together.
[00:07:14] We see that the relationship information has been correctly imported from the database. We're just missing the client table. Let's add that and make sure to also include the child tables for this one.
[00:07:27] so that the relationships are imported as well.
[00:07:30] All right, the client table has now also been imported.
[00:07:34] If we look at the relationship diagram, we see that the account and the client tables are two subject tables.
[00:07:41] We have a bunch of different linked tables.
[00:07:44] We also notice two different types of relationships between tables.
[00:07:48] We have a context relationship,
[00:07:50] which preserves a one-to-one mapping between primary and foreign keys.
[00:07:55] We also have something called a Smart Select relationship,
[00:07:58] which is generally used when a table has two different subject tables linked to it.
[00:08:04] In this case, we have the disposition table,
[00:08:07] which is linked both to account and to client, but we cannot use both of those as the context for the generation.
[00:08:16] We have defined the relationship between disposition and account as context.
[00:08:21] There will be a one-to-one mapping preserved between the account ID in disposition and the account ID in account.
[00:08:29] But the mapping between the client ID in disposition and the client ID in client will be done using the Smart Select relationship,
[00:08:38] which is a mechanism that preserves the mapping on a best-effort basis. If you're interested in learning more about this,
[00:08:44] I would recommend checking out our documentation on the Smart Select relationship to better understand how this works.
[00:08:51] One way that we can improve the Smart Select result is to go to Data settings, open the disposition table, and select the client ID column,
[00:08:59] which is now set to the Smart Select generation method. We'll add the columns here that will be used for the Smart Select configuration.
[00:09:10] This will improve the best-effort mapping between the client ID in disposition and the client ID in client.
[00:09:18] Once we have that set up, we can click next up here and our job is ready to launch.
[00:09:24] Well, almost, we still have to select our data destination.
[00:09:25] In many of the previous tutorials, we have usually used download as CSV or Parquet as the destination. In this case, we'll deliver the synthesized data right into the database that we created for this purpose. You can select that option here. If you're running this job a few times in a row, you can also choose to remove the tables that are already in the destination database. You can now launch the job.
[00:09:51] We see here now that we have our catalog job running, and that it is fetching data from all the tables. It will then continue to analyze and encode the data, train the model, generate the data, analyze the results, and so forth. This will take a little while to run, so I'll be back in a bit. We're back.
[00:10:13] Once this job is done, normally you would go here to download the synthetic data. Remember that we set up the database connector and selected that as the output destination for the synthetic data. If we go to our notebook, we should be able to access the database instance and pull the data from there.
[00:10:30] All right, so we're back in our notebook and we can use this cell to fetch the synthetic data from the destination database. That looks like it's working, we're extracting the different tables. Now let's show three sample records for each table. We get some for account, for card, for client, and all the other tables. That looks pretty good.
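A sketch of how that fetch could be done, reusing the connect_to_database helper from the earlier sketch; the table list and the syn dictionary name are illustrative.

```python
import pandas as pd

synthetic_engine = connect_to_database("berka_synthetic")

# pull every synthetic table into a DataFrame
syn = {}
for table in ["account", "card", "client", "disposition", "loan", "order", "transaction"]:
    syn[table] = pd.read_sql(f'SELECT * FROM "{table}"', synthetic_engine)
    print(table, syn[table].shape)

# show three sample records for each table
for table, df in syn.items():
    print(f"\n=== {table} ===")
    print(df.sample(3))
```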
[00:10:55] Now let's do some basic quality checks. Let's see how statistically representative the synthetic data is of the original. Of course, you can also always go to the QA report and inspect the accuracy and other measures here.
[00:11:09] Just for good measure, let's do it here in the notebook as well.
[00:11:12] We see here that the 10%, 50%, and 90% quantiles of the transaction amounts match up nicely between the synthetic and the original dataset.
[00:11:20] Let's see how that's looking for the dates.
[00:11:23] We see that we get pretty close. It's a good sign.
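The quantile comparison might look roughly like this; the amount column name follows the Berka schema, and the orig dictionary of original tables is loaded here the same way as the synthetic one. The dates can be compared analogously once parsed with the appropriate date format.

```python
# load the original tables for comparison (source_engine from the earlier sketch)
orig = {t: pd.read_sql(f'SELECT * FROM "{t}"', source_engine) for t in syn}

quantiles = [0.1, 0.5, 0.9]
print("original :", orig["transaction"]["amount"].quantile(quantiles).round(1).tolist())
print("synthetic:", syn["transaction"]["amount"].quantile(quantiles).round(1).tolist())
```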
[00:11:26] Next, let's check the referential integrity.
[00:11:29] Here we're checking whether all of the primary keys correspond correctly to all of the foreign keys in the related tables.
[00:11:36] We're checking here if all of the account IDs in transaction match those in account for the synthetic data set. Then we do the same for card and disposition.
[00:11:47] We get no error message, so this assertion passes correctly and we're all good.
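The referential integrity assertions could be written along these lines; the key column names are again taken from the standard Berka schema.

```python
# every foreign-key value in a child table must exist as a primary key in its parent
assert syn["transaction"]["account_id"].isin(syn["account"]["account_id"]).all()
assert syn["card"]["disp_id"].isin(syn["disposition"]["disp_id"]).all()
assert syn["disposition"]["account_id"].isin(syn["account"]["account_id"]).all()
print("referential integrity checks passed")
```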
[00:11:53] Next, let's take a look at the cardinality of the context foreign key relations.
[00:11:58] Remember, we defined a bunch of different relationships between the tables, and most of these we defined as context relationships, which should keep a one-to-one mapping of the foreign and primary keys.
[00:12:10] We see here that the orders per account in the synthetic and the original match up nicely.
[00:12:16] Let's take a look at the cards for disposition, and again, we see that they match well.
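A sketch of that cardinality comparison; the cardinality helper shown here is illustrative rather than the exact code from the notebook.

```python
def cardinality(df, key):
    # number of child records per parent key, summarized as a distribution
    return df.groupby(key).size().describe().round(2)

print("orders per account (original):")
print(cardinality(orig["order"], "account_id"))
print("orders per account (synthetic):")
print(cardinality(syn["order"], "account_id"))

print("cards per disposition (original):")
print(cardinality(orig["card"], "disp_id"))
print("cards per disposition (synthetic):")
print(cardinality(syn["card"], "disp_id"))
```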
[00:12:21] Now let's take a look at the cardinality of the Smart Select foreign key relationships.
[00:12:26] We expect these not to be maintained, because Smart Select is a lot more flexible and does not give any guarantees that the cardinality is retained.
[00:12:34] We indeed see a mismatch in the dispositions per client between the synthetic and the original dataset.
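The same check applied to the Smart Select relationship, reusing the cardinality helper from the sketch above; here a divergence between the two distributions is expected.

```python
print("dispositions per client (original):")
print(cardinality(orig["disposition"], "client_id"))
print("dispositions per client (synthetic):")
print(cardinality(syn["disposition"], "client_id"))
```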
[00:12:40] Now, if you remember, we also helped the Smart Select mechanism a little bit by adding specific columns to its configuration.
[00:12:48] If we run this cell now, we should see that the share of accounts and clients with identical district IDs
[00:12:54] comes out close for the original and the synthetic data. Indeed it does. This is because we added those columns in the Smart Select configuration.
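A hedged sketch of that district-ID check, assuming both account and client carry a district_id column as in the original Berka schema; the helper name and merge logic are illustrative.

```python
def share_same_district(tables):
    # join each disposition to its account and its client,
    # then measure how often the two district_id values agree
    merged = (
        tables["disposition"]
        .merge(tables["account"][["account_id", "district_id"]], on="account_id")
        .merge(
            tables["client"][["client_id", "district_id"]],
            on="client_id",
            suffixes=("_account", "_client"),
        )
    )
    return (merged["district_id_account"] == merged["district_id_client"]).mean()

print("original :", round(share_same_district(orig), 3))
print("synthetic:", round(share_same_district(syn), 3))
```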
[00:13:03] With that, we come to the end of this tutorial. In this tutorial, you have learned how to synthesize a multi-table relational database.
[00:13:11] We saw how to connect to the database, how to fetch the data from that database, and how to synthesize it while maintaining all of the relationships between the tables. We then inspected this data and saw that the statistical properties are closely matched and the referential integrity is fully retained.
[00:13:26] You also learned about the difference between context and Smart Select relationships and saw the limitations of the Smart Select relationship.
[00:13:33] We saw that whether your data is stored in a relational database or in files on disk, you can use MOSTLY AI to easily import the data, maintain the relationships, and synthesize all the multi-table data at once.
[00:13:46] If you do work through this code and run into any questions or things that you would like to share, please don't hesitate to reach out. We'd love to hear from our users. Thank you so much for watching and see you in the next one.