September 21, 2023
14m 9s

Conditional synthetic data generation tutorial

👋 Welcome to this comprehensive tutorial on Conditional Synthetic Data Generation!

In this tutorial, you'll learn how to exercise more control over the synthetic data you generate. By the end, you'll understand how to apply conditional generation in two different scenarios: equalizing gender income distribution in the UCI Adult Income dataset and maintaining the geographic locations of Airbnb listings in Manhattan.

📚 Chapters:

00:00 - Introduction
01:31 - What is Conditional Generation?
02:40 - First Use Case: UCI Adult Income Dataset
06:00 - How to Generate Synthetic Data using MOSTLY AI
08:41 - Analyzing the UCI Adult Income Dataset
11:40 - Second Use Case: Airbnb Manhattan Listings
15:37 - Generating Synthetic Data for Airbnb Listings
18:24 - Analyzing Airbnb Manhattan Listings

🔗 Resources: https://github.com/mostly-ai/mostly-tutorials/tree/dev/conditional-generation

MOSTLY AI's free synthetic data generator: https://bit.ly/43IGYSv

❓ If you have any questions or comments, please feel free to leave them below.

👍 If you found this video helpful, please give it a thumbs up and consider subscribing for more tutorials like this.

Thank you and see you in the next video! 🙌

#ConditionalDataGeneration #DataSynthesis #MOSTLYAI #DataScience #UCIAdultIncomeDataset #AirbnbListings #SyntheticData #SyntheticDataTutorial


[00:00:00] Hi, welcome to this tutorial on conditional generation. [00:00:04] Conditional generation is an important concept and skill to master when you're interested in exercising more control over the synthetic data you generate.

[00:00:14] As the name of the tutorial suggests, conditional generation means that we will be generating synthetic data based on certain conditions that we specify in advance. This can be helpful across a range of different use cases.

[00:00:28] For example, data simulation, in which you want to take an original dataset that you have and you want to simulate what it would look like in the hypothetical situation that certain columns exhibit a specific kind of distribution that's different from the original dataset.

[00:00:43] In this tutorial, we'll be doing exactly that for the UCI Adult Income dataset, in which we'll be simulating what it would look like if there was no gender income gap.

[00:00:53] Conditional generation can also be useful in situations where it's important for you to retain certain columns of the dataset exactly as they are.

[00:01:01] We'll be seeing this in the second use case of this tutorial in which we will maintain the geographic locations of Airbnb listings exactly as they are and let all of the other columns of the data set be completely synthesized.

[00:01:13] It's important to note that the synthetic data that you generate using conditional generation is still statistically representative, but just within the context that you've created. The privacy of the data set is largely dependent on the privacy of the fixed attributes.

[00:01:30] How does this work? As this diagram shows, we'll start with our actual data set.

[00:01:36] In the first use case, that will be the UCI Adult Income dataset. In the second use case, we'll be working with the Airbnb listings of Manhattan.

[00:01:45] We'll pre-process this dataset by splitting it into a data frame containing our context, which is the fixed attributes, and our target, which is the attributes that we want to synthesize. These two tables will need to be linked to each other with a unique ID column.

[00:02:02] We'll then train a MOSTLY AI model on these two tables. Once we have that model ready, we'll use it to generate more data.

[00:02:06] This is where it gets interesting. Rather than just generating more of exactly the same kind of data, we'll provide a seed context for this generation step, in which we will say what we want the fixed attributes to look like. With this seed context, we'll then generate more synthetic target data.

[00:02:28] The last step is to merge these into a single data set that is partially synthetic with the seed context, which was predetermined by us, and then the rest of the columns, which are entirely synthetic. All right, let's jump in.

[00:02:42] We'll start with our first use case, in which we will be simulating what the UCI Adult Income dataset would look like if there was no gender income gap. Just a quick refresher: we'll start with our dataset, split it into two, create a seed context, use that to generate more data, and then merge the two together.

[00:03:03] As always, first step is to access the data. If you're running this in Colab, you can just point your notebook to the repo and access the data that way.

[00:03:14] Once we have access to our data, let's read in the CSV with pandas, and we'll split this CSV into our context with the sex and income columns and the target, which is the rest of the columns.

[00:03:26] We then create a unique ID column to link the two tables together, and we'll save these as CSV. Here they are.
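In code, the context/target split can be sketched like this. This is a minimal sketch with a toy stand-in for the dataset; the real notebook reads the full UCI Adult CSV, and the file names here are assumptions:

```python
import pandas as pd

# Toy stand-in for the UCI Adult Income dataset (a few assumed columns)
df = pd.DataFrame({
    "age": [39, 50, 38, 28],
    "education": ["Bachelors", "HS-grad", "Masters", "Bachelors"],
    "sex": ["Male", "Male", "Female", "Female"],
    "income": ["<=50K", ">50K", "<=50K", ">50K"],
})

# Unique ID column to link the two tables
df["id"] = range(len(df))

# Context: the fixed attributes we want to condition on
df_context = df[["id", "sex", "income"]]
# Target: the ID plus everything we want synthesized
df_target = df[["id"] + [c for c in df.columns if c not in ("id", "sex", "income")]]

# Save both tables for upload to MOSTLY AI (file names are assumptions)
df_context.to_csv("census-context.csv", index=False)
df_target.to_csv("census-target.csv", index=False)
```

The only hard requirement is that both tables share the same unique key, so they can be rejoined after generation.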

[00:03:36] The next block of code is just there to make it easy for you to download these CSVs from Colab to local disk: first the census context file, and then the target. Once you have both of these downloaded, you can head to your MOSTLY AI account to start the synthetic data generation job.

[00:04:02] Create a new job and upload the context file as the main table. Proceed, and then add another table, which is the target file. Proceed. You'll notice in the legend up here that both of these are defined as subject tables.

[00:04:24] We need to change this so that the target becomes a linked table and the context remains a subject table, with the ID column being the key that relates them together. We'll go to Data settings, head to the census target table, and click on the ID column.

[00:04:39] For Generation method, let's select Foreign key and select the Context key type, with the parent table being census context and the key being the ID column. If you want to learn more about the different kinds of generation methods, I would recommend checking out our documentation. Save this, and then we can confirm that this is now a linked table. We can then launch the job, selecting a destination: in this case, downloading a CSV or Parquet.

[00:05:12] While the job is running, we can go ahead and create our seed context, which we need in the next step. As we mentioned earlier, the goal of this use case is to simulate what the dataset would look like without the gender income gap.

[00:05:26] In this next step, we create a data frame which will serve as the seed context, with an equal distribution between male and female records. (Remember, this is 1994, and we've only got two genders.) We replicate the income distribution across both, which simulates the dataset without the gender income gap. We'll save that to CSV and then download the data to disk.
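Building such a seed context can be sketched like this. It assumes we've already measured the income shares from the original data; the 0.76/0.24 split, the record count, and the file name are all illustrative choices, not values from the video:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000  # number of seed records to generate (arbitrary choice)
half = n // 2

# Illustrative income shares; in the notebook these would come from
# df["income"].value_counts(normalize=True) on the original data
income_levels, income_probs = ["<=50K", ">50K"], [0.76, 0.24]

seed = pd.DataFrame({
    "id": range(n),
    # exactly equal male/female split
    "sex": ["Male"] * half + ["Female"] * half,
    # both genders draw from the SAME income distribution, which
    # removes the gender income gap by construction
    "income": rng.choice(income_levels, size=n, p=income_probs),
})
seed.to_csv("census-seed.csv", index=False)
```

Because the income values are sampled independently of the sex column, any association between the two is gone in the seed, which is exactly the condition we want the generator to honor.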

[00:05:55] Then let's head back to our MOSTLY AI account. We see that the job is still in progress, so I'm going to wait for this to complete and then I'll be right back.

[00:06:04] We're back. The census context job has completed. We can now use this model to generate more data with our seed context.

[00:06:16] We'll select Generate More Data. You'll see that there are two generation method options here: one by quantity, which will just generate more data with the same distributions, or with a seed, which is what we'll select.

[00:06:29] This is where we'll upload our census seed file. We'll then launch this job.

[00:06:39] This will again probably take a couple of minutes to run, so I'll be back when this finishes.

[00:06:45] This job is also done. We can now download our synthetic data as CSV, save it to disk, and unzip it. We'll see two CSV files here: one for the target, which is the synthetic data that we're interested in, and another for the context, which is an exact replica of the seed context that we provided earlier.

[00:07:09] We'll take both of these CSV files and merge them into one in our final step. The next step is to upload these back into our Colab notebook. We'll upload the synthetic target here, and the synthetic seed here.

[00:07:31] Once both of those are done, we can merge these into our full partially synthetic dataset. We've now completed all of the steps, and we have our full dataset here, with our seed context determined by us and the synthetic variables generated by MOSTLY AI.
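The final merge is a plain join on the shared ID column. Here's a minimal sketch using toy stand-ins for the two downloaded CSVs (in the notebook these would come from `pd.read_csv` on the files MOSTLY AI produced; the column values are illustrative):

```python
import pandas as pd

# Toy stand-ins for the two CSVs downloaded from MOSTLY AI
syn_context = pd.DataFrame({"id": [0, 1, 2],
                            "sex": ["Male", "Female", "Female"],
                            "income": [">50K", ">50K", "<=50K"]})
syn_target = pd.DataFrame({"id": [0, 1, 2],
                           "age": [41, 37, 29],
                           "education": ["Bachelors", "Masters", "HS-grad"]})

# Join on the shared id: the seed columns stay exactly as we specified
# them, while the remaining columns are fully synthetic
df_synthetic = syn_context.merge(syn_target, on="id", how="inner")
```

An inner join on the key guarantees every row pairs its fixed seed attributes with exactly one set of synthesized attributes.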

[00:07:50] Let's take a look at the data that we've created. We can use this cell to randomly sample 10 records out of the dataset, and we should see about an even split between male and female in every random sample, which seems to be the case.

[00:08:06] Let's now take a look at how some of the other attributes respond to this change in the gender income gap. For example, let's look at the age distribution. Here we plot the original female age distribution in black and the new synthetic one in green,

[00:08:23] and we see a significant shift to the right. The female records in the synthesized dataset are significantly older than those in the original dataset, in order to satisfy the condition we provided of removing the gender income gap.
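This kind of comparison plot can be sketched as follows, using randomly generated stand-in ages rather than the real data (the shapes and colors mirror the notebook's plot, but all the numbers are illustrative):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Illustrative stand-ins: synthetic female ages shifted right of the originals
ages_original = rng.normal(36, 12, 5000).clip(17, 90)
ages_synthetic = rng.normal(42, 12, 5000).clip(17, 90)

fig, ax = plt.subplots()
ax.hist(ages_original, bins=40, density=True, histtype="step",
        color="black", label="original female ages")
ax.hist(ages_synthetic, bins=40, density=True, histtype="step",
        color="green", label="synthetic female ages")
ax.set_xlabel("age")
ax.legend()
fig.savefig("female_age_distribution.png")

# The rightward shift shows up as a higher mean age
shift = ages_synthetic.mean() - ages_original.mean()
```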

[00:08:36] Of course, there's a lot more we could explore here, but I'll leave that up to you. Let's head on to the second use case. In the second use case, we'll be working with the 2019 Airbnb listings for Manhattan.

[00:08:47] This is an interesting dataset to work with for conditional generation because it contains geographical data. It's crucial in this case for the geographical data to stay intact. We do not want to end up with synthetic Airbnb listings located in places where they simply couldn't exist. For example, in the middle of the Hudson River or on top of a hill in the middle of Central Park.

[00:09:09] What we'll do is split our dataset again, into just the locations and all of the other attributes. We'll keep the locations exactly the same, not mimicking or simulating any other distribution, and use them as the seed to generate partially synthetic data for all of the other columns.

[00:09:27] In many ways, this process will be the same as the first use case, so I'll work through this one a little bit quicker. In this cell, we access the data, and in the next, we concatenate the latitude and longitude into a single column in order to correspond with MOSTLY AI's geographical data processing method.
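The concatenation step might look like this. Note that the exact "latitude, longitude" string format expected by MOSTLY AI's lat/long encoding is an assumption here, as are the column names and the toy coordinates:

```python
import pandas as pd

# Toy stand-in for the Airbnb listings (coordinates are illustrative)
df = pd.DataFrame({
    "latitude": [40.7831, 40.7484],
    "longitude": [-73.9712, -73.9857],
    "price": [150, 225],
})

# Combine the two coordinate columns into a single "lat, long" string column
df["LAT_LONG"] = df["latitude"].astype(str) + ", " + df["longitude"].astype(str)
df = df.drop(columns=["latitude", "longitude"])
```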

[00:09:44] We will then split our dataset into the context and target and save those as separate CSVs. We'll download both of these to disk again. Once we have saved the files to disk, we can then head back to our MOSTLY AI account and launch the jobs exactly the way we did before. One important difference is that we have to specify latitude and longitude as the encoding type for the LAT_LONG column. Another thing you can do to save time here is to set the training goal to speed for the first training job, because we will be providing the exact locations in the next step. Once this job has been completed, we can generate more data, and we'll upload the Airbnb locations file as the seed.

[00:10:28] Note that here, unlike in the previous use case, we don't change anything about our seed. We want the locations to remain exactly as they are. We'll launch the job and wait for it to run, and I'll be right back.

[00:10:44] All right. When the generate with seed job is done, we'll download the data as CSV again, and unzip them as we did in the step before. Once we've downloaded those, we can go back to our notebook and use this cell to upload the synthesized data into our Colab notebook.

[00:11:03] The next cell takes this synthetic data and joins it back to the original df context, which contains the real geographic locations. We then split the LAT_LONG column back into two separate columns, and we restore the column order. We can then take a look at our synthetic data. We have the first three columns, which are exactly the same as the original dataset. These are the original geographical locations and neighborhoods of the listings, and the other columns have been completely synthesized.
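A sketch of the join-and-split step with toy stand-ins (the column names, key, and coordinate values are assumptions for illustration):

```python
import pandas as pd

# Toy stand-ins: the real context (kept locations) and the synthetic target
df_context = pd.DataFrame({
    "id": [0, 1],
    "LAT_LONG": ["40.7831, -73.9712", "40.7484, -73.9857"],
})
syn_target = pd.DataFrame({"id": [0, 1], "price": [140, 260]})

# Join the synthetic attributes back onto the real locations
df_out = df_context.merge(syn_target, on="id")

# Split LAT_LONG back into two numeric columns and drop the combined one
df_out[["latitude", "longitude"]] = (
    df_out["LAT_LONG"].str.split(", ", expand=True).astype(float)
)
df_out = df_out.drop(columns=["LAT_LONG"])
```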

[00:11:33] Next, let's dig a little bit deeper to compare this partially synthetic dataset that we've created against the original dataset. Specifically, we'll be looking at the price distribution across Manhattan, plotting each Airbnb listing in both datasets along with its price per night.

[00:11:50] We hope to see two things here. First, for both the original and the partially synthetic data, we hope to see a map roughly resembling that of Manhattan, since these are the real Airbnb listings and Manhattan is pretty saturated with them.

[00:12:05] Secondly, we hope to see that the statistical distribution of the price per night is similar in both datasets. Remember that we've kept the locations the same across both datasets, but that the price column is one of the columns that has been synthesized.

[00:12:20] We shouldn't see the exact same distribution, because that would mean we've just replicated the original data, but rather a statistically representative distribution. Areas that have lots of expensive apartments in the original dataset should have lots of expensive apartments in the partially synthetic dataset.

[00:12:37] Let's take a look. We'll plot both datasets with this cell, and we can see that in the original dataset, we have this gradient of cheaper listings up here, getting more expensive as we move south, with some very expensive clusters around Central Park, the Financial District, and some other areas.

[00:12:56] If we look at the partially synthetic data, we see this distribution very neatly replicated. We first of all have exactly the same points, because we're using the same listings. We don't have anything showing up in the middle of Central Park or in the Hudson River, which is what could have happened if we had synthesized the locations.
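A side-by-side price map of this sort can be sketched as below, with random stand-in coordinates and log-normal prices instead of the real listings (everything here is illustrative; the real plot uses the actual lat/long and price columns):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 1000
# Illustrative coordinates roughly spanning Manhattan's bounding box
lat = rng.uniform(40.70, 40.88, n)
lon = rng.uniform(-74.02, -73.93, n)
# Same locations in both panels; prices drawn from the same (log-normal)
# distribution but not identical values, mimicking conditional generation
price_original = np.exp(rng.normal(4.5, 0.6, n))
price_synthetic = np.exp(rng.normal(4.5, 0.6, n))

fig, axes = plt.subplots(1, 2, figsize=(10, 5), sharex=True, sharey=True)
for ax, price, title in [(axes[0], price_original, "original"),
                         (axes[1], price_synthetic, "partially synthetic")]:
    sc = ax.scatter(lon, lat, c=np.log(price), s=4, cmap="viridis")
    ax.set_title(f"{title}: log price per night")
fig.colorbar(sc, ax=axes, shrink=0.8)
fig.savefig("price_maps.png")
```

Coloring by log price rather than raw price keeps the heavy right tail of nightly prices from washing out the color scale.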

[00:13:13] We also see a similar distribution of the price per night.

[00:13:16] We see the same gradient of lower prices up here, as well as the expensive clusters in similar areas.

[00:13:23] This is great because it means we've successfully synthesized the price column within the context of the locations that we've given it.

[00:13:30] This is a successful example of conditional generation.

[00:13:34] Well done. With that, we wrap up this tutorial on conditional generation. There's of course, a lot more you could explore here. You could, for example, try synthesizing the entire Airbnb dataset, including the latitude and longitude columns, and seeing how that would differ here in this plot,

[00:13:49] or you could try to use a different set of fixed columns for the US Census dataset to perform a different data simulation exercise.

[00:13:56] If you do end up exploring this topic further or working through this notebook and running into some questions or things you'd like to share with us, please do reach out.

[00:14:03] We'd love to hear from you. Thank you so much for your time and see you in the next tutorial.
