🚀 Launching Synthetic Text to Unlock High-Value Proprietary Text Data
October 16, 2023
13m 15s

Synthetic Text Generation Tutorial

Welcome to this comprehensive tutorial on synthetic text generation using MOSTLY AI's synthetic data generation platform. In this tutorial, we'll guide you through the process of creating high-quality synthetic text data and ensuring its statistical representativeness.

Here is what we cover:

1. Introduction to Synthetic Text Generation
2. The Power of Synthetic Text Data
3. Getting Started with MOSTLY AI's Synthetic Data Platform
4. Data Settings and Generation Methods
5. Generating Synthetic Text Data
6. Evaluating Statistical Representativeness
7. Correlation and Data Quality
8. Advanced Techniques and Tips

Github repo with notebook and data ➡️ https://tinyurl.com/3hh473b5

Read more about synthetic text generation on our blog ➡️ https://tinyurl.com/24j8zau7

Register for free on MOSTLY AI's synthetic data platform ➡️ https://bit.ly/44jGBPr

Transcript

[00:00:00] Hi, and welcome to this tutorial on synthetic text generation. In this tutorial, you'll learn how to use MOSTLY AI to synthesize free-text columns. In the tutorials that you may have done so far, we have looked mostly at structured tabular data, but in this tutorial, we'll see that MOSTLY AI can also be used to synthesize free-text data.

[00:00:24] You will learn how to use the MOSTLY AI platform to synthesize text columns and also how to assess the quality of the synthetic text that has been generated. Before you jump in, I would recommend checking out this blog post on how to scale up your text annotation with synthetic text, which looks at how synthetic text can be used in situations where you're working with free text but privacy is a big concern, for example, when you're working with voice assistant data. Once you're ready to go, let's jump in.

[00:00:56] In this tutorial, we'll be working with a dataset containing Airbnb data for the city of London. The dataset contains listings with descriptions, host names, the neighborhood, the price per night, and other data. You can access the data by clicking here and then using Command-S or Ctrl-S, depending on your operating system, to save the file to disk.

[00:01:20] Next, we'll take this original dataset and synthesize it using MOSTLY AI. Go ahead and navigate to your MOSTLY AI account, and start a new synthetic data job and upload the Airbnb dataset here. Click Proceed. Before launching this job, it's important to go to data settings and set the right generation method for the text columns.

[00:01:42] The title column contains the listing descriptions. We want these to be processed as text and we'll do the same for the host name column. Once you've set both of those, you can go ahead and launch the job. Now generating synthetic text is quite a compute-heavy process, so this job will probably take around 30 to 45 minutes to complete.

[00:02:07] I'll let this run and I'll be back when it's completed.

[00:02:12] All right. We're back.

[00:02:13] Once this job is done, you can download the synthetic data here as CSV and save it to disk.

[00:02:22] Let's then return to the notebook.

[00:02:24] Since we're working in Colab here, we'll need to upload the synthetic dataset so that we can access it from within the notebook.

[00:02:31] We'll do that in this cell. That's it.
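
For reference, here's a minimal sketch of what that upload-and-load cell might look like (the file name is illustrative, not necessarily the one you downloaded):

    import pandas as pd
    from google.colab import files  # only available inside Google Colab

    uploaded = files.upload()  # opens a file picker in the Colab UI
    syn = pd.read_csv('synthetic-airbnb-london.csv')  # hypothetical file name
    syn.head()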

[00:02:35] Really, that's all it takes to synthesize text data with MOSTLY AI. You can now access this dataset within your notebook and explore it.

[00:02:43] Of course, you shouldn't just take our word that this is high-quality synthetic text data. Let's be a little more critical and poke around the dataset to investigate the data quality here.

[00:02:53] Specifically, we're interested in how statistically representative the synthetic text data is compared to the original text data.

[00:03:01] In order to evaluate this, we'll look at a few different variables. We'll start by looking at the set of characters that are used in both datasets.

[00:03:10] The cell prints a set of characters that appear in the original and in the synthetic datasets. We see a nice overlap for the most common characters.
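
As an illustration, a comparison along these lines could be computed as follows, assuming the original and synthetic data are loaded as pandas DataFrames tgt and syn, with the listing description in a 'title' column (these names are illustrative, not necessarily the notebook's exact code):

    tgt_chars = set(''.join(tgt['title'].dropna()))
    syn_chars = set(''.join(syn['title'].dropna()))

    print('shared characters:', sorted(tgt_chars & syn_chars))
    print('only in original: ', sorted(tgt_chars - syn_chars))
    print('only in synthetic:', sorted(syn_chars - tgt_chars))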

[00:03:18] For less common characters, if we scroll this way, we see that the set of characters in the original dataset is actually much larger than in the synthetic dataset.

[00:03:29] This is actually a privacy mechanism within MOSTLY AI, which makes sure that very rare tokens are removed from the dataset so that they do not reveal the presence of individual records in the original dataset.

[00:03:40] Let's dig a little deeper by looking at character frequency.

[00:03:44] This cell gives us a table with the most common characters and how often they appear in both datasets.
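
A sketch of how such a frequency table could be built, under the same illustrative tgt/syn assumptions as above:

    from collections import Counter
    import pandas as pd

    def char_freq(series):
        # percentage share of each character across all titles
        counts = Counter(''.join(series.dropna()))
        total = sum(counts.values())
        return pd.Series({c: 100 * n / total for c, n in counts.items()})

    freq = pd.DataFrame({'target %': char_freq(tgt['title']),
                         'synthetic %': char_freq(syn['title'])})
    print(freq.sort_values('target %', ascending=False).head(10).round(1))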

[00:03:53] We see that the white space character appears 13.4% of the time in the target dataset and also in the synthetic dataset.

[00:04:02] The character 'o' appears 7.6% of the time in the target dataset and 7.6% in the synthetic dataset.

[00:04:09] This continues on. What is really good to see here is the retention of the statistical distribution of the characters from the target dataset in the synthetic dataset.

[00:04:21] While the order might shift around, or we might be forming new words, we have the same proportion of o's in both datasets, and of a's and r's and so on.

[00:04:30] Let's then plot the distribution of all the character frequencies. We see this perfect match between the target and the synthetic dataset.

[00:04:37] This is a very good first indication that the synthetic text that we've generated is statistically representative, at least in terms of character frequencies.
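
A plot along these lines could be produced from the freq table sketched above; points hugging the diagonal mean the two distributions agree:

    import matplotlib.pyplot as plt

    lim = freq.max().max()
    plt.scatter(freq['target %'], freq['synthetic %'])
    plt.plot([0, lim], [0, lim], 'k--', linewidth=1)  # diagonal = perfect match
    plt.xlabel('character frequency in target (%)')
    plt.ylabel('character frequency in synthetic (%)')
    plt.show()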

[00:04:46] Let's now look at the term frequencies, so not the individual characters but the terms, the words, if you will.

[00:04:52] We'll create a similar kind of table, in this case looking at the 10 most common and the 10 least common terms.

[00:05:00] Again, we see this nice match between the percentage of times terms occur in the target dataset and in the synthetic dataset. We see this both for common terms and for uncommon terms such as 'twin' or 'leafy', which appears 0.067% of the time in both datasets.
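
One way to sketch such a term-frequency table, again assuming tgt/syn DataFrames with a 'title' column (this uses a naive whitespace split rather than the notebook's exact tokenization):

    def term_freq(series):
        # split titles into lowercase terms and compute percentage shares
        terms = series.dropna().str.lower().str.split().explode()
        return terms.value_counts(normalize=True) * 100

    tf = pd.DataFrame({'target %': term_freq(tgt['title']),
                       'synthetic %': term_freq(syn['title'])}).dropna()
    tf = tf.sort_values('target %', ascending=False)
    print(tf.head(10).round(3))  # 10 most common shared terms
    print(tf.tail(10).round(3))  # 10 least common shared terms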

[00:05:24] We can then also plot the distribution of term frequencies. Again, we see a very nice match between the synthetic and the target dataset, again, confirming the statistical quality of the synthetic free text that we've created.

[00:05:40] Let's go even further and look at term co-occurrence. Here we're looking at how often two words occur together in the original listings and compare that to the percentage of times they occurred together in the synthetic listings.

[00:05:53] We'll look at the pairs 'bed' and 'double',

[00:05:56] 'bed' and 'king', 'heart' and 'London', and 'London' and 'heart'.
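
A sketch of how these co-occurrence percentages could be computed, under the same illustrative tgt/syn/'title' assumptions:

    def cooccurrence(df, w1, w2):
        # share of listings containing w1 that also contain w2
        titles = df['title'].dropna().str.lower()
        with_w1 = titles[titles.str.contains(w1)]
        return 100 * with_w1.str.contains(w2).mean()

    for w1, w2 in [('bed', 'double'), ('bed', 'king'),
                   ('heart', 'london'), ('london', 'heart')]:
        print(f"{w1} -> {w2}: target {cooccurrence(tgt, w1, w2):.0f}%, "
              f"synthetic {cooccurrence(syn, w1, w2):.0f}%")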

[00:06:02] We see that 14% of actual listings that contain the word 'bed' also contain the word 'double',

[00:06:08] and this is 13% in the synthetic listings, which is a nice match. 7% of actual listings that contain 'bed' also contain 'king'; this is 6% for the synthetic listings. For the remaining pairs, we see 28% versus 27% and 4% versus 4%, which is yet another indication

[00:06:26] that we're preserving the statistical distributions, not just of characters, not just of terms, but also of the terms that appear together. Now, you might be asking yourself, if all of these characteristics are so perfectly maintained, then what are the chances

[00:06:39] that we won't just end up with an exact match, a synthetic record which has the same description value as a record in the original dataset, or a synthetic record with the exact same values for all of the features?

[00:06:51] Well, let's start by trying to find an exact match for one specific synthetic title value. We will scroll up here and just get a description from the synthetic dataset,

[00:07:07] and copy that down here, and search for an exact match in the original dataset. We see that we find, in this case, actually a partial match. There's a record in the original dataset with the description 'a light and airy large double room', which partially matches the one we copied from our synthetic record, 'airy large double room'.
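
Such a spot check could look like this; the substring search mirrors the partial match above, and the names remain illustrative:

    # pick one synthetic description and look for it inside the original titles
    sample = syn['title'].dropna().sample(1).iloc[0]
    hits = tgt[tgt['title'].str.contains(sample, case=False, regex=False, na=False)]
    print(sample)
    print(hits[['title']])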

[00:07:32] Now, of course, this validation process doesn't scale very well. We can't just look at all 71,000 records like this. Let's perform a more rigorous check for privacy by looking

[00:07:44] for exact matches between the synthetic and the original datasets. Before we do that, we are going to split the original dataset into two equal subsets and measure the number of matches between those subsets, because what's actually important here

[00:07:58] is not so much whether we have exact matches between the synthetic and the original dataset, but whether the number of those matches is higher or lower than the number of matches within the original dataset.

[00:08:13] We see here that there are 323 instances of duplicate descriptions. There are multiple listings with the title 'cozy double room', and so on. Now we'll do the same with the synthetic dataset and see how many matches there are between the synthetic dataset and the original data. There are only 215, so there are fewer exact matches for the title field between the synthetic and the original dataset than there are within the original dataset itself.
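
A minimal sketch of this comparison; a real check would shuffle the rows before splitting, but the idea is the same:

    # overlap within the original data: split it into two halves and intersect
    half = len(tgt) // 2
    within = set(tgt['title'].iloc[:half].dropna()) & set(tgt['title'].iloc[half:].dropna())

    # overlap between the synthetic and the original titles
    between = set(syn['title'].dropna()) & set(tgt['title'].dropna())

    print('matches within the original data:   ', len(within))
    print('matches synthetic vs. original data:', len(between))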

[00:08:47] While exact matches between the synthetic data and the original data can occur, we see that they occur only for the most commonly used descriptions, and the original dataset already contains exact matches of its own. What's most important here is that exact matches in the synthetic dataset don't occur more often than they do in the original dataset. It's important to note, therefore, that the mere presence of an exact match between the synthetic and the original dataset is not in itself a sign of a privacy leak.

[00:09:17] There would be reason to worry only if the number of exact matches in the synthetic dataset exceeded the number within the original dataset itself. If you're interested in diving deeper into this, I would recommend checking out our peer-reviewed research article and/or blog post, both linked here in the notebook, for more information.

[00:09:36] All right, so now we've taken a rigorous look at the statistical representativeness of the text column within itself. We've looked at the character frequency, the term frequency, the term co-occurrence, and also the number of exact matches. What's also important

[00:09:51] is how this synthetic text column relates to the other features in the dataset. In the original dataset, we would expect some correlation between the descriptions and the price column: for listings whose description contains the word 'luxury', for example, we'd expect the price to be higher.

[00:10:08] It's important that these correlations are maintained in the synthetic dataset. Let's check exactly that: we're going to look at the correlation between price and text in the original and the synthetic dataset.

[00:10:22] Specifically, we'll be looking at the median price of actual listings that contain certain terms like 'luxury', 'stylish', 'cozy', and 'small', and compare that to the median price of synthetic listings that contain the same words. We hope to find a close match.

[00:10:39] We see here that the median price of actual listings that contain the word 'luxury' is $180. For the synthetic listings that contain the same word, it's $174, which is close enough. For 'stylish', it's $134 and $133; for 'cozy', $70 and $75; and for 'small', $55 and $55, which is an exact match.
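
A sketch of this median-price comparison, assuming a numeric 'price' column alongside the illustrative 'title' column:

    def median_price(df, term):
        mask = df['title'].str.contains(term, case=False, na=False)
        return df.loc[mask, 'price'].median()

    for term in ['luxury', 'stylish', 'cozy', 'small']:
        print(f"{term}: target ${median_price(tgt, term):.0f}, "
              f"synthetic ${median_price(syn, term):.0f}")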

[00:11:01] We see here that MOSTLY AI's framework not only generates synthetic text that is statistically representative within the text column itself, but also retains the statistical properties between different columns in the dataset, which is crucial for generating high-quality synthetic data.

[00:11:19] With that, we come to the end of this tutorial. In this tutorial, you've learned how to generate synthetic text with MOSTLY AI. We generated the synthetic data and then looked closely at the statistical quality of this synthetic text.

[00:11:33] We did so by looking at character and term frequencies. We looked at term co-occurrence and also at the number of exact matches, all to determine the statistical representativeness of the text column itself.

[00:11:44] At the end, we also looked at how the synthetic text column correlates to other features in the dataset to make sure that the statistical properties between columns are also preserved.

[00:11:55] Of course, there's a lot more you could do here. You could, for example, look at correlations between other features, such as the host name or the neighborhood.

[00:12:02] Another interesting thing to try is to set a different generation mood when you launch your job in MOSTLY AI. When you set the encoding type for your text columns, you also have an option here to set the generation mood.

[00:12:17] You can read more about that here in the interface or in our documentation, but there's a whole range to try out, from very boring and conservative, through representative, all the way to creative and extremely crazy, and you can see how this affects the type of text data that's generated.

[00:12:35] Finally, you could, of course, try working with a different dataset to see how that affects the synthetic text that's being created. If you do work through this notebook and run into any questions or things that you're very excited to share, please do not hesitate to reach out to us. We'd love to hear from you, and thank you so much for watching. See you in the next one.

Ready to start?

Sign up for free or contact our sales team to schedule a demo.