July 24, 2023
7m 1s

Tips and Tricks for Synthetic Data Generation - Names

MOSTLY AI's synthetic data generator can synthesize names in a number of ways. Depending on the use case, you can choose from different encoding types, such as Categorical, Mock data, or Text. Each encoding type has its advantages and disadvantages. Check out the video to learn which encoding type you should choose for generating names for your synthetic dataset. Bonus tips included for encoding date-time formats! You can use MOSTLY AI's synthetic data generator to create your own synthetic data free of charge at mostly.ai


[00:00:00] Hello. In today's video, I want to talk about the different options you have for generating names and why it is quite important to think carefully about what you want to do when you synthesize data.

[00:00:13] Let's start by taking a quick look here at the dataset that we have. We have a dataset with 19,000 baseball players. These baseball players have names, first names, and last names, and they have a country that they come from. They throw with a certain hand, they have a certain height and weight, and they were born at a certain date.

[00:00:38] Now, when it comes to the names, we have three choices. I'm going to just start a new job here to show you how that might work. We go to jobs, we say create synthetic data from a catalog, because I've already created a catalog for the players dataset before.

[00:01:02] Then we can take a look at the settings here specifically for the name. You have AI generation as categorical. That would make sense, and I will show you what that looks like. AI generation as text, or you could also, as a third choice use mock data, and then say this is a person and you want a first name or a last name or a full name.

[00:01:28] Let's take a look. I've run all these jobs already, so let's look at the output here. The last one I ran was this one, and in the synthetic data, we see that this one was generated as text.

[00:01:48] Then in this one here, it was generated as category. If we look in the synthetic data, we see as names, some that appear in the original dataset relatively frequently such as Jeff, Rich, and Jonathan.

[00:02:12] Then we also see this word _RARE_, which means that rare category protection has activated and prevented us mostly from revealing a particularly rare name, and therefore, infringing on the privacy of a person with that name.

[00:02:32] If, hypothetically, you have a person with a very rare name and they see the synthetic dataset, they might argue, "Hey, I know that I was in the dataset that you used for training because I have a very rare name."

[00:02:46] Roughly speaking, that's how the argument goes, and therefore, we protect rare categories by default to ensure the privacy of the synthetic subjects. We don't use those rare categories.
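To make the idea of rare category protection concrete, here is a minimal Python sketch. It is not MOSTLY AI's implementation: the function name `protect_rare_categories` and the threshold `min_count=5` are assumptions chosen for illustration, and the video does not say which rule the product actually applies.

```python
from collections import Counter

RARE_TOKEN = "_RARE_"  # the placeholder shown in the synthetic data

def protect_rare_categories(values, min_count=5):
    """Replace every value seen fewer than min_count times with RARE_TOKEN.

    min_count is a hypothetical threshold, not the product's setting.
    """
    counts = Counter(values)
    return [v if counts[v] >= min_count else RARE_TOKEN for v in values]

# Toy column: three frequent first names and one that occurs only once.
first_names = ["Jeff"] * 6 + ["Rich"] * 5 + ["Jonathan"] * 7 + ["Zebulon"]
protected = protect_rare_categories(first_names)
```

In this toy column, the one-off name is the only value replaced by the token, while the frequent names pass through unchanged.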

[00:03:01] Specifically, for names that could be-- There are many, many different names in the world. If you try to use categories for names, then you will encounter that problem very often.

[00:03:13] There are some names that are quite popular, and therefore, in the QA report here, if you look at the distribution of the first names, you will find that it will show some of the most popular names in this dataset. Bill, Bob, Ed, Frank, and so on.

[00:03:27] Even for the last names, that is true. You will see that Anderson, Brown, Davis, Johnson, Jones are very popular names that appear relatively frequently in the dataset, and therefore, can be used in the synthetic dataset to generate names.

[00:03:45] Now, does it really make sense? I don't think so. I think if you have names, this way of encoding doesn't really make sense, because what are you going to do with that name especially if many of them are tainted with that rare token?

[00:04:05] Could you use text generation? Well, if you're really keen on every subject having a name, and having a unique name or a statistically representative name, then you could use text generation. That would use the frequency of characters and the frequency of words to come up with something that is plausible for the synthetic data.

[00:04:33] Then you get names like this, Lefty Greteck, Joe Brierop, Luis Perez.

[00:04:38] Luis Perez looks very, very plausible, Jose Bartner, maybe as well. Some of them look a little bit less perhaps plausible, but, hey, at least you get a name for every person.
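As a rough illustration of how character frequencies can produce plausible but invented names, here is a toy character-level Markov chain in Python. This is a stand-in technique, not the model the product uses; the training names are made up from first and last names mentioned in this video, and `train_char_model` and `generate_name` are hypothetical helpers.

```python
import random
from collections import defaultdict

START = "^"  # padding symbol before the first character
END = "$"    # symbol that marks the end of a name

def train_char_model(names, order=2):
    """Record which character follows each context of `order` characters."""
    model = defaultdict(list)
    for name in names:
        padded = START * order + name + END
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model

def generate_name(model, rng, order=2, max_len=30):
    """Sample one character at a time until END or max_len is reached."""
    context = START * order
    out = []
    while len(out) < max_len:
        nxt = rng.choice(model[context])
        if nxt == END:
            break
        out.append(nxt)
        context = context[1:] + nxt
    return "".join(out)

# Made-up training names, for illustration only.
training_names = [
    "Jeff Anderson", "Rich Brown", "Jonathan Davis",
    "Bill Johnson", "Bob Jones", "Frank Anderson",
]
model = train_char_model(training_names)
rng = random.Random(0)  # fixed seed so runs are repeatable
samples = [generate_name(model, rng) for _ in range(5)]
```

Because each character is sampled from what followed the same two-character context in the training names, the output mixes fragments of real names into new ones, which is why some results look plausible and others less so.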

[00:04:54] Keep in mind that text generation is relatively expensive, so it's going to take a lot more processing power, and therefore, time to generate these if you really want them. The third option is the mock data.

[00:05:10] Here is one that was generated with mock data. Here, the names don't pretend to follow any statistics whatsoever, but again, you get a name for each and every row. In this case, I used the mock name only for the first name.

[00:05:32] What you get is things like Jeremy, Carla, Lisa, Crystal, et cetera, which for baseball players looks a little bit odd, but hey, you get a name. That's certainly the cheapest and the fastest option if you, for some reason, need to have a name in that field.
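The key property of mock data is that it never looks at the original column. A minimal Python sketch, assuming a small hard-coded name list (the list and the helper `mock_first_names` are hypothetical, not the product's mock generator):

```python
import random

# Hypothetical pool of mock names; it is independent of the original data.
MOCK_FIRST_NAMES = ["Jeremy", "Carla", "Lisa", "Crystal"]

def mock_first_names(n_rows, rng):
    """Return one randomly chosen mock first name per row."""
    return [rng.choice(MOCK_FIRST_NAMES) for _ in range(n_rows)]

rng = random.Random(42)  # fixed seed so runs are repeatable
mock_column = mock_first_names(10, rng)
```

No model is trained here, which is why this option is the cheapest and fastest, and also why the names need not resemble those in the original dataset.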

[00:05:50] As a bonus, I can also talk for a second about the date. Again, it's important that if you have a date, that you configure it as date-time rather than categorical because especially birth dates are quite unique. If you use category for such a date, then many of the birth dates will be flagged as _RARE_ and you will not benefit from interesting distributions.

[00:06:20] Whereas here, you can see it was categorized as a date-time value. Then if you look at the unary distribution for the birth date, you actually see a very nice distribution. You see that the bins have been filled and the bins are quite reasonable with about 19 years or so in between the different bin sizes.
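The difference between the two encodings can be sketched in Python with made-up birth dates. The rare threshold of 5 and the 20-year bin width are assumptions for illustration; the product chooses its own bins.

```python
from collections import Counter
from datetime import date

# 200 made-up birth dates between 1880 and 1979, all distinct.
birth_dates = [
    date(1880 + (i * 7) % 100, 1 + i % 12, 1 + i % 28) for i in range(200)
]

# Categorical view: each exact date is its own category, so with a
# hypothetical rare threshold of 5, every row would be flagged as rare.
category_counts = Counter(birth_dates)
rare_as_category = sum(1 for d in birth_dates if category_counts[d] < 5)

# Date-time view: group the dates into 20-year bins (an assumed width).
def bin_start(d, first_year=1880, width=20):
    """Return the first year of the bin that contains date d."""
    return first_year + ((d.year - first_year) // width) * width

bins = Counter(bin_start(d) for d in birth_dates)
```

Treated as categories, all 200 dates fall below the threshold, whereas the binned view yields five well-filled bins that describe the distribution.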

[00:06:52] That was it. Important to think about what you're going to do with the data, and therefore, which encoding you choose.
