August 8, 2023
5m 2s

Augmenting synthetic banking data with the help of ChatGPT - Part 2

In this video, we'll be continuing our discussion of how to generate and use synthetic data. We will supercharge our synthetic banking data by augmenting it with the help of ChatGPT. In the first video, we introduced you to synthetic data generation and showed you how to explore your resulting synthetic data in Tableau.

Tune in to learn more about this powerful tool and how to use it to your advantage!

Explore your own datasets in privacy-safe ways - simply go to mostly.ai and register for your free forever account!

Transcript

[00:00:00] Hi, folks. In the previous video, we generated some synthetic banking data and began to explore the data both on platform and off platform to examine how representative it was of the original production data.

[00:00:13] The conclusion we drew was that you could tell the same story with the privacy-safe synthetic data as you could tell with the original data.

[00:00:19] In this video, we're going to take a look at a second use case: we're going to take that same synthetic banking data and upload it to ChatGPT's Code Interpreter to do some exploration and analysis, and then ultimately use it to help inform our next synthetization on platform.

[00:00:36] Let's get started.

[00:00:38] As you might be aware, in ChatGPT Plus there is a beta version of Code Interpreter, which allows you to upload files. As you can see here, I have my bank marketing synthetic data saved.

[00:00:51] I'm going to give it a few prompts. The first one is a little bit of information about the dataset itself from the UCI website. Then, I'm going to ask it to tell me about the data, create some basic graphs and visualizations, and then tell me the three key variables impacting variable y, which, again, is whether the campaign was a success or not.

[00:01:13] Let's kick that off; it'll take a couple of seconds, so let me pause here. ChatGPT has just finished doing its magic.

[00:01:22] As you can see on the screen, it has given me a profile of the dataset, all the columns, the variables. It's created some basic graphs and visualizations, and it's also extracted the three key variables impacting my classification goal.
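
We don't get to see the exact code Code Interpreter writes when prompted with something along the lines of "tell me the three key variables impacting y", but a minimal sketch of that kind of analysis in Python might look like the following. The file name, the "yes"/"no" encoding of y, and the choice of a random-forest importance ranking are all assumptions on my part, not necessarily what ChatGPT did:

```python
# A minimal sketch of the kind of analysis Code Interpreter tends to run.
# Assumptions: the synthetic export is "bank_marketing_synthetic.csv" with
# the UCI bank-marketing columns, and the target "y" holds "yes"/"no".
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("bank_marketing_synthetic.csv")

# One-hot encode categorical features so the classifier can consume them.
X = pd.get_dummies(df.drop(columns=["y"]))
y = df["y"].map({"yes": 1, "no": 0})

# Fit a quick model and rank features by importance.
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)

# Sum dummy-level importances back to their source columns (this relies on
# the original UCI column names containing no underscores).
by_column = importances.groupby(lambda name: name.split("_")[0]).sum()
print(by_column.sort_values(ascending=False).head(3))
```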

[00:01:38] I think this is just beautiful, because when we talk about data democratization, it's one thing getting high-quality data in the hands of all data consumers, but once you get that data in their hands, you also have to make sure that they can understand the data,

[00:01:51] that they can analyze the data, and ultimately use the data.

[00:01:55] That interplay between functionality like Code Interpreter and synthetic data is just beautiful for data democratization: it enables you to upload privacy-safe synthetic data in a way that you never would with your privacy-sensitive production data.

[00:02:11] Now that we have that basic analysis done, let me show you how I'm going to use Code Interpreter to validate my next synthetization on platform.

[00:02:23] I want to upsample high-duration calls in my dataset, because Code Interpreter has told me duration is one of the key variables. To do so, I want to understand what the average length of those successful calls is.

[00:02:36] As you can see on the screen, it's 540 seconds. Let me switch back to the platform and upload this dataset, which is bank marketing-modified.
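
For reference, reproducing that 540-second figure is a one-liner in pandas, assuming the same file and the UCI column names:

```python
# Average call duration among successful outcomes. The file name is an
# assumption; column names follow the UCI bank-marketing schema.
import pandas as pd

df = pd.read_csv("bank_marketing_synthetic.csv")
avg = df.loc[df["y"] == "yes", "duration"].mean()
print(f"Average duration of successful calls: {avg:.0f} seconds")
```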

[00:02:46] Based on the information ChatGPT gave me, I created another categorical column called Duration High. What I'm going to do now, using the MOSTLY AI platform, is upsample the number of instances where call durations reached 540 seconds or above, so that we can examine how that impacts our classification goal y.
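
Deriving that extra column before re-uploading is equally simple. This sketch assumes the 540-second threshold from the analysis above; the file names and the snake_case column identifier are illustrative:

```python
# Flag calls of 540 seconds or longer, then save the modified dataset for
# upload to the platform. File and column names are assumptions.
import pandas as pd

df = pd.read_csv("bank_marketing_synthetic.csv")
df["duration_high"] = df["duration"] >= 540
df.to_csv("bank_marketing_modified.csv", index=False)
```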

[00:03:10] What you can do here on the MOSTLY AI platform is pretty cool. You can rebalance a column, and I've chosen, obviously, Duration High. In the dataset, that is indicated by True, i.e., if a call is 540 seconds or longer, it is marked as true. I'm going to upsample that to 50%, then choose my destination and launch the job.
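
To be clear about what's happening under the hood: the platform rebalances by conditioning its trained generative model, not by duplicating rows. Purely to illustrate what "upsample to 50%" means, here is a naive pandas approximation (file name assumed, and explicitly not how the platform works):

```python
# Naive illustration ONLY of what "upsample duration_high to 50%" means.
# The MOSTLY AI platform does NOT duplicate rows like this; it generates
# new, statistically consistent records from its trained model.
import pandas as pd

df = pd.read_csv("bank_marketing_modified.csv")
n = len(df)
high = df[df["duration_high"]].sample(n // 2, replace=True, random_state=42)
low = df[~df["duration_high"]].sample(n - n // 2, replace=True, random_state=42)
rebalanced = pd.concat([high, low]).sample(frac=1, random_state=42)
print(rebalanced["duration_high"].mean())  # ~0.5
```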

[00:03:38] Just like in the previous video, let me go into a previously run bank marketing dataset and through to the QA report. An important distinction here: we have our Model QA report, which shows how closely the model has learned the original data, and the Data QA report, which, if you've augmented or diversified the data, shows how the synthetic data differs from the original data.

[00:04:00] As you can see here in the Duration High, we now have approximately 50% of instances of high-duration calls in the synthetic data compared to roughly 11% in the original data.
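
You can sanity-check those proportions yourself once you download the generated dataset. The output file name here is an assumption:

```python
# Compare the share of high-duration calls before (~11%) and after (~50%)
# rebalancing. "bank_marketing_rebalanced.csv" is an assumed output name.
import pandas as pd

original = pd.read_csv("bank_marketing_modified.csv")
synthetic = pd.read_csv("bank_marketing_rebalanced.csv")

print(original["duration_high"].value_counts(normalize=True))
print(synthetic["duration_high"].value_counts(normalize=True))
```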

[00:04:13] If I close this, I can see my key variable y, whether the campaign was a success or not. You can see how having more high-duration calls has positively impacted the number of successes from my direct marketing campaign. I think that's really, really cool.
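
A quick crosstab shows the same effect off platform, again with the assumed output file name:

```python
# Success rate by call-duration bucket, plus the overall success rate.
# "bank_marketing_rebalanced.csv" is again an assumed output name.
import pandas as pd

synthetic = pd.read_csv("bank_marketing_rebalanced.csv")
print(pd.crosstab(synthetic["duration_high"], synthetic["y"], normalize="index"))
print("Overall success rate:", (synthetic["y"] == "yes").mean())
```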

[00:04:33] Just to conclude on this point: in the original video, we showed how you can use privacy-safe synthetic data to tell the same story as the original production data.

[00:04:43] In this second video, what we've done is shown the pretty cool interplay you can have with Code Interpreter and then how you can augment synthetic data to begin telling a different story from the original data. That's just one of the reasons why we think synthetic data is better than real data. Thanks for watching.

Ready to start?

Sign up for free or contact our sales team to schedule a demo.