August 8, 2023
4m 46s

Creating synthetic banking data and consuming in Tableau - Part 1

In this video, we'll generate synthetic data from a banking dataset and upload it to Tableau to explore how closely it matches the original.

You can also try synthetic data generation to anonymize and augment your data at mostly.ai.

Transcript

[00:00:00] Hi, folks. In today's video, we're going to create some synthetic banking data. We're going to look at the QA report on platform to evaluate the accuracy, and then we're going to take that synthetic data off platform and begin consuming it in a couple of use cases to further verify that it is, in fact, representative of the original production data. Let's get started.

[00:00:20] On screen now, you can see the homepage of the MOSTLY AI platform. We're going to run a job with the bank marketing data set, which is embedded in the home screen. Just to give you a bit of information, that is the popular UCI data set, and it's related to a direct marketing campaign from a Portuguese banking institution. It's got 45,000 instances and 17 attributes, so really a nice data set with all of the attributes that you would expect to see in a banking data set.

[00:00:46] It's got the housing status, the loan status, and it's also got some information about the direct marketing campaign itself. Let's go back on platform and start the job. This will bring us through here to loads of flexibility, loads of optionality in our data settings, training settings, or relationships. For the purpose of this video, we're going to leave it all at default and just go through to the output settings and choose our data destination.

[00:01:11] Let's kick off the job. It's as simple as that to kick off your first synthetic data job. Now let's go into a previously completed bank marketing dataset. What you can see on screen now is immediately a preview of that synthetic data. You can see here all of the synthetic instances, the columns and the rows, matching the original, but let's dig into the QA reporting UI just to understand how representative it is of the original data.

[00:01:36] As you can see here, the training samples and the synthetic samples match up,

[00:01:40] and there's the same number of columns. The total accuracy achieved in this run was 98.2%, so the synthetic data is 98.2% representative of the original data.

[00:01:50] We have our privacy metrics on screen as well. If we go down to the data QA report, we can also begin to examine the univariate distributions, and you can see lots of tight lines between the synthetic and the original data.
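For readers who want to run a similar univariate check off platform, here is a minimal sketch in Python. It compares the category shares of one column in the original and synthetic data using total variation distance. This is an illustrative choice and is not claimed to be the metric the QA report uses. The function names and the toy "housing" values are hypothetical.

```python
from collections import Counter


def category_shares(values):
    """Return the share of rows falling into each category."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(original_values, synthetic_values):
    """Half the sum of absolute share differences: 0 means identical shares."""
    p = category_shares(original_values)
    q = category_shares(synthetic_values)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Toy stand-in values for a "housing" column (assumption, not real data).
original_housing = ["yes"] * 6 + ["no"] * 4
synthetic_housing = ["yes"] * 5 + ["no"] * 5

tvd = total_variation_distance(original_housing, synthetic_housing)
print(f"Total variation distance: {tvd:.3f}")
```

A value close to 0 corresponds to the "tight lines" between the two distributions; numeric columns would need to be binned first.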

[00:02:04] Let me just zoom in to show you how close that data is. Now that we have the data, let's go ahead and download it off the platform.

[00:02:17] I'm going to go through to a Tableau workbook to begin consuming the data. What we can do first here is pull in the Y column, which is the key column, indicating whether the campaign was a success or not.

[00:02:35] Let's get a sense of how many instances in my synthetic data were successful, so 5,347. I've also got a workbook here with the original data. As you can see, there are 5,289 instances in the original data, so very, very close between the synthetic and the original data.
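The same count comparison can be sketched outside Tableau. The example below is a minimal sketch in Python, assuming a comma-delimited export with a target column named "y". The helper names and the toy CSV rows are hypothetical; only the two counts passed to relative_gap come from the video.

```python
import csv
import io
from collections import Counter


def count_outcomes(csv_text, target="y", delimiter=","):
    """Count how often each value of the target column occurs."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return Counter(row[target] for row in reader)


def relative_gap(original_count, synthetic_count):
    """Absolute difference between the counts, relative to the original."""
    return abs(synthetic_count - original_count) / original_count


# Toy stand-in CSV content (assumption, not real data).
toy_original = "age,y\n41,yes\n35,no\n52,no\n29,yes\n"
toy_synthetic = "age,y\n43,yes\n33,no\n50,no\n31,no\n"

original_counts = count_outcomes(toy_original)
synthetic_counts = count_outcomes(toy_synthetic)
print(original_counts["yes"], synthetic_counts["yes"])

# Counts quoted in the video: 5,289 original vs. 5,347 synthetic successes.
video_gap = relative_gap(5289, 5347)
print(f"Relative gap: {video_gap:.1%}")
```

With the counts from the video, the gap works out to roughly 1.1% of the original number of successful instances.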

[00:02:55] Let's open up a second sheet here and begin some analysis. Let's pull the duration into the columns and change it to average. Then let's go ahead and take the balance and pull it into the rows to create a scatter plot, and change that to average as well.

[00:03:21] Let's go and move the Y column into color, and finally, let's take the age and put it in as size. Again, we'll change that to average.

[00:03:38] Now, as you can see on screen, we have a very basic scatter plot. For the folks who were successfully targeted as part of the bank marketing campaign, the average age is 43, the average bank balance is $1,800, and the average duration of the call is 540 seconds.

[00:03:59] If I go back to my original data here and into my relationship analysis, we can see a very similar story being told. Of the customers who signed up for the term deposit, the average age is 41,

[00:04:13] the average balance is $1,800, and the average duration of the call is 537.72 seconds.

[00:04:20] Very, very similar between the original data on screen now and the synthetic data which we've just generated using the MOSTLY AI platform.
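The averages shown in the scatter plot amount to a group-by on the "y" column followed by a mean of age, balance, and duration. Below is a minimal sketch of that aggregation in Python, assuming rows are available as dictionaries. The group_means helper and the toy rows are hypothetical and only loosely mirror the figures in the video.

```python
from collections import defaultdict
from statistics import mean


def group_means(rows, group_key, value_keys):
    """Average each numeric column in value_keys within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    return {
        label: {key: mean(float(r[key]) for r in members) for key in value_keys}
        for label, members in groups.items()
    }


# Toy stand-in rows (assumption, not real data).
original_rows = [
    {"y": "yes", "age": 40, "balance": 1700, "duration": 500},
    {"y": "yes", "age": 42, "balance": 1900, "duration": 580},
    {"y": "no", "age": 38, "balance": 1200, "duration": 200},
]
synthetic_rows = [
    {"y": "yes", "age": 44, "balance": 1750, "duration": 530},
    {"y": "yes", "age": 42, "balance": 1850, "duration": 550},
    {"y": "no", "age": 39, "balance": 1300, "duration": 220},
]

columns = ["age", "balance", "duration"]
original_means = group_means(original_rows, "y", columns)
synthetic_means = group_means(synthetic_rows, "y", columns)

# Compare the successful ("yes") group side by side.
for key in columns:
    print(key, original_means["yes"][key], synthetic_means["yes"][key])
```

Printing the two "yes" groups side by side gives the same kind of comparison as flipping between the original and synthetic Tableau workbooks.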

[00:04:28] Just in conclusion, that's obviously a very simple story told with some very basic Tableau skills, but the key message here is that it's the same story, and you can tell it with privacy-safe, representative, synthetic data using the MOSTLY AI platform.

[00:04:42] Thanks for watching.
