[00:00:00] Hello, synthetic data users. This is Nick from MOSTLY AI, and in this video, we're going to walk through how to create synthetic time series data in just a few simple steps within the platform. Let's get started.
[00:00:12] In this demo, we're going to model and generate purchasing behavior from a set of synthetic customers where we have several years of real transactional event data available to us. We can review the accuracy of the synthetic dataset directly inside the MOSTLY AI platform, as well as run a few additional tests to show how similar the behavior of our synthetic customers is compared to the original source.
[00:00:39] First of all, time series data needs a little pre-processing to ensure that our data is in the correct format for synthetic generation. Let's talk about our subjects, the customers whose privacy we'll be protecting with synthetic data. Now, the data for our subjects, including any related fields or attributes, have been prepared into a single data set with one row per subject.
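As a rough illustration of that pre-processing step, here is a minimal pandas sketch that splits one flat transaction export into a subject table (one row per customer) and an event table. The flat layout and the column names (`customer_id`, `region`, `date`, `cds`, `amt`, `id`, `users_id`) are assumptions made for this example, not the actual files used in the video.

```python
import pandas as pd

# Hypothetical flat export: one row per purchase, customer attributes repeated.
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "region": ["north", "north", "south", "east", "east", "east"],
    "date": pd.to_datetime([
        "1997-01-05", "1997-02-11", "1997-01-20",
        "1997-01-07", "1997-03-02", "1997-04-15",
    ]),
    "cds": [1, 2, 1, 3, 1, 2],
    "amt": [11.77, 25.00, 12.00, 40.50, 9.99, 20.00],
})


def split_subjects_and_events(df, key, subject_cols, event_cols):
    """Return (subjects, events): one row per subject, many rows per subject."""
    # Assumes subject attributes are constant per subject, so the first row wins.
    subjects = (
        df[[key] + subject_cols]
        .drop_duplicates(subset=key)
        .rename(columns={key: "id"})
        .reset_index(drop=True)
    )
    # The event table keeps only the linking key plus the time series fields.
    events = (
        df[[key] + event_cols]
        .rename(columns={key: "users_id"})
        .reset_index(drop=True)
    )
    return subjects, events


customers, purchases = split_subjects_and_events(
    raw, "customer_id", ["region"], ["date", "cds", "amt"]
)
```

The two resulting frames could then be written out with `to_csv` and uploaded as separate tables.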
[00:01:04] Let's load this into the MOSTLY AI platform by dragging and dropping our data set or simply browsing for the file that we need. We can rename our subject table if we wish, but customers works just fine here, so let's click proceed. Once the data is uploaded, let's head over to the data settings tab and take a closer look at our columns.
[00:01:27] Now, with MOSTLY AI automatically making some strong initial choices about how to work with each field in the customer data, I'm going to click on the settings icon for the ID column and tell our synthetic data model that we'd like to treat this field as a primary key rather than just a numeric field. We use the dropdown, select primary key from our list, and for our generation format, we can pick from a number of synthetic key options, but I'm going to keep things simple with a sequential number that uniquely identifies each customer
[00:02:02] in our data. Let's click save and head back to our tables tab.
[00:02:08] Now, at this point, if we click on the relationship diagram button, we can see that we're working with just a single table, our customers.
[00:02:16] Let's close this. Let's click add table to bring in our second data set that we've prepared in advance to just contain the time series or behavioral event data that links back to our customers.
[00:02:30] In this case, we've called this dataset purchases, and we can proceed to load this into the platform just like before.
[00:02:38] Now, if we click on the relationship diagram button once more, we can see that MOSTLY AI is now working with two isolated datasets. No connections have been defined between them just yet,
[00:02:48] so let's close this view, head back to our data settings tab once more, and this time, let's select the purchases table from the list, configuring it to connect these data sources together.
[00:03:01] Once again, in the settings for the user ID field, let's update our generation method, this time to foreign key, meaning that every value here relates to a customer elsewhere in the dataset.
[00:03:15] Our foreign key type will be context, which allows MOSTLY AI to fully retain all the correlations within the data. Our parent table will be customers, and our parent primary key is ID, which we specified earlier.
[00:03:30] Let's click save and it's as simple as that. In fact, we can head back to our tables view and take a look at the relationship diagram once more.
[00:03:40] Now, we are presented with a data model that uses the crow's feet notation to say that each customer can now have zero, one, or many purchases. Exactly what we need to model our time series data.
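Before uploading, it can help to confirm that the two tables really do satisfy this one-to-many relationship. The sketch below is one possible check in pandas, not something the platform requires. The column names `id` and `users_id` and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample tables; customer 4 has no purchases at all.
customers = pd.DataFrame({"id": [1, 2, 3, 4]})
purchases = pd.DataFrame({
    "users_id": [1, 1, 2, 3, 3, 3],
    "amt": [11.77, 25.00, 12.00, 40.50, 9.99, 20.00],
})


def check_relationship(parent, child, pk="id", fk="users_id"):
    """Check key uniqueness, orphaned foreign keys, and children per parent."""
    # A primary key must identify each parent row exactly once.
    pk_unique = bool(parent[pk].is_unique)
    # Foreign key values that do not match any parent row.
    orphans = child.loc[~child[fk].isin(parent[pk]), fk].unique().tolist()
    # Number of child rows per parent, including parents with zero children.
    counts = child[fk].value_counts().reindex(parent[pk], fill_value=0)
    return pk_unique, orphans, counts


pk_unique, orphans, counts = check_relationship(customers, purchases)
print(pk_unique, orphans)
print(counts)
```

An empty orphan list and a unique key mean every purchase links to exactly one customer, while the counts show the "zero, one, or many" spread per customer.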
[00:03:54] We can tweak our model a little further at this point by clicking on the purchases table. Alongside adjusting our training goal and setting our machine learning parameters and model size,
[00:04:02] we can also adjust the sequence lengths
[00:04:08] for our model to learn the behaviors in the data. As you can see, we are presented with a default sequence length of 1000 records per subject, which translates to 1000 purchases per customer in our case. Now, we have the option to restrict the model training to this sequence length, or we can use the dropdown to keep all sequences regardless of their length, although this may affect model training time. Again, a little simple data prep can tell us exactly how many purchases our most engaged customers have made historically, so our defaults will work just fine here.
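The "simple data prep" mentioned here could look something like the following pandas sketch, which counts purchases per customer and reports how many customers would exceed a given sequence length. The column name `users_id`, the helper name, and the sample data are assumptions for this example.

```python
import pandas as pd

# Hypothetical event table: one row per purchase.
purchases = pd.DataFrame({"users_id": [1, 1, 1, 2, 3, 3]})


def sequence_length_summary(events, fk="users_id", max_len=1000):
    """Summarize how many events each subject has relative to a length limit."""
    lengths = events.groupby(fk).size()  # events per subject
    return {
        "max": int(lengths.max()),
        "median": float(lengths.median()),
        "subjects_over_limit": int((lengths > max_len).sum()),
    }


summary = sequence_length_summary(purchases)
print(summary)
```

If `subjects_over_limit` is zero, as it is for this toy data, the default limit does not truncate any customer's history.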
[00:04:47] Let's come out of these training settings and let's kick off our process to create a synthetic time series data set. Once our model has completed its training, we can see that we've got a large volume of generated rows here in our output, and we can preview some samples in the browser. These are our synthetically generated customer records with all of our attributes represented here, but without any linkage back to the original dataset, ensuring that the data you see here is privacy-safe and ready for wider collaboration.
[00:05:22] Because we have a model containing customers and purchases, we can use the dropdown to take a look at our synthetic purchase data, and here we can see our user behavior data in a time series format with multiple purchases over time by a single customer with realistic synthetic purchases and associated revenue.
[00:05:44] Let's head down to our QA report and we can check our model for accuracy, both for customers and purchases data independently. For our customer subjects, we can see our correlation matrices, accuracy, and distributions between the original and synthetic data, and for our linked purchases, we get additional QA data around coherence, helping us to understand how well the synthetic time series sticks to the patterns that are found in the original distributions,
[00:06:16] in this case, tracking the source data extremely well. We can now use our synthetic data model to generate more behavioral time series data to collaborate with others or to simply download our synthetic dataset in CSV or parquet formats.
[00:06:35] Let's take our synthetic data records and run a couple more tests outside of the MOSTLY AI platform. Now, you can do this kind of validation in any data visualization or self-service analytics tool. Here, I'm going to be using a Jupyter notebook to run a few short lines of Python code against our original and synthetic purchase data. We'll start by loading in our Python libraries for data exploration and visualization, connect to our data sets, and then use the pandas library to calculate some basic descriptive statistics for both sets of data, focusing on how the number of CDs purchased and the total revenue compare for these files.
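The notebook code itself isn't shown in this transcript, so here is a minimal sketch of what such a comparison might look like. The column names `cds` and `amt` and the tiny inline frames are assumptions. In practice the two frames would be read from the downloaded files, for example with `pd.read_csv`.

```python
import pandas as pd

# Hypothetical stand-ins for the original and synthetic purchase tables.
original = pd.DataFrame({
    "cds": [1, 2, 2, 3, 12],
    "amt": [12.0, 25.0, 24.0, 40.0, 150.0],
})
synthetic = pd.DataFrame({
    "cds": [1, 2, 2, 3, 4],
    "amt": [11.0, 26.0, 23.0, 41.0, 55.0],
})


def compare_describe(orig, syn, cols):
    """Place describe() output for both datasets side by side."""
    return pd.concat(
        {"original": orig[cols].describe(), "synthetic": syn[cols].describe()},
        axis=1,
    )


stats = compare_describe(original, synthetic, ["cds", "amt"])
print(stats.round(2))
```

The result has one column group per dataset, so the count, mean, standard deviation, and quartiles can be read across in a single table.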
[00:07:18] Now, we can review these as plain numbers, but let's visualize the distributions using box plots, which show us quartiles and where outliers go outside of these ranges. From the plot, we get an immediate sense that the synthetic data has fewer extreme values. Remember, this is by design so that the synthetic dataset doesn't reveal or leak exceptional records in the data when it generates new records. We also get a sense that the main spread of the data is very similar for both original and synthetic versions, so let's take a closer look at that now.
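To make the box plot reading concrete, the sketch below computes the numbers a conventional box plot draws: the quartiles, plus the points that fall beyond 1.5 times the interquartile range. The 1.5 rule is the common plotting default and is an assumption here, as are the sample values. The video does not say which settings its plots use.

```python
import pandas as pd

# Hypothetical purchase quantities; the original has one extreme value.
original_cds = pd.Series([1, 2, 2, 3, 12])
synthetic_cds = pd.Series([1, 2, 2, 3, 4])


def box_stats(values, whisker=1.5):
    """Return quartiles and the count of points beyond the whisker fences."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    outliers = values[(values < lower) | (values > upper)]
    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "n_outliers": int(len(outliers)),
    }


orig_box = box_stats(original_cds)
syn_box = box_stats(synthetic_cds)
print(orig_box)
print(syn_box)
```

In this toy data the quartiles match while only the original has an outlier, which mirrors the pattern described above: a similar main spread with fewer extreme values on the synthetic side.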
[00:07:55] We can plot bar charts for median and mean values for our purchases and revenue, and once again, we see a very tight alignment. There's a slightly higher mean for the original data set, since it'll be influenced by those outlier values, but it's really close overall. Let's also evaluate how the time series itself behaves for purchases and revenue between the data sets,
[00:08:18] and we can plot these metrics on a monthly basis and see that we have a very tight, very coherent relationship here with the synthetic data reflecting the time series trends for purchase volume, and down here tracking highly realistic synthetic revenues over time as well.
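A monthly comparison like this one can be prepared by bucketing each purchase into its calendar month and summing the metrics. The following is a minimal sketch under assumed column names (`date`, `cds`, `amt`) and made-up rows. It is not the notebook code from the video.

```python
import pandas as pd

# Hypothetical original and synthetic purchase tables with a date column.
original = pd.DataFrame({
    "date": pd.to_datetime(["1997-01-05", "1997-01-20", "1997-02-03"]),
    "cds": [1, 2, 3],
    "amt": [10.0, 20.0, 30.0],
})
synthetic = pd.DataFrame({
    "date": pd.to_datetime(["1997-01-11", "1997-02-02", "1997-02-15"]),
    "cds": [3, 1, 2],
    "amt": [31.0, 9.0, 20.0],
})


def monthly_totals(df, date_col="date", cols=("cds", "amt")):
    """Sum the given metrics per calendar month."""
    months = df[date_col].dt.to_period("M")  # e.g. 1997-01
    return df.groupby(months)[list(cols)].sum()


# Align both datasets on month so they can be plotted or compared together.
comparison = monthly_totals(original).join(
    monthly_totals(synthetic), lsuffix="_orig", rsuffix="_syn"
)
print(comparison)
```

Calling `comparison.plot()` in a notebook with matplotlib installed would draw the original and synthetic lines on one chart.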
[00:08:37] With MOSTLY AI, we've accurately modeled a complex time series relationship for consumer buying behavior, and we've used this model to generate high-fidelity synthetic versions of both the subjects and their purchasing activity, ensuring that we now have a privacy-safe data set that can be used for collaboration, analytics, and machine learning initiatives.
[00:09:01] Ready to synthesize your own time series or behavioral data? Head over to MOSTLY AI and sign up for free on our platform to get started today.