August 16, 2023
3m 50s

Getting Started with MOSTLY AI - Synthetic Data Training Settings Explained


MOSTLY AI's synthetic data generation platform allows you to adjust the settings of the model training. Depending on your goals, you can reduce the amount of data or the number of epochs used to train the synthetic data generator. In this video, we'll show you how to do that and what the expected outcome of these adjustments will be.


Transcript

[00:00:01] Hi, everyone. In this video, I want to explain the Training settings and what you can configure here.

[00:00:07] Here I have a table that I've provided to the platform, User Data, and there are a couple of options here that I can configure. Actually, if I click here, there are more. Let me start with Training size.

[00:00:18] The way our platform works is that a machine learning model is trained on the data that you provided to the platform, and that machine learning model learns all the patterns of that input data.

[00:00:31] Here I can define how much of that input data should be provided to the model. If I leave this blank, everything that you uploaded or provided to the platform will be fed into the model, but I can limit this. I can say I only want 10,000 rows of data to be provided to the training step, or maybe 15,000, or so.
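Conceptually, the Training size setting works like subsampling the input rows before training. Here's a minimal sketch; the function name and the random-sampling choice are illustrative, not the platform's actual implementation:

```python
import random

def subsample_rows(rows, training_size=None, seed=42):
    """Return at most `training_size` rows, sampled without replacement.
    A `training_size` of None means use everything (the platform default)."""
    if training_size is None or training_size >= len(rows):
        return list(rows)
    rng = random.Random(seed)
    return rng.sample(rows, training_size)

rows = [{"id": i} for i in range(50_000)]
print(len(subsample_rows(rows)))           # all 50,000 rows go to training
print(len(subsample_rows(rows, 10_000)))   # limited to 10,000 rows
```

Fewer rows per epoch means faster training, but the model sees fewer examples of the data's patterns, which is the accuracy trade-off mentioned next.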

[00:00:53] If you limit the data that goes into training here, that will potentially speed up the training process, but it can come at the cost of accuracy.

[00:01:04] Another option here is the Maximum Training Epochs. Again, we're using a machine learning approach to learn the patterns of the input data, and every time all the data points that you defined or uploaded pass through the machine learning model once, that's called an epoch. This is an iterative approach.

[00:01:23] It'll typically take several epochs before the model has really learned the patterns of the input data, but here we can define when to stop. The platform will automatically stop the training if the model is already saturated and has learned enough, but this is the hard stop, so to say. The default is 100, but I could reduce this to, say, 50 if I want to make sure that the platform stops earlier

[00:01:47] and I don't want to spend too much time.
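The interplay between automatic stopping ("the model is saturated") and the hard epoch cap can be sketched as a toy training loop. The loss function, patience, and tolerance values below are illustrative assumptions, not the platform's internals:

```python
def train(step, max_epochs=100, patience=5, tol=1e-4):
    """Toy training loop: stop early once the loss stops improving
    (the model is 'saturated'), or at max_epochs as the hard stop."""
    best = float("inf")
    stale = 0
    for epoch in range(1, max_epochs + 1):
        loss = step(epoch)
        if best - loss > tol:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch  # early stop: no improvement for `patience` epochs
    return max_epochs  # hard stop reached

# A loss curve that stops improving after epoch 20:
loss_curve = lambda e: max(1.0 - 0.05 * e, 0.0)
print(train(loss_curve, max_epochs=100))  # stops early, well before 100
print(train(loss_curve, max_epochs=10))   # hard stop at 10
```

Lowering `max_epochs` only matters when the model has not yet saturated; otherwise the early-stopping criterion fires first.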

[00:01:50] Actually, these two settings are also controlled here with what we call the Training Goal.

[00:01:56] By default, it's Accuracy, which means we have 100 epochs and the full training size.

[00:02:02] If you switch to Speed, you will see that the maximum training epochs are reduced to 10 and the training size is limited to 100,000 rows of input data; with Turbo, it's one training epoch and 10,000 rows.

[00:02:15] This is going to be the fastest, but, obviously, with the lowest accuracy. If I want to optimize for accuracy, I'm going to pick accuracy.
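The three Training Goal presets described above can be summarized as a small lookup. The dictionary shape and function name are hypothetical; the epoch and row numbers come from the narration:

```python
# Preset values as stated in the video: Accuracy uses everything,
# Speed and Turbo trade accuracy for shorter training runs.
TRAINING_GOALS = {
    "accuracy": {"max_epochs": 100, "training_size": None},     # None = all rows
    "speed":    {"max_epochs": 10,  "training_size": 100_000},
    "turbo":    {"max_epochs": 1,   "training_size": 10_000},
}

def resolve_settings(goal="accuracy"):
    """Return the epoch cap and row limit implied by a Training Goal."""
    return TRAINING_GOALS[goal]

print(resolve_settings("turbo"))
```

Picking a goal is simply a shortcut for setting the two individual values; you can still override them by hand as shown earlier.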

[00:02:22] Two other options here are the Model size and the Batch size. The Model size is a parameter that defines the internal model architecture that's used to train the generator.

[00:02:38] The default is Medium, which works really well out of the box most of the time. If you have a very large dataset and you run into out-of-memory issues, you can try the Small model size, which has a lower memory footprint.

[00:02:51] If you really want to go for super high accuracy, you might want to use the Large model size, which creates the best synthetic data but also has a large memory footprint. I typically recommend keeping this at Medium.

[00:03:06] The Batch size is another parameter that you can specify: the number of records used for each training step. If you select a larger batch size, more data is fed into the model at once, and that can speed up the training process.

[00:03:23] However, it can come at the cost of accuracy. If you reduce the batch size, less data is fed into the model at once, which can make training slower, but with higher accuracy and a smaller memory footprint.

[00:03:37] I actually recommend leaving this at Auto as well, and then the platform will figure out the best batch size for you.
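The speed side of the batch-size trade-off is easy to see: with a fixed amount of training data, a larger batch size means fewer optimizer steps per epoch. A minimal sketch (illustrative helper, not the platform's batching code):

```python
def make_batches(rows, batch_size):
    """Split rows into consecutive batches. Each batch is one training
    step, so larger batches mean fewer steps per epoch."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

rows = list(range(1000))
print(len(make_batches(rows, 256)))  # 4 steps per epoch
print(len(make_batches(rows, 32)))   # 32 steps per epoch
```

Fewer, larger steps run faster but give the optimizer coarser gradient updates, which is the accuracy cost mentioned above.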

[00:03:45] Those were the training options.

[00:03:48] Thanks for watching.

Ready to try synthetic data?

The best way to learn about synthetic data is to experiment with synthetic data generation. Try it for free or get in touch with our sales team for a demo.