Generate more data

After an Ad-hoc or file-based catalog job is completed, you can reuse its trained synthetic data generation model to generate more data.

For models trained on a dataset that consists of a subject table and a linked table, you can generate linked tables based on a subject table that you specify. In data science, this is called conditional data generation. You can use this function to simulate sequences for subjects of interest.

Synthesizing continuous data streams

You can process a continuous data stream by dividing it into batches and reusing the first batch’s synthetic subjects to synthesize the forthcoming batches.

Let’s take the example of a continuous transactional data stream, which you can see in the diagram below. When you start processing the first batch of this stream, you begin by generating synthetic customers and their transactions. If you did the same for the forthcoming batches, you would end up with different synthetic customers in each batch, resulting in an inconsistent synthetic data stream.

By using the same synthetic customers for all forthcoming batches, you’ll be able to deliver a consistent synthetic data stream, which in turn lays the groundwork for more robust downstream analytics and AI-model training.

Generate more data - Stream diagram
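
The batch-by-batch idea described above can be illustrated with a small, self-contained sketch. It does not call MOSTLY AI: `make_synthetic_customers` and `make_synthetic_transactions` are hypothetical stand-ins that use random numbers in place of a trained model, and the column names are assumptions chosen for illustration. The point is only that generating the subjects once and passing the same subjects to every later batch keeps the customer IDs stable across the stream.

```python
import random


def make_synthetic_customers(n, seed=0):
    """Hypothetical stand-in for the synthetic subject table of the first batch."""
    rng = random.Random(seed)
    return [
        {"customer_id": f"C{i:04d}", "segment": rng.choice(["retail", "business"])}
        for i in range(n)
    ]


def make_synthetic_transactions(customers, batch_no, seed=0):
    """Hypothetical stand-in for conditional generation of one batch.

    Every customer passed in receives between one and three transactions.
    """
    rng = random.Random(seed * 1000 + batch_no)
    rows = []
    for customer in customers:
        for _ in range(rng.randint(1, 3)):
            rows.append(
                {
                    "customer_id": customer["customer_id"],
                    "batch": batch_no,
                    "amount": round(rng.uniform(5, 500), 2),
                }
            )
    return rows


# Generate the subjects once, with the first batch.
customers = make_synthetic_customers(5)

# Reuse the same subjects for every forthcoming batch.
stream = []
for batch_no in range(1, 4):
    stream.extend(make_synthetic_transactions(customers, batch_no))

# Check that each batch refers to exactly the same set of customers.
known_ids = {c["customer_id"] for c in customers}
ids_per_batch = {
    b: {row["customer_id"] for row in stream if row["batch"] == b}
    for b in range(1, 4)
}
consistent = all(ids == known_ids for ids in ids_per_batch.values())
print("Same customers in every batch:", consistent)
```

If `make_synthetic_customers` were called again for each batch with fresh randomness, the batches would no longer be guaranteed to share the same subjects, which is the inconsistency the section warns about.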

How to generate more data

Begin reusing your trained synthetic data generation model by following the steps below. By default, the model synthesizes the same number of tables that it was trained with.

If you want to generate sequence tables conditionally, you can upload a subject table when the Generate more data modal appears. MOSTLY AI will then only return the resulting sequence table.

💡

Do not upload an original subject table or a subsample of one. Because an uploaded subject table is not synthesized, no privacy protection is applied to it.

The resulting synthetic dataset will not have a QA report since the original dataset is no longer available.
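
As a basic sanity check before uploading a subject table, you could compare it against the original subject table and confirm that no rows match. The sketch below is one possible way to do that in plain Python. The helper names (`row_key`, `find_overlap`), the sample rows, and the column names are all assumptions made for illustration. An exact-match check like this only catches verbatim copies. It is not a privacy guarantee.

```python
def row_key(row, columns):
    """Build a normalized comparison key from the chosen columns."""
    return tuple(str(row[c]).strip().lower() for c in columns)


def find_overlap(seed_rows, original_rows, columns):
    """Return the seed rows that also occur in the original subject table."""
    original_keys = {row_key(r, columns) for r in original_rows}
    return [r for r in seed_rows if row_key(r, columns) in original_keys]


# Illustrative data only; real tables would be loaded from your own files.
original = [
    {"name": "Ada", "birth_year": 1985, "city": "Vienna"},
    {"name": "Ben", "birth_year": 1990, "city": "Graz"},
]
seed = [
    {"name": "Cleo", "birth_year": 1978, "city": "Linz"},
    {"name": "ben ", "birth_year": 1990, "city": "Graz"},  # copy of an original row
]

leaked = find_overlap(seed, original, ["name", "birth_year", "city"])
if leaked:
    print(f"{len(leaked)} seed row(s) match original subjects; do not upload this table.")
```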

Steps

  1. In the Jobs list, locate the job you want to generate more data from and click its Generate more data button.
  2. A drawer appears, asking you for the number of subjects that you want to generate. If the synthetic data generation model was trained on a dataset that consists of a subject table and a linked table, you see two tabs: Generate by quantity and Generate with seed. The latter lets you upload a subject table, which MOSTLY AI then uses to generate a linked table based on its contents.
  3. Next, click Create a synthetic dataset.

Result

The generation job now appears in the Jobs list.

MOSTLY AI will skip the Analyze and Train stages as the model is already trained.