Once a job is completed, you can reuse the trained synthetic data generation model to generate more data.
For models that are trained on sequential and time-series datasets, you can generate sequence tables based on a subject table that you specify. In data science, this is called conditional data generation. You can use this function to simulate sequences for subjects of interest, but it also opens up the opportunity to process continuous data streams and multi-table setups.
You can process a continuous data stream by dividing it into batches and reusing the first batch’s synthetic subjects to synthesize the forthcoming batches.
Let’s take the example of a continuous transactional data stream, which you can see in the diagram below. When you start processing the first batch of this stream, you begin by generating synthetic customers and their transactions. If you did the same for the forthcoming batches, you would end up with different synthetic customers in each batch, resulting in an inconsistent synthetic data stream.
By using the same synthetic customers for all forthcoming batches, you’ll be able to deliver a consistent synthetic data stream, which in turn lays the groundwork for more robust downstream analytics and AI-model training.
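The batching idea above can be sketched in a few lines of Python. This is only an illustration of the workflow, not MOSTLY AI's API: `synthesize_subjects` and `synthesize_transactions` are hypothetical stand-ins for the first generation run and for conditional generation, and the column names are assumptions.

```python
import random


def synthesize_subjects(n_subjects):
    """Hypothetical stand-in for the first run: creates the synthetic customers."""
    return [{"customer_id": f"syn-{i:04d}"} for i in range(n_subjects)]


def synthesize_transactions(subjects, batch_no):
    """Hypothetical stand-in for conditional generation.

    Takes a synthetic subject table and returns sequence rows for one batch.
    """
    rng = random.Random(batch_no)  # fixed seed per batch, for reproducibility
    rows = []
    for subject in subjects:
        # Every subject gets between one and three transactions per batch.
        for _ in range(rng.randint(1, 3)):
            rows.append({
                "customer_id": subject["customer_id"],
                "batch": batch_no,
                "amount": round(rng.uniform(1, 100), 2),
            })
    return rows


def process_stream(n_batches, n_subjects):
    # The synthetic customers are generated once, with the first batch ...
    subjects = synthesize_subjects(n_subjects)
    stream = []
    # ... and reused as the subject table for every batch that follows.
    for batch_no in range(n_batches):
        stream.extend(synthesize_transactions(subjects, batch_no))
    return subjects, stream


subjects, stream = process_stream(n_batches=3, n_subjects=5)
```

Because the subject table is created once and passed to every batch, each batch refers to the same set of synthetic customer IDs.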
Another use case for conditional data generation is to synthesize multi-table setups. Let’s consider the setup that’s depicted in the diagram below. It consists of a subject table, two tables that are linked to this table, and a table that’s linked to one of the other linked tables.
To generate a synthetic version of this dataset, the linked tables need to be processed using a synthetic subject table, which requires the following steps:
First, synthesize the tables that you’ll need as synthetic subject tables to process the other linked tables in the next steps. These are Table A and Table B.
Next, use the resulting synthetic version of Table A to synthesize linked table C.
And lastly, use the synthetic version of Table B to synthesize linked table D.
The result of this process is a synthetic copy of the original dataset that didn’t lose any of its features, usability, granularity, or statistical correlations.
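The ordering in the steps above (a table can only be synthesized once the table it depends on has a synthetic version) can be derived with a topological sort. The sketch below uses Python's standard `graphlib` module. The `parents` mapping is an assumption that mirrors the steps as written (C is generated from A, D from B). It is not an export of any MOSTLY AI configuration.

```python
from graphlib import TopologicalSorter

# Assumed dependencies, taken from the steps above:
# each table maps to the tables whose synthetic version it needs.
parents = {
    "A": [],
    "B": [],
    "C": ["A"],
    "D": ["B"],
}


def synthesis_order(parents):
    """Return the tables in an order where every table follows its parents."""
    return list(TopologicalSorter(parents).static_order())


order = synthesis_order(parents)
for table in order:
    needs = ", ".join(parents[table]) or "no synthetic subject table"
    print(f"synthesize {table} (requires: {needs})")
```

Any order that the sort returns places A before C and B before D, which is the only constraint the steps impose.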
Begin reusing your trained synthetic data generation model by following the steps below. By default, it will synthesize the same number of tables that the model was trained with.
If you want to generate sequence tables conditionally, like in the examples provided in sections 5.1 and 5.2, you can upload a subject table when the Generate more data modal appears. MOSTLY AI will then only return the resulting sequence table.
|Please do not upload original subject tables or subsamples of them. Uploaded subject tables are not synthesized, so no privacy protection is applied to them.|
|The resulting synthetic dataset will not have a QA report since the original dataset is no longer available.|
In the Job summary, locate the Generation pane and click on Generate more data.
A modal appears, asking you for the number of subjects that you want to generate.
If the synthetic data generation model was trained on a sequential dataset, you will have the option to upload a subject table. MOSTLY AI will then use this table to generate a sequence table that is based on its contents.
Next, click on Generate to start the job. A new status page will appear. MOSTLY AI will skip the Analysis stages as the model is already trained.
The Downloads section appears once MOSTLY AI has finished the synthesization job. Click on the button to download your synthesized dataset.