Creating your first AI-powered synthetic dataset takes very little effort.
This tutorial guides you through the steps of downloading a dataset, uploading it to MOSTLY AI, starting the synthesization job, and downloading the synthetic version of the original dataset.
You won’t explore every nook and cranny of the user interface in this tutorial. Instead, it gives you a concise overview of the synthetic data generation process while delivering concrete results.
You will see how fun and easy it is!
Let’s start by exploring the main menu. Depending on your access privileges, you’ll find two or more of the following options in the left side menu: Jobs, Data Catalogs, Documentation, Settings, and Users.
For this quickstart tutorial, we will just focus on the Jobs option.
Once you click on the Jobs option, a page appears that allows you to set up a new job. If you scroll down, you can see the history of previous jobs. Here, you can download their synthetic datasets and QA reports.
To set up a new job, you first need to upload a subject table. A subject is an entity or individual whose privacy you are going to protect. This table further contains their features, such as their height, gender, place of residence, or income.
In addition, you can also choose to upload a linked table if you want to synthesize the behaviors of these subjects. This may include historical activities, transactional records, and customer journeys.
To help you along, we created two datasets for you:
Learn more about these datasets by visiting the Resources section.
Here, you can find more details on their structure and contents.
If you want to synthesize your own dataset, we recommend checking out the Preparing your dataset section.
Please follow the three steps below to configure your first job.
Drag and drop your tables into their respective upload areas.
Drag players.csv into the left upload area.
If you want to synthesize the complete Baseball dataset, click on the Add new table icon and drag seasons.csv to this upload area.
Next, click on Proceed and wait for the files to upload.
You’ll now see a Table details tab. If you uploaded two tables in the previous step, you’ll also see a Relationships tab. MOSTLY AI analyzed a sample of the dataset you uploaded and preconfigured the available settings in these tabs.
If you want to, you can just click on Launch Job and immediately start the synthetic data generation job. Instead, let’s reward our curiosity and explore the various settings that we can configure.
In the settings tab shown below, you can specify the number of training and generated subjects. This allows you to do some nifty things, such as increasing the size of small datasets or creating representative subsets of your data. Learn more about these settings in Step 4 of the Ad hoc jobs section.
The Relationships tab only appears if you uploaded a subject table and a linked table. Here you link the two tables by specifying the primary and foreign key.
If your subject table has an id column and your linked table has a column name containing _id, then these tables will be automatically linked.
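The naming rule above can be pictured with a short sketch. This is only one possible reading of the rule, not MOSTLY AI's actual implementation; the function name find_link_columns and the sample column names are hypothetical.

```python
def find_link_columns(subject_columns, linked_columns):
    """Return (primary_key, foreign_key) if the naming rule matches, else None.

    Assumed rule: the subject table has a column named exactly "id", and the
    linked table has at least one column whose name contains "_id".
    """
    if "id" not in subject_columns:
        return None
    for column in linked_columns:
        if "_id" in column:
            return ("id", column)
    return None


# Hypothetical headers, for illustration only.
players_header = ["id", "country", "weight", "height"]
seasons_header = ["players_id", "year", "team", "games"]

print(find_link_columns(players_header, seasons_header))
```

If the function returns None for your own tables, you would need to pick the primary and foreign key manually in the Relationships tab.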
In the Table details tab, you can optionally configure how MOSTLY AI will process your database’s columns during synthetization. Here you can include or exclude columns, tailor their processing to your use case, and replace the contents of certain types of columns with mock data.
The Table details tab is divided into two panes. The left pane lists your dataset’s tables, and the right pane lists the columns of these tables. To learn more about these settings, please visit Step 7 of the Ad hoc jobs section.
Once you’ve seen everything, scroll down and click on Launch Job to generate a synthetic version of the uploaded dataset.
A page appears that informs you about the status of your job.
You can now sit back, relax, and let MOSTLY AI do its work.
There are two sections on this page that inform you about the synthetic data generation process:
The top section provides general information:
The name of the job as it was specified during its configuration.
There are four job types:
Ad hoc: synthesizes a dataset uploaded using the web UI.
Data catalog: synthesizes a database or dataset stored in a cloud bucket or on a local server.
Generate with subject count: creates a specified number of new synthetic subjects from a previous job’s readily trained AI model.
Generate with seed: creates a linked table for an uploaded subject table using a previous job’s readily trained AI model.
This field indicates when the original dataset was uploaded.
The Job summary section informs you about the synthesization tasks currently being performed. It shows which tables are being synthesized, the current tasks, their status, and the total duration. In addition, you can click on the kebab icon on the right side of each entry to see a detailed task list or an overview of the columns' generation methods and encoding types.
A task list appears when you choose View tasks from the kebab menu. The table below provides an overview of all the tasks and steps you will see in this list.
| Task | Step | Description |
| | | Ensures that very large tables can be processed regardless of system memory size. |
| | | The table is analyzed for its data types and unique values. |
| | | The table is transformed for efficient processing. |
| | | Using generative neural networks, a model is trained to retain your dataset’s granularity, statistical correlations, structures, and time-dependencies. |
| Generating synthetic data | | The resulting AI model is used to create a synthetic version of the table. |
| Packaging synthetic data | Creating zip archive | Creates a ZIP archive with the synthetic version of the dataset. |
| Creating the quality assurance | Analyzing synthetic data for quality and accuracy | The resulting synthetic table is tested against the original for accuracy and privacy. It checks for identical information matches and whether the synthetic subjects are dissimilar enough to the original subjects to prevent re-identification. |
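To make the idea of an identical information match more concrete, here is a deliberately simplified sketch. It only counts synthetic rows that are exact copies of an original row; it is not MOSTLY AI's privacy test, which also assesses how dissimilar the synthetic subjects are. The function name count_identical_rows and the sample records are invented for illustration.

```python
def count_identical_rows(original_rows, synthetic_rows):
    """Count synthetic rows that exactly match some original row."""
    original_set = {tuple(row) for row in original_rows}
    return sum(1 for row in synthetic_rows if tuple(row) in original_set)


# Invented sample records, for illustration only.
original = [("Ann", 34, "Vienna"), ("Bob", 51, "Graz")]
synthetic = [("Cara", 36, "Vienna"), ("Bob", 51, "Graz")]

matches = count_identical_rows(original, synthetic)
share = matches / len(synthetic)
print(f"{matches} identical rows ({share:.0%} of synthetic data)")
```

A check like this says nothing about near-matches, which is why a distance-based comparison is needed in addition to it.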
To learn more about the AI training step, you can click on View training logs to see how the AI model training is going. Here you can find a chart depicting the training and validation loss per epoch.
If you consider the model to be sufficiently trained, or you want to speed up the synthetization process, you can click on Stop training to skip to the synthetic data generation step.
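One common way to judge whether a model is "sufficiently trained" is to check whether the validation loss has stopped improving. The sketch below shows that heuristic under stated assumptions: the function name has_plateaued, the patience and min_delta thresholds, and the loss values are all hypothetical and are not taken from MOSTLY AI.

```python
def has_plateaued(validation_losses, patience=3, min_delta=0.001):
    """Return True if the last `patience` epochs brought no real improvement.

    An improvement counts only if the best recent loss is at least
    `min_delta` lower than the best loss seen before those epochs.
    """
    if len(validation_losses) <= patience:
        return False
    best_before = min(validation_losses[:-patience])
    best_recent = min(validation_losses[-patience:])
    return best_recent > best_before - min_delta


# Hypothetical validation loss per epoch, as you might read it off the chart.
losses = [0.92, 0.71, 0.60, 0.55, 0.551, 0.553, 0.552]
print(has_plateaued(losses))
```

If the curve is still dropping steadily, stopping early will likely cost accuracy, so treat a result like this as a hint and not as a rule.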
Once MOSTLY AI completes the synthetization job, a QA report appears with general information about the synthetic data’s accuracy, whether it passed the privacy tests, the number and type of columns, and the number of generated subjects.
The Overall accuracy and Privacy tests checkmarks indicate whether the synthetization was successful and whether you can share the synthetic data across your business and partnerships.
If you want to see the detailed privacy and accuracy charts, click on View full report and scroll down. We also created a guide to help you better understand all the privacy and accuracy metrics.
To download your synthetic data, go to the job details at the top, click on the kebab icon on the right, and select Download synthetic data from the menu.
On your computer, navigate to the Downloads folder and open the .ZIP file. You’ll find the same number of .CSV files as in your original dataset, with the same columns and number of rows.
What’s changed is that the content is entirely composed of fictional characters that do not reveal any information on your original subjects.
If you have uploaded a dataset in Parquet format in step 2a, your synthetic dataset will also be in this format. In this case, each table may have been partitioned into different files. This is a feature of the Parquet format that will help with efficient processing in downstream tasks.
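If you would like to confirm that a downloaded CSV file has the same columns and row count as its original, a small script can do the comparison. This is a minimal sketch using only the Python standard library; the function names and the file and archive names in the usage note are assumptions, and it expects UTF-8 encoded CSV files with a header row.

```python
import csv
import io
import zipfile


def csv_shape(text_stream):
    """Return (header, number_of_data_rows) for an open CSV text stream."""
    reader = csv.reader(text_stream)
    header = next(reader)
    row_count = sum(1 for _ in reader)
    return header, row_count


def compare_to_archive(original_path, archive_path, member_name):
    """Compare an original CSV file with one CSV member of a ZIP archive."""
    with open(original_path, newline="", encoding="utf-8") as original_file:
        original_header, original_rows = csv_shape(original_file)
    with zipfile.ZipFile(archive_path) as archive:
        with archive.open(member_name) as raw:
            # Wrap the binary archive member so the csv module can read text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            synthetic_header, synthetic_rows = csv_shape(text)
    return {
        "same_columns": original_header == synthetic_header,
        "same_row_count": original_rows == synthetic_rows,
    }
```

For example, compare_to_archive("players.csv", "synthetic_data.zip", "players.csv") would return a dictionary with two booleans. The archive name and the member name here are placeholders, so check the actual names in your Downloads folder first.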
Congratulations, you’ve created your first synthetic dataset in only a few easy steps!
If you want to learn more about all the configuration options available to you, please read our comprehensive guide to configuring a synthetization job.
We also created a guide to help you better understand all the details of the QA report. This report provides an in-depth analysis of how accurately your original dataset’s statistical features were reproduced in the synthetic version.