Creating your first AI-powered synthetic dataset takes very little effort.
This tutorial guides you through the steps of downloading a dataset, uploading it to MOSTLY AI, starting the synthesization job, and downloading the synthetic version of the original dataset.
You won’t explore every nook and cranny of the user interface in this tutorial. Instead, it gives you a concise overview of the synthetic data generation process while delivering concrete results.
You will see how fun and easy it is!
1. Exploring MOSTLY AI’s main menu
Let’s start by exploring the main menu. Depending on your access privileges, you’ll find two or more of the following options in the left side menu: Jobs, Data Catalogs, Documentation, Settings, and Users.
For this quickstart tutorial, we will just focus on the Jobs option.

Once you’ve clicked Jobs, a page appears where you can set up a new job. If you scroll down, you can see the history of previous jobs. Here, you can download the generated synthetic datasets and QA reports.
2. Setting up your first job
To set up a new job, you first need to upload a subject table. A subject is an entity or individual whose privacy you are going to protect. This table further contains their features, such as their height, gender, place of residence, or income.
In addition, you can also choose to upload a linked table if you want to synthesize the behaviors of these subjects. This may include historical activities, transactional records, and customer journeys.
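To make the distinction between the two table types concrete, here is a minimal sketch of a subject table and a linked table. The miniature data and the column names (`id`, `players_id`, and so on) are assumptions made up for illustration and are not taken from the example datasets. The sketch checks two properties you would expect from such a pair: each subject appears exactly once in the subject table, and every row in the linked table points to an existing subject.

```python
import csv
import io

# Hypothetical miniature tables; all column names and values are illustrative.
SUBJECTS_CSV = """id,country,height
p1,USA,185
p2,CAN,178
"""

LINKED_CSV = """players_id,year,games
p1,2001,120
p1,2002,98
p2,2001,45
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def orphaned_rows(subjects, linked, primary_key, foreign_key):
    """Return the linked rows whose foreign key matches no subject."""
    known = {row[primary_key] for row in subjects}
    return [row for row in linked if row[foreign_key] not in known]


subjects = read_rows(SUBJECTS_CSV)
linked = read_rows(LINKED_CSV)

# A subject table holds one row per subject, so its key values are unique.
ids = [row["id"] for row in subjects]
has_unique_ids = len(ids) == len(set(ids))

# A linked table may hold many rows per subject, each referencing a subject.
orphans = orphaned_rows(subjects, linked, "id", "players_id")
```

With the sample data above, `has_unique_ids` is true and `orphans` is empty. Running a similar check on your own files before uploading can surface key problems early.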
To help you along, we created two datasets for you:
- US Census Income dataset (download here) with only a subject table.
- Baseball dataset (download here) with a subject table (players.csv) and a linked table (seasons.csv).
Learn more about these datasets by visiting the Resources section. Here, you can find more details on their structure and contents.
If you want to synthesize your own dataset, we recommend checking out the Preparing your dataset section.
Please follow the steps below to configure your first job.
- Drag and drop your tables into their respective upload areas.
  - First, drag us-census-income.csv or players.csv into the left upload area.
  - If you want to synthesize the complete Baseball dataset, click on the Add new table icon and drag seasons.csv to this upload area.
  - Next, click on Proceed and wait for the files to upload.
- You’ll now see a Settings and a Table details tab. If you uploaded two tables in the previous step, you’ll also see a Relationships tab. MOSTLY AI analyzed a sample of the dataset you uploaded and preconfigured the available settings in these tabs.
  If you want to, you could just click on Launch Job and immediately start the synthetic data generation job. Instead, let’s reward our curiosity and explore the various settings that we can configure.
  In the Settings tab shown below, you can specify the number of training and generated subjects. This allows you to do some nifty things, such as increasing the size of small datasets or creating representative subsets of your data. Learn more about these settings in Step 4 of the Ad hoc jobs section.
- The Relationships tab only appears if you uploaded a subject table and a linked table. Here you link the two tables by specifying the primary and foreign key.
  If your subject table has an id column and your linked table has a column name containing _id, then these tables will be automatically linked.
- In the Table details tab, you can optionally configure how MOSTLY AI will process your database’s columns during synthetization. Here you can include or exclude columns, tailor their processing to your use case, and replace the contents of certain types of columns with mock data.
  The Table details tab is divided into two panes. The left pane lists your dataset’s tables, and the right pane lists the columns of these tables. To learn more about these settings, please visit Step 7 of the Ad hoc jobs section.
- Once you’ve seen everything, scroll down, and click on Launch Job to generate a synthetic version of the uploaded dataset.
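The automatic linking rule mentioned for the Relationships tab can be approximated in a few lines. This is only a sketch of the rule as stated in this tutorial (an `id` column in the subject table and a column name containing `_id` in the linked table). The function name `guess_link`, the exact-match and case-sensitive comparison, and the choice to take the first matching column are assumptions; the platform's actual matching logic may differ.

```python
def guess_link(subject_columns, linked_columns):
    """Return (primary_key, foreign_key) if the naming rule applies, else None.

    Assumption: names are compared case-sensitively and the first linked
    column containing "_id" is used as the foreign key.
    """
    if "id" not in subject_columns:
        return None
    for column in linked_columns:
        if "_id" in column:
            return ("id", column)
    return None


# Hypothetical column lists, for illustration only.
link = guess_link(["id", "country", "height"], ["players_id", "year", "games"])
no_link = guess_link(["player", "country"], ["players_id", "year"])
```

If your own tables do not follow this naming pattern, you can still specify the primary and foreign key manually in the Relationships tab.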
3. Observing the synthesization job
A page appears that informs you about the status of your job.
You can now sit back, relax, and let MOSTLY AI do its work.
There are two sections on this page that inform you about the synthetic data generation process:
- The top section provides general information:
  Job name: The name of the job as it was specified during its configuration.
  Job type: There are four job types:
  - Ad hoc synthesizes a dataset uploaded using the web UI.
  - Data catalog synthesizes a database or dataset stored in a cloud bucket or local server.
  - Generate with subject count creates a specified number of new synthetic subjects from a previous job’s readily trained AI model.
  - Generate with seed creates a linked table for an uploaded subject table using a previous job’s readily trained AI model.
  Uploaded: This field indicates when the original dataset was uploaded.
- The Job summary section informs you about the synthesization tasks currently being performed. It shows which tables are being synthesized, the current tasks, their status, and the total duration. In addition, you can click on the kebab icon on the right side of each entry to see a detailed task list or an overview of the columns' generation methods and encoding types.
  A task list appears when you choose View tasks from the kebab menu. The table below provides an overview of all the tasks and steps you will see in this list.

  | Task | Step | Description |
  | --- | --- | --- |
  | Synthetizing table / Generating text | Organizing data | Ensures that very large tables can be processed regardless of system memory size. |
  | | Data analysis | The table is analyzed for its data types and unique values. |
  | | Transforming data | The table is transformed for efficient processing. |
  | | AI training | Using generative neural networks, a model is trained to retain your dataset’s granularity, statistical correlations, structures, and time-dependencies. |
  | | Generating synthetic data | The resulting AI model is used to create a synthetic version of the table. |
  | Packaging synthetic data | Creating zip archive | Creates a ZIP archive with the synthetic version of the dataset. |
  | Creating the quality assurance report | Analyzing synthetic data for quality and accuracy | The resulting synthetic table is tested against the original for accuracy and privacy. It checks for identical information matches and whether the synthetic subjects are dissimilar enough to the original subjects to prevent re-identification. |

  To learn more about the AI training step, you can click on View training logs to see how the AI model training is going. Here you can find a chart depicting the training and validation loss per epoch.
  If you consider the model to be sufficiently trained, or you want to speed up the synthetization process, you can click on Stop training to skip to the synthetic data generation step.
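When reading the loss chart, one common rule of thumb for "sufficiently trained" is that the validation loss has stopped improving for several epochs. The sketch below illustrates that idea on a hand-written list of per-epoch validation losses. The numbers, the function names, and the patience of three epochs are all assumptions for illustration; this is not MOSTLY AI's own stopping criterion.

```python
def epochs_since_best(validation_losses):
    """Return how many epochs have passed since the lowest validation loss."""
    if not validation_losses:
        return 0
    best_epoch = min(range(len(validation_losses)),
                     key=lambda i: validation_losses[i])
    return len(validation_losses) - 1 - best_epoch


def looks_converged(validation_losses, patience=3):
    """True if the validation loss has not improved for `patience` epochs."""
    return epochs_since_best(validation_losses) >= patience


# Hypothetical values, as one might read them off the training chart.
losses = [0.90, 0.70, 0.62, 0.63, 0.64, 0.65]
converged_now = looks_converged(losses)
converged_earlier = looks_converged(losses[:4])
```

In this example the lowest loss was reached three epochs ago, so `converged_now` is true, while after only four epochs the same check would still suggest waiting.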
4. Evaluating and downloading your synthetic data
Once MOSTLY AI completes the synthetization job, a QA report appears with general information about the synthetic data’s accuracy, whether it passed the privacy tests, the number and type of columns, and the number of generated subjects.

The green Overall accuracy and Privacy tests checkmarks indicate that the synthetization was successful and that you can share the synthetic data across your business and partnerships.
If you want to see the detailed privacy and accuracy charts, click on View full report and scroll down. We also created a guide to help you better understand all the privacy and accuracy metrics.
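The quality assurance step described earlier checks, among other things, for identical information matches between the original and the synthetic data. As a rough intuition for what such a check looks at, the sketch below counts synthetic rows that are exact copies of an original row. The sample data and the function name are hypothetical, and the QA report's actual privacy tests are more involved than this simple comparison.

```python
import csv
import io


def count_identical_rows(original_text, synthetic_text):
    """Count synthetic data rows that exactly equal some original data row."""
    original_rows = list(csv.reader(io.StringIO(original_text)))[1:]  # skip header
    synthetic_rows = list(csv.reader(io.StringIO(synthetic_text)))[1:]
    original_set = {tuple(row) for row in original_rows}
    return sum(1 for row in synthetic_rows if tuple(row) in original_set)


# Hypothetical miniature tables, for illustration only.
ORIGINAL = "age,income\n34,52000\n51,48000\n"
SYNTHETIC = "age,income\n36,50500\n51,48000\n29,61000\n"

identical = count_identical_rows(ORIGINAL, SYNTHETIC)
```

Here one of the three synthetic rows coincides with an original row. Note that a few coincidental matches can occur in data with few distinct values, so a raw count like this needs to be read in context.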

To download your synthetic data, go to the job details at the top, click on the kebab icon on the right, and select Download synthetic data from the menu.

On your computer, navigate to the Downloads folder and open the .ZIP file. You’ll find the same number of .CSV files as in your original dataset, and they have the same columns and number of rows.
What’s changed is that the content is entirely composed of fictional characters that do not reveal any information on your original subjects.
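If you would like to confirm this yourself, the following sketch compares the shape of a CSV file inside a ZIP archive with the shape of an original CSV file, using only the Python standard library. To stay self-contained it builds a tiny in-memory archive instead of opening a real download; the file name `players.csv`, the sample contents, and the helper names are assumptions for illustration. For a real check you would pass the path of your downloaded archive to `zip_csv_shapes` and open your original file for `csv_shape`.

```python
import csv
import io
import zipfile


def csv_shape(text_stream):
    """Return (column names, number of data rows) for an open CSV text stream."""
    reader = csv.reader(text_stream)
    columns = next(reader, [])
    return columns, sum(1 for _ in reader)


def zip_csv_shapes(archive):
    """Map each .csv member of a ZIP archive (path or file object) to its shape."""
    shapes = {}
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.lower().endswith(".csv"):
                with zf.open(name) as raw:
                    text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                    shapes[name] = csv_shape(text)
    return shapes


# Build a hypothetical archive in memory so the sketch runs without any files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("players.csv", "id,country\ns1,USA\ns2,CAN\n")
buffer.seek(0)

synthetic_shapes = zip_csv_shapes(buffer)
original_shape = csv_shape(io.StringIO("id,country\np1,USA\np2,CAN\n"))
same_shape = synthetic_shapes["players.csv"] == original_shape
```

The comparison only looks at column names and row counts, not at the values, which are expected to differ between the original and the synthetic data.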
If you have uploaded a dataset in Parquet format in step 2a, your synthetic dataset will also be in this format. In this case, each table may have been partitioned into different files. This is a feature of the Parquet format that will help with efficient processing in downstream tasks.
5. Further reading
Congratulations, you’ve created your first synthetic dataset in only a few easy steps!
If you want to learn more about all the configuration options available to you, please read our comprehensive guide to configuring a synthetization job.
We also created a guide to help you better understand all the details of the QA report. This report provides an in-depth analysis of how accurately your original dataset’s statistical features were reproduced in the synthetic version.