Creating your first AI-powered synthetic dataset takes very little effort.
This tutorial guides you through the steps of downloading a dataset, uploading it to MOSTLY AI, starting the synthesization job, and downloading the synthetic version of the original dataset.
You won’t explore every nook and cranny of the user interface in this tutorial. Instead, it gives you a concise overview of the synthetic data generation process while delivering concrete results.
You will see how fun and easy it is!
1. Exploring MOSTLY AI’s main menu
Let’s start by exploring the main menu. Depending on your access privileges, you’ll find two or more of the following options in the left side menu: Jobs, Data Catalogs, Documentation, Settings, and Users.
For this quickstart tutorial, we will just focus on the Jobs option.

Once you click on the Jobs option, a page appears that allows you to set up a new job. If you scroll down, you can see the history of previous jobs. Here, you can download their synthetic datasets and QA reports.
2. Setting up your first job
To set up a new job, you first need to upload a subject table. A subject is an entity or individual whose privacy you are going to protect. This table further contains their features, such as their height, gender, place of residence, or income.
In addition, you can also choose to upload an event table if you want to synthesize the behaviors of these subjects. This may include historical activities, transactional records, and customer journeys.
To help you along, we created two datasets for you:
- US Census Income dataset (download here) with only a subject table.
- Baseball dataset (download here) with a subject table (players.csv) and an event table (seasons.csv).
Learn more about these datasets by visiting the Resources section. Here, you can find more details on their structure and contents.
Please follow the three steps below to configure your first job.
- Drag and drop your tables into their respective upload areas.
  - First, drag us-census-income.csv or players.csv into the left upload area.
  - If you want to synthesize the complete Baseball dataset, click on the Add new table icon and drag seasons.csv to this upload area.
  - Next, click on Proceed and wait for the files to upload.
- Three tabs appear on the screen: Settings, Relationships, and Column details. MOSTLY AI analyzed a sample of the dataset you uploaded and preconfigured the available settings in these tabs.
  If you want to, you could just click on Launch Job and immediately start the synthetic data generation job. Instead, let’s reward our curiosity and explore the various settings that we can configure.
  In the Settings tab shown below, you can specify the number of training and generated subjects. This allows you to do some nifty things, such as increasing the size of small datasets or creating representative subsets of your data. Learn more about these settings in Step 4 of the Ad hoc jobs section.
- The Relationships tab only appears if you uploaded a subject table and an event table. Here you link the two tables by specifying the primary and foreign key.
  If your subject table has an id column and your event table has a column name containing _id, then these tables will be automatically linked.
- In the Column details tab, you can optionally configure how MOSTLY AI will process your database’s columns during synthetization. Here you can include or exclude columns, tailor their processing to your use case, and replace the contents of certain types of columns with mock data.
  The Column details tab is divided into two panes. The left pane lists your dataset’s tables, and the right pane lists the columns of these tables. To learn more about these settings, please visit Step 7 of the Ad hoc jobs section.
- Once you’ve seen everything, scroll down, and click on Launch Job to generate a synthetic version of the uploaded dataset.
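The automatic linking rule described for the Relationships tab can be illustrated with a short Python sketch. This is not MOSTLY AI's implementation: the function name guess_link and the example column names are hypothetical, and the logic only mirrors the rule as stated in this tutorial (an id column in the subject table, and an event-table column whose name contains _id).

```python
from typing import List, Optional, Tuple


def guess_link(subject_columns: List[str],
               event_columns: List[str]) -> Optional[Tuple[str, str]]:
    """Return (primary_key, foreign_key) if the naming rule matches, else None.

    Hypothetical helper: it only imitates the rule described in the tutorial.
    """
    # The subject table must have a column named exactly "id".
    if "id" not in subject_columns:
        return None
    # The first event-table column whose name contains "_id" is the foreign key.
    for column in event_columns:
        if "_id" in column:
            return ("id", column)
    return None


# Made-up column names for illustration; the real files may differ.
players_columns = ["id", "country", "birth_date"]
seasons_columns = ["players_id", "year", "team"]

print(guess_link(players_columns, seasons_columns))  # ('id', 'players_id')
```

If the function returns None for your own tables, that suggests you would need to pick the primary and foreign key manually in the Relationships tab.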
3. Observing the synthesization job
A page appears that informs you about the status of your job.
You can now sit back, relax, and let MOSTLY AI do its work.
There are two sections on this page that inform you about the synthetic data generation process:
- The top section provides general information:
  - Job name: The name of the job as it was specified during its configuration.
  - Job type: There are four job types:
    - Ad hoc synthesizes a dataset uploaded using the web UI.
    - Data catalog synthesizes a database or dataset stored in a cloud bucket or local server.
    - Generate with subject count creates a specified number of new synthetic subjects from a previous job’s readily trained AI model.
    - Generate with seed creates a linked table for an uploaded subject table using a previous job’s readily trained AI model.
  - Uploaded: This field indicates when the original dataset was uploaded.
- The Job summary section tells you which stage the synthesization process is in. Your dataset goes through the following six stages before you can download the synthetic version:
  - Submitted: MOSTLY AI received your dataset and run configuration.
  - Provisioning: Compute resources are being allocated to your run.
  - Encoding: Your dataset is analyzed for its data types and unique values and transformed for efficient processing.
  - Training: Using generative neural networks, a model is trained to retain your dataset’s granularity, statistical correlations, structures, and time-dependencies.
  - Generating: Without having access to your dataset, MOSTLY AI uses the resulting model to create a synthetic version of your dataset.
  - Analyzing: The resulting synthetic copy is tested against the original data for accuracy and privacy. The test checks for identical information matches and whether the synthetic subjects are dissimilar enough to the original subjects to prevent re-identification. MOSTLY AI discards the original dataset once this stage is completed.
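To give a rough feel for what an "identical information matches" check in the Analyzing stage could look like, here is a deliberately simplified Python sketch. It is an assumption-laden illustration, not MOSTLY AI's actual privacy test: the function name count_identical_matches and the toy rows are made up, and a real test would also measure how close non-identical records are.

```python
def count_identical_matches(original_rows, synthetic_rows):
    """Count synthetic rows that exactly equal some original row.

    Hypothetical helper for illustration only.
    """
    original_set = {tuple(row) for row in original_rows}
    return sum(1 for row in synthetic_rows if tuple(row) in original_set)


# Toy values, not taken from any real dataset.
original = [("39", "Bachelors", "<=50K"), ("50", "Masters", ">50K")]
synthetic = [("41", "Bachelors", "<=50K"), ("50", "Masters", ">50K")]

matches = count_identical_matches(original, synthetic)
share = matches / len(synthetic)
print(f"{matches} identical row(s), {share:.0%} of the synthetic data")
```

Some exact matches can occur by chance in data with few columns or few distinct values, so a count like this is only meaningful when compared against a baseline.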
4. Evaluating and downloading your synthetic data
Once MOSTLY AI completes the synthetization job, a QA report appears with general information about the synthetic data’s accuracy, whether it passed the privacy tests, the number and type of columns, and the number of generated subjects.

The green Overall accuracy and Privacy tests checkmarks indicate that the synthetization was successful and that you can share the synthetic data across your business and partnerships.
If you want to see the detailed privacy and accuracy charts, click on View full report and scroll down. We also created a guide to help you better understand all the privacy and accuracy metrics.

To download your synthetic data, click on View job summary and then on Download synthetic data. You can find this button in the Generating section of the job summary.

On your computer, navigate to the Downloads folder and open the .ZIP file. You’ll find the same number of .CSV files as in your original dataset, with the same columns and number of rows.
What’s changed is that the content is entirely composed of fictional characters that do not reveal any information on your original subjects.
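If you would like to confirm this yourself, the following Python sketch compares an original CSV file with its counterpart inside the downloaded archive. The function names and the file names in the usage note are hypothetical, and the sketch assumes UTF-8 encoded files with a header row and an unchanged number of generated subjects.

```python
import csv
import io
import zipfile


def summarize_csv_text(text):
    """Return (column names, number of data rows) for CSV content."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return [], 0
    return rows[0], len(rows) - 1  # first row is assumed to be the header


def summarize_zip(zip_path):
    """Map each .csv member of a ZIP archive to its (columns, row count)."""
    summary = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".csv"):
                text = archive.read(name).decode("utf-8")
                summary[name] = summarize_csv_text(text)
    return summary


def same_shape(original_path, zip_path, member_name):
    """True if the archived CSV has the same columns and row count as the original."""
    with open(original_path, newline="", encoding="utf-8") as handle:
        original = summarize_csv_text(handle.read())
    return summarize_zip(zip_path).get(member_name) == original
```

For example, same_shape("players.csv", "synthetic.zip", "players.csv") would return True when both files share the same header and row count. The archive and member names here are placeholders, so check the actual names in your download.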
5. Further reading
Congratulations, you’ve created your first synthetic dataset in only a few easy steps!
If you want to learn more about all the configuration options available to you, please read our comprehensive guide to configuring a synthetization job.
We also created a guide to help you better understand all the details of the QA report. This report provides an in-depth analysis of how accurately your original dataset’s statistical features were reproduced in the synthetic version.