Creating your first AI-powered synthetic dataset takes very little effort.

This tutorial guides you through the steps of downloading a dataset, uploading it to MOSTLY AI, starting the synthesization job, and downloading the synthetic version of the original dataset.

You won’t explore every nook and cranny of the user interface in this tutorial. Instead, it gives you a concise overview of the synthetic data generation process while delivering concrete results.

You will see how fun and easy it is!

1. Exploring MOSTLY AI’s main menu

Let’s start by exploring the main menu. Depending on your access privileges, you’ll find two or more of the following options in the left side menu: Jobs, Data Catalogs, Documentation, Settings, and Users.

For this quickstart tutorial, we will just focus on the Jobs option.

MOSTLY AI’s User Interface

Once you click on the Jobs option, a page appears that allows you to set up a new job. If you scroll down, you can see the history of previous jobs. Here, you can download their synthetic datasets and QA reports.

2. Setting up your first job

To set up a new job, you first need to upload a subject table. A subject is an entity or individual whose privacy you are going to protect. The subject table also contains their features, such as their height, gender, place of residence, or income.

In addition, you can also choose to upload a linked table if you want to synthesize the behaviors of these subjects. This may include historical activities, transactional records, and customer journeys.
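To make the distinction between the two table types concrete, here is a minimal Python sketch that models a subject table and a linked table as in-memory CSV data and checks that every linked row points to an existing subject. The column names (id, players_id, and so on) and the sample values are hypothetical; they are not taken from the Baseball dataset.

```python
import csv
import io

# Hypothetical subject table: one row per subject (the entity to protect).
SUBJECTS_CSV = """id,country,height
p1,USA,180
p2,CAN,175
"""

# Hypothetical linked table: behavioral records tied to a subject by a foreign key.
LINKED_CSV = """players_id,year,games
p1,2001,50
p1,2002,61
p2,2001,12
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def orphaned_rows(subjects, linked, primary_key, foreign_key):
    """Return linked rows whose foreign key matches no subject's primary key."""
    known_ids = {row[primary_key] for row in subjects}
    return [row for row in linked if row[foreign_key] not in known_ids]


subjects = read_rows(SUBJECTS_CSV)
linked = read_rows(LINKED_CSV)
orphans = orphaned_rows(subjects, linked, "id", "players_id")
print(f"{len(subjects)} subjects, {len(linked)} linked rows, {len(orphans)} orphans")
```

A check like this can be useful before uploading your own data, since linked rows without a matching subject usually point to a problem in the export.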

To help you along, we created two datasets for you:

  • US Census Income dataset (download here) with only a subject table.

  • Baseball dataset (download here) with a subject table (players.csv) and a linked table (seasons.csv).

Learn more about these datasets by visiting the Resources section.
Here, you can find more details on their structure and contents.

If you want to synthesize your own dataset, we recommend checking out the Preparing your dataset section.

Please follow the steps below to configure your first job.

  1. Drag and drop your tables into their respective upload areas.

    • First, drag us-census-income.csv or players.csv into the left upload area.

    • If you want to synthesize the complete Baseball dataset, click on the Add new table icon and drag seasons.csv to this upload area.

    • Next, click on Proceed and wait for the files to upload.

      [Screenshot: uploading files]


  2. You’ll now see a Settings tab and a Table details tab. If you uploaded two tables in the previous step, you’ll also see a Relationships tab. MOSTLY AI analyzed a sample of the dataset you uploaded and preconfigured the available settings in these tabs.

    If you want to, you could just click on Launch Job and immediately start the synthetic data generation job. Instead, let’s reward our curiosity and explore the various settings that we can configure.

    In the Settings tab shown below, you can specify the number of training and generated subjects. This allows you to do some nifty things, such as increasing the size of small datasets or creating representative subsets of your data. Learn more about these settings in Step 4 of the Ad hoc jobs section.

    [Screenshot: general settings tab]


  3. The Relationships tab only appears if you uploaded a subject table and a linked table. Here you link the two tables by specifying the primary and foreign key.

    If your subject table has an id column and your linked table has a column name containing _id, then these tables will be automatically linked.

    [Screenshot: Relationships tab]


  4. In the Table details tab, you can optionally configure how MOSTLY AI will process your dataset’s columns during synthetization. Here you can include or exclude columns, tailor their processing to your use case, and replace the contents of certain types of columns with mock data.

    The Table details tab is divided into two panes. The left pane lists your dataset’s tables, and the right pane lists the columns of these tables. To learn more about these settings, please visit Step 7 of the Ad hoc jobs section.

    [Screenshot: Table details tab]


  5. Once you’ve seen everything, scroll down, and click on Launch Job to generate a synthetic version of the uploaded dataset.
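The automatic linking rule mentioned in step 3 can be approximated in a few lines of Python. This is only a sketch of the rule as it is worded above, not MOSTLY AI’s implementation. The function name and the decision to pick the first matching column are assumptions.

```python
def guess_relationship(subject_columns, linked_columns):
    """Guess a (primary_key, foreign_key) pair, or return None if the rule does not apply.

    Assumed rule: the subject table must have a column named "id", and the
    first linked-table column whose name contains "_id" becomes the foreign key.
    """
    if "id" not in subject_columns:
        return None
    for column in linked_columns:
        if "_id" in column:
            return ("id", column)
    return None


# Hypothetical column names for illustration only.
print(guess_relationship(["id", "country"], ["players_id", "year"]))
print(guess_relationship(["name", "country"], ["players_id", "year"]))
```

If your column names do not follow this pattern, you can still link the tables by selecting the primary and foreign key yourself in the Relationships tab.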

3. Observing the synthesization job

A page appears that informs you about the status of your job.
You can now sit back, relax, and let MOSTLY AI do its work.

There are two sections on this page that inform you about the synthetic data generation process:

  1. The top section provides general information:

    Job name

    The name of the job as it was specified during its configuration.

    Job type

    There are four job types:

    • Ad hoc synthesizes a dataset uploaded using the web UI.

    • Data catalog synthesizes a database or dataset stored in a cloud bucket
      or local server.

    • Generate with subject count creates a specified number of new synthetic
      subjects from a previous job’s readily trained AI model.

    • Generate with seed creates a linked table for an uploaded subject table
      using a previous job’s readily trained AI model.

    Uploaded

    This field indicates when the original dataset was uploaded.

    [Screenshot: job summary, dataset details]


  2. The Job summary section informs you about the synthesization tasks currently being performed. It shows which tables are being synthesized, the current tasks, their status, and the total duration. In addition, you can click on the kebab icon on the right side of each entry to see a detailed task list or an overview of the columns' generation methods and encoding types.

    [Screenshot: job summary, table list]

    A task list appears when you choose View tasks from the kebab menu. The table below provides an overview of all the tasks and steps you will see in this list.

    [Screenshot: job summary, task list]
    | Task | Step | Description |
    |------|------|-------------|
    | Synthetizing table / Generating text | Organizing data | Ensures that very large tables can be processed regardless of system memory size. |
    | | Data analysis | The table is analyzed for its data types and unique values. |
    | | Transforming data | The table is transformed for efficient processing. |
    | | AI training | Using generative neural networks, a model is trained to retain your dataset’s granularity, statistical correlations, structures, and time-dependencies. |
    | | Generating synthetic data | The resulting AI model is used to create a synthetic version of the table. |
    | Packaging synthetic data | Creating zip archive | Creates a ZIP archive with the synthetic version of the dataset. |
    | Creating the quality assurance report | Analyzing synthetic data for quality and accuracy | The resulting synthetic table is tested against the original for accuracy and privacy. It checks for identical information matches and whether the synthetic subjects are dissimilar enough to the original subjects to prevent re-identification. |

    To follow the progress of the AI training step, click on View training logs. Here you can find a chart depicting the training and validation loss per epoch.

    [Screenshot: job summary, training log]

    If you consider the model to be sufficiently trained, or you want to speed up the synthetization process, you can click on Stop training to skip to the synthetic data generation step.
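One common way to judge whether a model is sufficiently trained is to check whether the validation loss has stopped improving. The sketch below is a generic plateau check on a list of per-epoch validation losses. The patience and min_delta parameters and the sample loss values are illustrative assumptions; they say nothing about how MOSTLY AI itself decides when to end training.

```python
def has_plateaued(validation_losses, patience=3, min_delta=0.001):
    """Return True if the last `patience` epochs did not beat the earlier best loss by `min_delta`."""
    if len(validation_losses) <= patience:
        return False  # not enough history to judge
    best_before = min(validation_losses[:-patience])
    best_recent = min(validation_losses[-patience:])
    return best_recent > best_before - min_delta


# Made-up loss curves, as you might read them off the training log chart.
still_improving = [1.00, 0.80, 0.65, 0.55, 0.48]
flattened = [1.00, 0.80, 0.70, 0.70, 0.71, 0.70]

print(has_plateaued(still_improving))
print(has_plateaued(flattened))
```

With these sample curves, the first call reports no plateau and the second one does, which is the situation in which stopping training early costs little.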

4. Evaluating and downloading your synthetic data

Once MOSTLY AI completes the synthetization job, a QA report appears with general information about the synthetic data’s accuracy, whether it passed the privacy tests, the number and type of columns, and the number of generated subjects.

QA report

The green Overall accuracy and Privacy tests checkmarks indicate that the synthetization was successful and that you can share the synthetic data across your business and partnerships.

If you want to see the detailed privacy and accuracy charts, click on View full report and scroll down. We also created a guide to help you better understand all the privacy and accuracy metrics.
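As a rough illustration of the identical information matches check mentioned in the task overview, the sketch below counts synthetic rows that exactly duplicate an original row. It is a deliberately simplified stand-in, not the test that MOSTLY AI runs, and the rows are made up for the example.

```python
def count_identical_matches(original_rows, synthetic_rows):
    """Count synthetic rows that are exact copies of some original row."""
    original_set = {tuple(row) for row in original_rows}
    return sum(1 for row in synthetic_rows if tuple(row) in original_set)


# Hypothetical rows: (age, work class, education).
original = [("34", "Private", "Bachelors"), ("51", "Self-employed", "Masters")]
synthetic = [("36", "Private", "Bachelors"), ("51", "Self-employed", "Masters")]

matches = count_identical_matches(original, synthetic)
print(f"{matches} of {len(synthetic)} synthetic rows are exact copies")
```

A real privacy assessment also has to consider near matches, which is why the QA report additionally checks how dissimilar the synthetic subjects are from the original ones.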

QA report

To download your synthetic data, go to the job details at the top, click on the kebab icon on the right, and select Download synthetic data from the menu.

Download

On your computer, navigate to the Downloads folder and open the ZIP file. You’ll find the same number of CSV files as in your original dataset, and they have the same columns and number of rows.

What’s changed is that the content is entirely composed of fictional subjects that do not reveal any information about your original subjects.
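If you would rather verify the file count, columns, and row counts programmatically, a short script using only the Python standard library can do so. The sketch below assumes the archive contains UTF-8 encoded CSV files with a header row. The file name players.csv and its contents in the demo are placeholders.

```python
import csv
import io
import zipfile


def summarize_csv_archive(archive):
    """Map each CSV file in a ZIP archive to its column names and data row count.

    `archive` may be a file path or a binary file-like object.
    """
    summary = {}
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if not name.lower().endswith(".csv"):
                continue
            with zf.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                reader = csv.reader(text)
                header = next(reader, [])
                row_count = sum(1 for _ in reader)
            summary[name] = {"columns": header, "rows": row_count}
    return summary


# Demo with a tiny in-memory archive; pass the path of your download instead.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("players.csv", "id,country\np1,USA\np2,CAN\n")
buffer.seek(0)

print(summarize_csv_archive(buffer))
```

Running the same function on an archive of your original files gives you a summary you can compare side by side with the synthetic one.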

If you uploaded a dataset in Parquet format when setting up your job in section 2, your synthetic dataset will also be in this format. In this case, each table may have been split into multiple files. This is a feature of the Parquet format that helps with efficient processing in downstream tasks.

5. Further reading

Congratulations, you’ve created your first synthetic dataset in only a few easy steps!

If you want to learn more about all the configuration options available to you, please read our comprehensive guide to configuring a synthetization job.

We also created a guide to help you better understand all the details of the QA report. This report provides an in-depth analysis of how accurately your original dataset’s statistical features were reproduced in the synthetic version.