Creating your first AI-powered synthetic dataset takes very little effort.

This tutorial guides you through the steps of downloading a dataset, uploading it to MOSTLY AI, starting the synthesization job, and downloading the synthetic version of the original dataset.

You won’t explore every nook and cranny of the user interface in this tutorial. Instead, it gives you a concise overview of the synthetic data generation process while delivering concrete results.

You will see how fun and easy it is!

1. Exploring MOSTLY AI’s main menu

Let’s start by exploring the main menu. Depending on your access privileges, you’ll find two or more of the following options in the left side menu: Jobs, Data Catalogs, Documentation, Settings, and Users.

For this quickstart tutorial, we will just focus on the Jobs option.

MOSTLY AI’s User Interface

Once you click the Jobs option, a page appears where you can set up a new job. If you scroll down, you can see the history of previous jobs. Here, you can download their synthetic datasets and QA reports.

2. Setting up your first job

To set up a new job, you first need to upload a subject table. A subject is an entity or individual whose privacy you want to protect. The subject table contains one row per subject along with their features, such as height, gender, place of residence, or income.

In addition, you can also choose to upload an event table if you want to synthesize the behaviors of these subjects. This may include historical activities, transactional records, and customer journeys.
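
To make the distinction concrete, here is a minimal Python sketch of how a subject table and an event table relate to each other. The column names and values are invented for illustration and are not taken from the sample datasets. The only assumption is that each event row refers back to exactly one subject through a key column, here called subject_id.

```python
from collections import defaultdict

# Hypothetical subject table: one row per subject (illustrative values only).
subjects = [
    {"id": 1, "gender": "F", "residence": "Vienna"},
    {"id": 2, "gender": "M", "residence": "Graz"},
]

# Hypothetical event table: zero or more rows per subject,
# each pointing back to its subject through subject_id.
events = [
    {"subject_id": 1, "date": "2021-01-05", "amount": 19.90},
    {"subject_id": 1, "date": "2021-02-11", "amount": 5.00},
    {"subject_id": 2, "date": "2021-01-20", "amount": 42.50},
]


def group_events_by_subject(event_rows, key="subject_id"):
    """Collect the event rows that belong to each subject."""
    grouped = defaultdict(list)
    for row in event_rows:
        grouped[row[key]].append(row)
    return dict(grouped)


events_by_subject = group_events_by_subject(events)
for subject in subjects:
    history = events_by_subject.get(subject["id"], [])
    print(subject["id"], subject["residence"], len(history), "events")
```

Each subject appears once in the subject table, while its behavior over time is spread across several rows of the event table.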

To help you along, we created two datasets for you:

  • US Census Income dataset (download here) with only a subject table.

  • Baseball dataset (download here) with a subject table (players.csv) and an event table (seasons.csv).

Learn more about these datasets by visiting the Resources section.
Here, you can find more details on their structure and contents.

Please follow the five steps below to configure your first job.

  1. Drag and drop your tables into their respective upload areas.

    • First, drag us-census-income.csv or players.csv into the left upload area.

    • If you want to synthesize the complete Baseball dataset, click on the Add new table icon and drag seasons.csv to this upload area.

    • Next, click on Proceed and wait for the files to upload.

      Upload files


  2. Three tabs appear on the screen: Settings, Relationships, and Column details. MOSTLY AI has analyzed a sample of the dataset you uploaded and preconfigured the available settings in these tabs.

    If you want, you can click on Launch Job right away to start the synthetic data generation job. Instead, let’s reward our curiosity and explore the various settings that we can configure.

    In the Settings tab shown below, you can specify the number of training subjects and the number of generated subjects. This allows you to do some nifty things, such as increasing the size of small datasets or creating representative subsets of your data. Learn more about these settings in Step 4 of the Ad hoc jobs section.

    General settings tab


  3. The Relationships tab only appears if you uploaded both a subject table and an event table. Here you link the two tables by specifying the primary key of the subject table and the foreign key of the event table.

    If your subject table has an id column and your event table has a column name containing _id, then these tables will be automatically linked.

    Edit relationships tab


  4. In the Column details tab, you can optionally configure how MOSTLY AI will process your dataset’s columns during synthetization. Here you can include or exclude columns, tailor their processing to your use case, and replace the contents of certain types of columns with mock data.

    The Column details tab is divided into two panes. The left pane lists your dataset’s tables, and the right pane lists the columns of these tables. To learn more about these settings, please visit Step 7 of the Ad hoc jobs section.

    Column details tab


  5. Once you’ve seen everything, scroll down, and click on Launch Job to generate a synthetic version of the uploaded dataset.
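
The automatic linking rule described for the Relationships tab can be sketched in a few lines of Python. This is only one reading of the rule as stated in this tutorial, not MOSTLY AI’s actual implementation. The function name guess_relationship is hypothetical, and the tutorial does not say what happens when several event columns contain _id, so picking the first one is an assumption.

```python
from typing import Optional, Sequence, Tuple


def guess_relationship(
    subject_columns: Sequence[str],
    event_columns: Sequence[str],
) -> Optional[Tuple[str, str]]:
    """Return (primary_key, foreign_key) if the naming rule applies, else None."""
    # The rule requires a column named exactly "id" in the subject table.
    if "id" not in subject_columns:
        return None
    # Assumption: take the first event column whose name contains "_id".
    for column in event_columns:
        if "_id" in column:
            return ("id", column)
    return None


# Hypothetical column names, for illustration only.
print(guess_relationship(["id", "name"], ["player_id", "year"]))
print(guess_relationship(["name"], ["player_id", "year"]))
```

If the rule does not apply to your column names, you link the tables manually in the Relationships tab.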

3. Observing the synthesization job

A page appears that informs you about the status of your job.
You can now sit back, relax, and let MOSTLY AI do its work.

There are two sections on this page that inform you about the synthetic data generation process:

  1. The top section provides general information:

    Job name

    The name of the job as it was specified during its configuration.

    Job type

    There are four job types:

    • Ad hoc synthesizes a dataset uploaded using the web UI.

    • Data catalog synthesizes a database or dataset stored in a cloud bucket
      or local server.

    • Generate with subject count creates a specified number of new synthetic
      subjects from a previous job’s readily trained AI model.

    • Generate with seed creates a linked table for an uploaded subject table
      using a previous job’s readily trained AI model.

    Uploaded

    This field indicates when the original dataset was uploaded.

    Job summary: dataset details


  2. The Job summary section tells you which stage the synthesization process is in.
    Your dataset goes through the following six stages before you can download the synthetic version:

    Submitted

    MOSTLY AI received your dataset and run configuration.

    Provisioning

    Compute resources are being allocated to your run.

    Encoding

    Your dataset is analyzed for its data types and unique values and transformed for efficient processing.

    Training

    Using generative neural networks, a model is trained to retain your dataset’s granularity, statistical correlations, structures, and time-dependencies.

    Generating

    Without having access to your dataset, MOSTLY AI uses the resulting model to create a synthetic version of your dataset.

    Analyzing

    The resulting synthetic copy is tested against the original data for accuracy and privacy. The analysis checks for identical information matches and whether the synthetic subjects are dissimilar enough to the original subjects to prevent re-identification. MOSTLY AI discards the original dataset once this stage is completed.

    Job summary: execution details
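
To give a feel for what a check for identical information matches involves, here is a deliberately simplified Python sketch. It only counts synthetic rows that are exact copies of an original row. It is not MOSTLY AI’s privacy test, which this tutorial does not describe in detail, and the function name and sample values are invented for illustration.

```python
def count_identical_rows(original_rows, synthetic_rows):
    """Count synthetic rows whose values all equal those of some original row."""
    original_set = {tuple(row) for row in original_rows}
    return sum(1 for row in synthetic_rows if tuple(row) in original_set)


# Illustrative data only: age, gender, place of residence.
original = [["34", "F", "Vienna"], ["51", "M", "Graz"]]
synthetic = [["36", "F", "Vienna"], ["51", "M", "Graz"], ["29", "M", "Linz"]]

matches = count_identical_rows(original, synthetic)
print(f"{matches} of {len(synthetic)} synthetic rows are exact copies")
```

An exact copy is not necessarily a privacy leak, since common combinations of values can occur by chance, which is one reason the Analyzing stage also looks at how dissimilar the synthetic subjects are from the original ones.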


4. Evaluating and downloading your synthetic data

Once MOSTLY AI completes the synthetization job, a QA report appears with general information about the synthetic data’s accuracy, whether it passed the privacy tests, the number and type of columns, and the number of generated subjects.

QA report

The green Overall accuracy and Privacy tests checkmarks indicate that the synthetization was successful and that you can share the synthetic data across your business and partnerships.

If you want to see the detailed privacy and accuracy charts, click on View full report and scroll down. We also created a guide to help you better understand all the privacy and accuracy metrics.

QA report

To download your synthetic data, click on View job summary and then on Download synthetic data. You can find this button in the Generating section of the job summary.

Download

On your computer, navigate to the Downloads folder and open the .ZIP file. You’ll find the same number of .CSV files as in your original dataset, with the same columns and the same number of rows.

What’s changed is that the content now consists entirely of fictional subjects that do not reveal any information about your original subjects.
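
If you would like to check the file count, columns, and row counts without opening each file by hand, a short Python script using only the standard library can summarize the archive. This is a sketch under a few assumptions: the CSV files are UTF-8 encoded, the first row of each file is the header, and the helper name summarize_csv_zip is made up for this example. The demo builds a small in-memory archive so that it runs on its own. For a real download, you would pass the path of your .ZIP file instead.

```python
import csv
import io
import zipfile


def summarize_csv_zip(zip_source):
    """Map each CSV file name in the archive to (column names, data row count)."""
    summary = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue
            with archive.open(name) as raw:
                # Assumption: UTF-8 text with a header in the first row.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                reader = csv.reader(text)
                header = next(reader, [])
                row_count = sum(1 for _ in reader)
            summary[name] = (header, row_count)
    return summary


# Demo with an in-memory archive standing in for the downloaded file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("players.csv", "id,name\n1,a\n2,b\n")
buffer.seek(0)

summary = summarize_csv_zip(buffer)
for name, (header, row_count) in summary.items():
    print(name, header, row_count)
```

Running the same function on an archive of your original files, or comparing against their known column lists and row counts, lets you confirm that the structure matches.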

5. Further reading

Congratulations, you’ve created your first synthetic dataset in only a few easy steps!

If you want to learn more about all the configuration options available to you, please read our comprehensive guide to configuring a synthetization job.

We also created a guide to help you better understand all the details of the QA report. This report provides an in-depth analysis of how accurately your original dataset’s statistical features were reproduced in the synthetic version.