In this tutorial, you will learn how to build a machine-learning model that is trained to distinguish between synthetic (fake) and real data records. This can be a helpful tool when you are given a hybrid dataset containing both real and fake records and want to be able to distinguish between them. Moreover, this model can serve as a quality evaluation tool for any synthetic data you generate. The higher the quality of your synthetic data records, the harder it will be for your ML discriminator to tell these fake records apart from the real ones.

You will be working with the UCI Adult Income dataset. The first step will be to synthesize the original dataset. We will start by intentionally creating synthetic data of lower quality in order to make it easier for our “Fake vs. Real” ML classifier to detect a signal and tell the two apart. We will then compare this against a synthetic dataset generated using MOSTLY AI's default high-quality settings to see whether the ML model can tell the fake records apart from the real ones.

The Python code for this tutorial is publicly available and runnable in this Google Colab notebook.

Fig 1 - Generate synthetic data and join this to the original dataset in order to train an ML classifier.

Create synthetic training data

Let’s start by creating our synthetic data:

  1. Download the original dataset here. Depending on your operating system, use either Ctrl+S or Cmd+S to save the file locally. 
  1. Go to your MOSTLY AI account and navigate to “Synthetic Datasets”. Upload the CSV file you downloaded in the previous step and click “Proceed”.
  1. Set the Training Size to 1000. This will intentionally lower the quality of the resulting synthetic data. Click “Create a synthetic dataset” to launch the job.

Fig 2 - Set the Training Size to 1000.

  1. Once the synthetic data is ready, download it to disk as CSV and use the following code to upload it if you’re running in Google Colab or to access it from disk if you are working locally:
# upload synthetic dataset
import pandas as pd

try:
    # check whether we are in Google colab
    from google.colab import files

    print("running in COLAB mode")
    repo = ""
    import io

    # open a file picker and read the first uploaded file
    uploaded = files.upload()
    syn = pd.read_csv(io.BytesIO(list(uploaded.values())[0]))
    print(
        f"uploaded synthetic data with {syn.shape[0]:,} records"
        f" and {syn.shape[1]:,} attributes"
    )
except ImportError:
    # google.colab is not installed, so we are running locally
    print("running in LOCAL mode")
    repo = "."
    print("adapt `syn_file_path` to point to your generated synthetic data file")
    syn_file_path = "./census-synthetic-1k.csv"
    syn = pd.read_csv(syn_file_path)
    print(
        f"read synthetic data with {syn.shape[0]:,} records"
        f" and {syn.shape[1]:,} attributes"
    )

Train your “fake vs real” ML classifier

Now that we have our low-quality synthetic data, let’s use it together with the original dataset to train a LightGBM classifier. 

The first step will be to concatenate the original and synthetic datasets together into one large dataset. We will also create a split column to label the records: the original records will be labeled as REAL and the synthetic records as FAKE.

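# concatenate FAKE and REAL data together
tgt = pd.read_csv(f"{repo}/census-49k.csv")
df = pd.concat([tgt.assign(split="REAL"), syn.assign(split="FAKE")], axis=0)
# move the `split` label to the first column
df.insert(0, "split", df.pop("split"))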

Sample some records to take a look at the complete dataset:


We see that the dataset contains both REAL and FAKE records.

By grouping by the split column and verifying the size, we can confirm that we have an even split of synthetic and original records:


FAKE 48842 
REAL 48842 
dtype: int64
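
The labelling, sampling, and balance check described above can be tried out end to end on a toy example. The following is a minimal sketch, not the tutorial's own code: `tgt_toy` and `syn_toy` are made-up stand-ins for the original and synthetic tables, and their values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the original and synthetic tables (made-up values).
tgt_toy = pd.DataFrame(
    {"age": [39, 50, 38], "education": ["Bachelors", "HS-grad", "Masters"]}
)
syn_toy = pd.DataFrame(
    {"age": [41, 47, 35], "education": ["HS-grad", "Bachelors", "HS-grad"]}
)

# Label each source, stack the rows, and move the label to the first column.
df_toy = pd.concat(
    [tgt_toy.assign(split="REAL"), syn_toy.assign(split="FAKE")],
    ignore_index=True,
)
df_toy.insert(0, "split", df_toy.pop("split"))

# Peek at a few random rows, then count the records per label.
print(df_toy.sample(n=3, random_state=1))
counts = df_toy.groupby("split").size()
print(counts)
```

On the full dataset, the same `groupby("split").size()` call applied to `df` should report equal counts for both labels.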

The next step will be to train your LightGBM model on this complete dataset. The following code contains two helper functions: one to preprocess the data and one to train your model.

import lightgbm as lgb
from lightgbm import early_stopping
from sklearn.model_selection import train_test_split

def prepare_xy(df, target_col, target_val):
    # split target variable `y`
    y = (df[target_col] == target_val).astype(int)
    # convert strings to categoricals, and all others to floats
    str_cols = [
        col
        for col in df.select_dtypes(["object", "string"]).columns
        if col != target_col
    ]
    for col in str_cols:
        df[col] = pd.Categorical(df[col])
    cat_cols = [
        col
        for col in df.select_dtypes("category").columns
        if col != target_col
    ]
    num_cols = [
        col for col in df.select_dtypes("number").columns if col != target_col
    ]
    for col in num_cols:
        df[col] = df[col].astype("float")
    X = df[cat_cols + num_cols]
    return X, y

def train_model(X, y):
    cat_cols = list(X.select_dtypes("category").columns)
    # hold back 20% of the training data to monitor early stopping
    X_trn, X_val, y_trn, y_val = train_test_split(
        X, y, test_size=0.2, random_state=1
    )
    ds_trn = lgb.Dataset(
        X_trn, label=y_trn, categorical_feature=cat_cols, free_raw_data=False
    )
    ds_val = lgb.Dataset(
        X_val, label=y_val, categorical_feature=cat_cols, free_raw_data=False
    )
    # stop once the validation AUC has not improved for 5 rounds
    model = lgb.train(
        params={"verbose": -1, "metric": "auc", "objective": "binary"},
        train_set=ds_trn,
        valid_sets=[ds_val],
        callbacks=[early_stopping(5)],
    )
    return model

Before training, make sure to set aside a holdout dataset for evaluation. Let’s reserve 20% of the records for this:

trn, hol = train_test_split(df, test_size=0.2, random_state=1)

Now train your LightGBM classifier on the remaining 80% of the combined original and synthetic data:

X_trn, y_trn = prepare_xy(trn, 'split', 'FAKE')
model = train_model(X_trn, y_trn)

Training until validation scores don't improve for 5 rounds 
Early stopping, best iteration is: 
[30] valid_0's auc: 0.594648

Next, score the model’s performance on the holdout dataset. We will include the model’s predicted probability for each record. A score of 1.0 indicates that the model is fully certain that the record is FAKE. A score of 0.0 means the model is certain the record is REAL.
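
The scoring step itself can be sketched as a small helper. This is a hedged illustration rather than the tutorial's exact code: the function name `score_holdout` is hypothetical, `prepare` is assumed to behave like `prepare_xy`, and `model.predict` is assumed to return one probability per row, as a LightGBM booster trained with a binary objective does.

```python
def score_holdout(model, hol, prepare, target_col="split", target_val="FAKE"):
    """Return a scored copy of the holdout together with its 0/1 labels.

    Assumptions: `prepare(df, target_col, target_val)` returns `(X, y)`,
    and `model.predict(X)` returns the probability of `target_val` per row.
    """
    hol = hol.copy()  # leave the caller's data untouched
    X_hol, y_hol = prepare(hol, target_col, target_val)
    # store the predicted probability of being FAKE next to each record
    hol["is_fake"] = model.predict(X_hol)
    return hol, y_hol
```

With the objects from this tutorial, the call would look like `hol, y_hol = score_holdout(model, hol, prepare_xy)`, after which `hol.sample(n=5)` shows a few scored records.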

Let’s sample some random records to take a look:


We see that the model assigns varying probabilities to the REAL and FAKE records. In some cases it cannot predict with much confidence (scores around 0.5), and in others it is quite confident and also correct: see the 0.0727 for a REAL record and 0.8006 for a FAKE record.

Let’s visualize the model’s overall performance by calculating the AUC and Accuracy scores and plotting the probability scores:

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, accuracy_score

# overall metrics on the holdout; predictions above 0.5 count as FAKE
auc = roc_auc_score(y_hol, hol.is_fake)
acc = accuracy_score(y_hol, (hol.is_fake > 0.5).astype(int))
# align probabilities and labels side by side for plotting
probs_df = pd.concat(
    [
        pd.Series(hol.is_fake, name="probability").reset_index(drop=True),
        pd.Series(y_hol, name="target").reset_index(drop=True),
    ],
    axis=1,
)
fig = sns.displot(
    data=probs_df, x="probability", hue="target", bins=20, multiple="stack"
)
fig = plt.title(f"Accuracy: {acc:.1%}, AUC: {auc:.1%}")

AUC probability curve - fake or real

As you can see from the chart above, the discriminator has learned to pick up some signals that allow it to determine, with varying levels of confidence, whether a record is FAKE or REAL. The AUC can be interpreted as the percentage of cases in which the discriminator correctly spots the FAKE record when given a pair consisting of one FAKE and one REAL record.
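
This pairwise reading of the AUC can be checked by brute force. The sketch below is illustrative only: `pairwise_auc` is a hypothetical helper, the labels and scores are made up, and ties are counted as half a win.

```python
from itertools import product

def pairwise_auc(labels, scores):
    """Share of (FAKE, REAL) pairs in which the FAKE record scores higher.

    Labels: 1 = FAKE, 0 = REAL. A tie counts as half a win.
    """
    fake = [s for y, s in zip(labels, scores) if y == 1]
    real = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if f > r else 0.5 if f == r else 0.0
        for f, r in product(fake, real)
    )
    return wins / (len(fake) * len(real))

# Made-up predictions: three FAKE and three REAL records.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.80, 0.55, 0.40, 0.45, 0.30, 0.07]
print(pairwise_auc(labels, scores))  # 8 of the 9 pairs are ranked correctly
```

The loop is quadratic in the number of records, so it is only meant for building intuition. `roc_auc_score` computes the same quantity far more efficiently.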

Let’s dig a little deeper by looking specifically at records that seem very fake and records that seem very real. This will give us a better understanding of the type of signals the model is learning.

Go ahead and sample some random records which the model has assigned a particularly high probability of being FAKE:
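
One way to pull such records is to sort by the predicted probability, keep the most extreme rows, and sample from that pool. This is a minimal sketch under assumptions: the helper name `sample_extremes` and its parameters are hypothetical, the scored data is assumed to carry the probability in an `is_fake` column, and the toy values are made up.

```python
import pandas as pd

def sample_extremes(scored, n_pool=100, n_show=5, most_fake=True, seed=1):
    """Randomly pick `n_show` rows from the `n_pool` most extreme scores."""
    ordered = scored.sort_values("is_fake")
    # highest scores look most FAKE, lowest scores look most REAL
    pool = ordered.tail(n_pool) if most_fake else ordered.head(n_pool)
    return pool.sample(n=min(n_show, len(pool)), random_state=seed)

# Toy scored holdout (made-up values).
toy = pd.DataFrame(
    {
        "split": ["FAKE", "REAL", "FAKE", "REAL", "FAKE", "REAL"],
        "is_fake": [0.91, 0.07, 0.52, 0.48, 0.80, 0.15],
    }
)
print(sample_extremes(toy, n_pool=2, n_show=2))                   # most FAKE-looking
print(sample_extremes(toy, n_pool=2, n_show=2, most_fake=False))  # most REAL-looking
```

Setting `most_fake=False` gives the opposite end of the distribution, which is useful for inspecting records the model considers clearly REAL.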


In these cases, it seems to be the mismatch between the education and education_num columns that gives away the fact that these are synthetic records. In the original data, these two columns have a 1:1 mapping of numerical to textual values. For example, the education value Some-college is always mapped to the numerical education_num value 10.0. In this poor-quality synthetic data, we see that there are multiple numerical values for the Some-college value, thereby giving away the fact that these records are fake.
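
A quick consistency check makes this kind of violation visible. The sketch below is an illustration on toy data, not part of the original tutorial: `mapping_violations` is a hypothetical helper that lists every `education` value mapped to more than one `education_num`.

```python
import pandas as pd

def mapping_violations(df, key="education", value="education_num"):
    """Return the keys that map to more than one distinct value."""
    n_values = df.groupby(key)[value].nunique()
    return n_values[n_values > 1]

# Toy data: the "real" frame keeps the 1:1 mapping, the "fake" one breaks it.
real_toy = pd.DataFrame(
    {
        "education": ["Some-college", "Some-college", "Bachelors"],
        "education_num": [10.0, 10.0, 13.0],
    }
)
fake_toy = pd.DataFrame(
    {
        "education": ["Some-college", "Some-college", "Bachelors"],
        "education_num": [10.0, 9.0, 13.0],
    }
)
print(mapping_violations(real_toy))  # empty: every value maps to one number
print(mapping_violations(fake_toy))  # flags Some-college
```

Running the same helper on the original and on the synthetic dataset should show whether the synthesizer preserved the deterministic relationship between the two columns.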

Now let’s take a closer look at records which the model is especially certain are REAL:


These “obviously real” records are of types that the synthesizer has apparently failed to create. Because such records are absent from the synthetic data, the discriminator recognizes them as REAL.
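
One rough way to look for such gaps is to list value combinations that occur in the original data but never in the synthetic data. This is only a sketch under assumptions: `unseen_in_synthetic` is a hypothetical helper, the column choice is illustrative, and an exact-match comparison like this only makes sense for categorical columns.

```python
import pandas as pd

def unseen_in_synthetic(real, syn, cols):
    """Return combinations of `cols` that occur in `real` but never in `syn`."""
    real_combos = real[cols].drop_duplicates()
    syn_combos = syn[cols].drop_duplicates()
    # a left join with an indicator marks combinations missing on the right
    merged = real_combos.merge(syn_combos, on=cols, how="left", indicator=True)
    missing = merged.loc[merged["_merge"] == "left_only", cols]
    return missing.reset_index(drop=True)

# Toy frames (made-up values).
real_toy = pd.DataFrame(
    {
        "workclass": ["Private", "Private", "Federal-gov"],
        "occupation": ["Sales", "Tech-support", "Armed-Forces"],
    }
)
syn_toy = pd.DataFrame(
    {
        "workclass": ["Private", "Private"],
        "occupation": ["Sales", "Tech-support"],
    }
)
print(unseen_in_synthetic(real_toy, syn_toy, ["workclass", "occupation"]))
```

Combinations returned by this check are candidates for the kind of record the discriminator can confidently label as REAL.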

Generate high-quality synthetic data with MOSTLY AI

Now, let’s synthesize the original dataset again, this time using MOSTLY AI’s default settings for high-quality synthetic data. Run the same steps as before, except this time leave the Training Size field blank. This will use all the records for model training, which should yield the highest-quality synthetic data.

Fig 3 - Leave the Training Size blank to train on all available records.

Once the job has completed, download the high-quality data as CSV and then upload it to wherever you are running your code. 

Make sure that the syn variable now contains the new, high-quality synthesized data. Then re-run the code you ran earlier to concatenate the synthetic and original data together, train a new LightGBM model on the complete dataset, and evaluate its ability to tell the REAL records from FAKE.

Again, let’s visualize the model’s performance by calculating the AUC and Accuracy scores and by plotting the probability scores:

probability score of ML classifier for combined synthetic and real data records

This time, we see that the model’s performance has dropped significantly. The model is not really able to pick up any meaningful signal from the combined data and assigns the largest share of records a probability around the 0.5 mark, which is essentially the equivalent of flipping a coin.

This means that the data you have generated using MOSTLY AI’s default high-quality settings is so similar to the original, real records that it is almost impossible for the model to tell them apart. 

Classifying “fake vs real” records with MOSTLY AI

In this tutorial, you have learned how to build a machine learning model that can distinguish between fake (i.e. synthetic) and real data records. You have synthesized the original data using MOSTLY AI and evaluated the resulting model by looking at multiple performance metrics. By comparing the model performance on both an intentionally low-quality synthetic dataset and MOSTLY AI’s default high-quality synthetic data, you have seen firsthand that the synthetic data MOSTLY AI delivers is so statistically representative of the original data that a top-notch LightGBM model was practically unable to tell these synthetic records apart from the real ones.

If you are interested in comparing performance across various data synthesizers, you may want to check out our benchmarking article which surveys 8 different synthetic data generators.

What’s next?

In addition to walking through the above instructions, we suggest:

  • measuring the discriminator's AUC if more training samples are used,
  • using a different dataset,
  • using a different ML model for the discriminator, e.g. a RandomForest model,
  • generating synthetic data with MOSTLY AI using your own dataset,
  • using a different synthesizer, e.g. SynthCity, SDV, etc.

You can also head straight to the other synthetic data tutorials: