How much does bad data cost you every year? IBM has estimated that bad data costs over $3.1 trillion per year. From the first artificial neuron models in 1943 to the popularization of neural networks in the 1980s to large language models in the late 2010s, machine learning models have evolved quickly. The original bounding factor for machine learning models was computational power.

While computational power was the main constraint for machine learning models, the main innovation was model architecture. As computational power increased in line with Moore’s law, the bounding factor shifted in the late 2000s. The 2010s saw the rise of “big data”, and data is now the most important factor in your machine learning strategy.

What’s one of the biggest limiters when it comes to machine learning data? The amount and quality of your labeled data. There are tons of data out there, but there isn’t enough labeled data for machine learning. On top of that, not all labeled data is good data. If you want the best machine learning models, you need to ensure that your data is of the highest quality.

There are a few ways to gather machine learning data in large enough quantities to train quality models. We can obviously go down the path of gathering and labeling real world data, but as we talked about above, this can be quite a difficult and expensive task. Recently, there’s been a new development in machine learning data - synthetic data. Synthetic data is data that is statistically similar to an input dataset.

The beauty of using synthetic data for machine learning is that you can train models on high quality data without having to gather tons of data. We can use synthetic data to generate 5x, 10x, 100x, or more of the data that we’ve gathered. This can sharply lower our cost of data acquisition and can increase the accuracy of our machine learning models.

In this article, we’re going to discuss the importance of quality data and how to gather that data for machine learning.

What is the impact of data for machine learning?

“Garbage in, garbage out” - George Fuechsel
Image from Wikimedia

The ideas behind machine learning date back to the 1940s. Neural networks trace back to the McCulloch-Pitts artificial neuron of 1943, which led to Frank Rosenblatt’s perceptron in the late 1950s. The image above shows a simple perceptron. The example perceptron takes two inputs, does a calculation, and spits out a result. The calculation requires “weights” applied to each input and classically also includes a “bias” - a constant that is always added to the calculation. By putting many perceptrons together, we get a neural network.
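To make the calculation concrete, here is a minimal sketch of a two-input perceptron in Python. The weights and the bias are hypothetical values picked by hand so that the perceptron behaves like a logical AND. They are not taken from any trained model.

```python
def perceptron(inputs, weights, bias):
    """Return 1 if the weighted sum of the inputs plus the bias is positive, else 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if weighted_sum > 0 else 0


# Hypothetical, hand-picked values: only (1, 1) pushes the sum above zero.
weights = [1.0, 1.0]
bias = -1.5

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", perceptron([a, b], weights, bias))
```

Changing the weights or the bias changes which inputs produce a 1, and that is exactly what “learning” adjusts.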

A brief guide to neural networks

How do neural networks facilitate machine learning and what is the impact of data for machine learning? Perceptrons and neural networks “learn” by changing their weights and biases. When the output from a neural network is wrong, it performs backpropagation to update its weights and biases. This means that there are two main factors that affect the performance of a machine learning model. These factors are the architecture of the neural network and the data that you gather for machine learning.

Before we dive more into the impact of data for machine learning, let’s do a brief tour of neural network architectures. Two of the best known neural network architectures are Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). Within these two architectures exist many variations based on the type of calculation done inside of each “node” or perceptron. For example, RNNs have two well known variations - Long Short Term Memory (LSTM) and Gated Recurrent Units (GRU). On the most basic level, RNNs are most commonly used for natural language processing and CNNs are most commonly used for image recognition.

How does a machine learn?

Other than architecture, the other piece of the puzzle for machine learning is data. “Big data” has been all the rage since the mid 2010s. Why? Because new applications are the main driver of change in machine learning model architecture, and we hit an inflection point: we realized that we didn’t need to change our models much anymore for our three main applications of machine learning - natural language processing, computer vision, and predictions on tabular data.

Language processing can include using data for machine learning on text or voice. Computer vision can involve reading text from images, but it is mostly applied to tasks that rely on sight, such as recognizing objects. Machine learning for tabular data is mainly used to predict outcomes from existing data.

Once we realized that we had tackled most of the ways we could apply different neural networks to real-life problems, we realized that we had another problem: data. Once we have a network architecture, the data we input into our machine learning system is the most important aspect. As we said above, garbage in, garbage out. We call it machine learning because the machine (neural network) “learns” from experience. The experience that the machine learns from is the data that we give it.

Neural networks are trained (“learn”) in epochs. When we begin to train a neural network, we first randomize our starting weights and biases. Within each training epoch, our neural network makes a series of predictions based on the input data. In the classic perceptron learning rule, the network does not update any weights or biases when it correctly predicts the output. When it predicts incorrectly, it changes its weights and biases. Modern neural networks generalize this idea: they measure how far each prediction is from the expected output and use backpropagation to adjust the weights and biases in proportion to that error. Either way, every update is driven by the training examples, which is why, in today’s machine learning paradigm, the most important thing is the data we use.
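The training loop can be sketched in a few lines. This is a minimal illustration of the classic perceptron learning rule on a single perceptron, not backpropagation through a full network. The function names, the learning rate, the number of epochs, and the tiny OR dataset are all assumptions made for this example.

```python
import random


def predict(inputs, weights, bias):
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0


def train(samples, epochs=200, learning_rate=0.1, seed=0):
    """Train one perceptron; weights change only when a prediction is wrong."""
    rng = random.Random(seed)
    # Start from random weights and a random bias.
    weights = [rng.uniform(-0.5, 0.5) for _ in samples[0][0]]
    bias = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):
        mistakes = 0
        for inputs, target in samples:
            error = target - predict(inputs, weights, bias)
            if error != 0:  # wrong prediction: nudge weights toward the target
                mistakes += 1
                weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
                bias += learning_rate * error
        if mistakes == 0:  # a full epoch without errors: nothing left to learn
            break
    return weights, bias


# Made-up training data: the logical OR of two inputs.
or_samples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train(or_samples)
print([predict(x, weights, bias) for x, _ in or_samples])
```

If the labels in `or_samples` were wrong, the perceptron would faithfully learn the wrong function, which is the “garbage in, garbage out” problem in miniature.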

What are your options for getting machine learning data?

There are four main sources for machine learning data. You can use your customer data, you can buy data from a third party, you can obtain data by scraping the web, and you can use synthetic data generated from real world seed data to protect customer privacy. These sources fall into two categories: real world data and synthetic data.

The two easiest ways to get data for machine learning are using your customer data and buying data from a third party. Unfortunately, these are not always the best ways. Customer data can be sensitive and third party data may be useless, defective, or expensive. Web scraping is a cheaper alternative to buying data, but the data will likely be extremely messy. Synthetic data requires some real world seed data, but may be the strong alternative you need for your advanced analytics, AI, and machine learning needs. Let’s take a closer look at all of our options.

Real world data for machine learning

Your closest source for real world data for machine learning is your customer data. Not only is customer data the most relevant data option for anything your company does, it’s already there. You probably have tons of customer data sitting in your data warehouse. However, using it can be dangerous, break trust, and call your AI ethics into question.

If you operate in Europe, the GDPR restricts how you can use customer data: you need a lawful basis, such as explicit customer consent, before you process personal data. It’s important to protect the privacy of the individuals and companies using your product. Protecting customer privacy creates trust for you and your company. However, legacy data anonymization tools, like data masking, degrade customer data and make it less suitable for machine learning use cases. In practice, you can often only use customer data after your data privacy team has scrubbed much of the useful information out of it.

What about buying data? You can find many places to buy data on the internet. If you look it up on Google, you can find hundreds of companies selling datasets. The challenge is finding the right datasets for your needs. It’s like finding a needle in a haystack. Of course, you could hire companies that go out and find or gather the exact type of data you need. That endeavor will be both time consuming and expensive. In a world where data moves quickly, the data you get back may be out of date by the time you get it, and there may still be privacy risks and data cleaning tasks involved.

Aside from using customer data or buying data, we can scrape data from the web. Scraping gives us a lower cost way to gather data for machine learning than buying it, and it does not touch our own customers’ data. However, web data is noisy and it is difficult to get the exact data we need for our machine learning tasks. In addition, there are rising concerns around the security and privacy of scraping data from the web.

Using synthetic data generation

The fourth option we can explore for getting machine learning data is using synthetically generated data. Synthetically generated data is data that is generated from real seed data. It is nearly statistically identical to the seed dataset. It provides two main advantages over using a real world dataset. First, it provides privacy without having to mask an overwhelming amount of useful data. Second, it provides a way to get more data without needing more customers, needing to buy the data, or needing to mine more data.

Getting customer data, buying data, and scraping data from the web all have clear, straightforward methods for obtaining the data involved. What about getting synthetic data? You can find ways to generate it yourself, use an open source tool, or use MOSTLY AI to generate synthetic data. Let’s take a look at the simplest way to get synthetic data - using MOSTLY AI.

Once you create an account and log in, the site walks you through the first steps to creating a synthetic dataset. The platform provides a sample dataset from the US Census that we can upload and use to create a sample synthetic dataset. The image below shows an example of the output from following the basic tutorial.

MOSTLY AI’s quality assurance report shows us four major blocks to consider. The first is the accuracy of the data: how statistically similar it is to the seed dataset. The second is the privacy tests. Whether the data protects privacy well depends on the number of duplicates and on two nearest neighbor tests. The first test, Distance to Closest Record (DCR), measures the distance to the nearest neighbor. The second, Nearest Neighbor Distance Ratio (NNDR), measures the ratio of the distance to the 1st nearest neighbor to the distance to the 5th nearest neighbor.

Under the privacy tests block, there is one more important item to note - the identical match share. It tells us what percentage of the synthetic data consists of exact duplicates of records in the original data, alongside the same percentage for a holdout dataset. You want to be sure that the synthetic data match share is not higher than the holdout data match share. The other two blocks to look at are the number of columns and the amount of synthetic data. The number of columns gives us a breakdown of the data types and the amount of synthetic data lets us know how much synthetic data was created.

MOSTLY AI's QA report for synthetic data

How do you get your data ready for machine learning use cases?

Most raw data is not ready for machine learning. There are many steps to take to get your data ready for machine learning. We always start by doing some exploratory data analysis (EDA). We’ll go over how to do EDA with Python later in this article. Following some initial data exploration, we do data cleaning. The point of data cleaning is to get rid of any possible pitfalls in the data. Data cleaning involves steps like getting rid of nulls, detecting outliers, and finding redundant variables.

Sometimes your data has null values in it, and these values can throw an error in the training process of your machine learning model. If your data has outliers that don’t belong, it could cause your model to “learn” the wrong things. For example, if your machine learning model is meant to detect house cats but a bunch of your machine learning data contains lions labeled as cats, you’re going to get the wrong result.
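Here is a rough sketch of those three cleaning steps with pandas. The table is made up for illustration, and the specific choices (dropping rows with nulls, the 1.5 x IQR rule for outliers, and a 0.95 correlation threshold for redundancy) are common conventions that we are assuming here, not rules the data has to follow.

```python
import pandas as pd

# Made-up toy data: one null, one obvious outlier, one redundant column.
raw = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 44, 36, None, 41],
    "hours_per_week": [40, 45, 38, 50, 40, 35, 42, 40, 40, 400],
    "hours_per_day": [8.0, 9.0, 7.6, 10.0, 8.0, 7.0, 8.4, 8.0, 8.0, 80.0],
})

# 1. Get rid of nulls (here by dropping the affected rows).
no_nulls = raw.dropna()

# 2. Detect outliers with the 1.5 x IQR rule on one column.
q1 = no_nulls["hours_per_week"].quantile(0.25)
q3 = no_nulls["hours_per_week"].quantile(0.75)
iqr = q3 - q1
in_range = no_nulls["hours_per_week"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
no_outliers = no_nulls[in_range]

# 3. Find redundant variables: pairs of columns that are almost perfectly correlated.
corr = no_outliers.corr().abs()
redundant = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > 0.95
]

print(len(raw), "->", len(no_outliers), "rows; redundant pairs:", redundant)
```

Whether you drop, fill, or cap problem values depends on your data. The point is that each step is a deliberate decision, not something the model will sort out on its own.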

How to synthetically generate data for machine learning

Synthetically generated data comes from a real world dataset. You can use any of your datasets as the seed data and simply upload it to MOSTLY AI’s synthetic data platform to generate synthetic data. Before you upload your data, make sure that it’s in the right format. You should have a subject table where each entry in the table describes one and only one real world entity or subject and each row can be treated independently of the others. Follow these simple steps to prepare your datasets for synthetic data generation.
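One quick way to sanity check the “one row per subject” requirement is to look for repeated subject identifiers. The sketch below assumes your table has an identifier column. The column name `customer_id`, the helper `is_subject_table`, and the data are hypothetical, and this is not a check performed by MOSTLY AI itself.

```python
import pandas as pd

# Made-up example table; "customer_id" is a hypothetical subject identifier.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "age": [34, 51, 51, 27],
    "country": ["AT", "US", "US", "DE"],
})


def is_subject_table(table, id_column):
    """True if every subject appears in exactly one row."""
    return bool(table[id_column].is_unique)


print(is_subject_table(customers, "customer_id"))  # customer 2 appears twice

# One possible fix: keep a single row per subject.
deduplicated = customers.drop_duplicates(subset="customer_id", keep="first")
print(is_subject_table(deduplicated, "customer_id"))
```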

Once we have our synthetically generated dataset, we can use it like any other dataset. Let’s take a look at how we can get started with synthetically generated data for machine learning from MOSTLY AI’s example dataset. There are three basic steps to this. Like any good data scientist, we start by exploring our original dataset. In the later sections we’ll also explore our synthetically generated data and understand the statistical similarities between them. 

Exploring our original dataset

The example dataset that MOSTLY AI provides, in case you don't have your own, is US census data. Lucky for us, this dataset is already cleaned and set up so that we don’t have to do much pre-processing work. There is one step that isn’t done for us yet though. Before we use the data, we should check how MOSTLY AI’s synthetic data generator works on a small portion of the data. Let’s take a look at our data and cut out a small portion to be uploaded as a test sample.

We use the pandas library to read in the CSV file that our data comes in. You’ll see in the code below that I have also set some options for how the data should be displayed. I set the `display.max_columns` option to `None` so that all 11 of our columns are shown instead of being truncated. I’ve also set the `expand_frame_repr` option to `False` so that pandas does not wrap a wide frame across multiple blocks of lines.

We can see the first and last 5 entries when we print the DataFrame. From this brief exploratory data analysis, we can see that there are some nulls in the data. One entry is missing both where the person works and their occupation.

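import pandas as pd

# Show every column, and keep each row on one line instead of wrapping
pd.set_option('display.max_columns', None)
pd.set_option('expand_frame_repr', False)

df = pd.read_csv("us-census-income.csv")
print(df)  # a long DataFrame prints its first and last 5 rows

Synthetic data exploration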

Now that we’ve taken a look at the basics of our dataset, let’s cut out a small portion of test data. For this example, we’ll take a sample of 10 data points from the original dataset. We can simply use pandas’ `.sample` function to pull out some rows. Then we can use pandas’ `to_csv` function to save our sample dataset as a CSV file that we can upload to MOSTLY AI’s synthetic data generator as we talked about above.

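small_set = df.sample(n=10)  # pull 10 random rows from the original dataset
small_set.to_csv("small_set.csv", index=False)  # example file name; skip the index column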

Check out the code on GitHub.

Exploring the synthetic dataset

Once we have uploaded our real world dataset and generated our synthetic data for machine learning, we should do some exploratory data analysis on it. MOSTLY AI provides a Quality Assurance report with each set of synthetically generated data. This QA report includes a set of graphs for each variable. For our example data, we get 11 of these graphs back. We also get 55 graphs back, one for each pair of variables, that compare how each variable correlates with each other variable.

Exploring the univariate distributions

Each graph contains two lines. The gray line represents the statistical distribution of the original dataset. The green line represents the statistical distribution of the synthetically generated dataset. We want the gray and green lines to be close, but they do not have to be exactly the same. If the lines are too far apart, then the data is not statistically similar enough to be used for the same machine learning tasks.

We can see from the chart below that our gray and green lines line up quite well for our example census data. The categorical data, such as sex and race, simply use the categories themselves as the “bins” on the x-axis. On the other hand, we split the numerical data into 10 bins based on percentile with each bin representing a decile.

Univariate distributions of the synthetic census data
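If you want to reproduce this kind of comparison yourself, the binning described above can be sketched with pandas. The two small tables are made-up stand-ins for the original and synthetic data, and the helper names are our own. This is an approximation of the idea, not MOSTLY AI’s implementation.

```python
import pandas as pd

# Made-up stand-ins for the original (seed) and synthetic datasets.
original = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female", "Male"] * 20,
    "age": list(range(20, 70)) * 2,
})
synthetic = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female", "Male"] * 20,
    "age": list(range(21, 71)) * 2,
})


def categorical_shares(column):
    """Share of rows in each category; the categories act as the bins."""
    return column.value_counts(normalize=True).sort_index()


# Decile edges come from the original data. The outer edges are opened up
# so that synthetic values outside the original range still land in a bin.
edges = original["age"].quantile([i / 10 for i in range(11)]).tolist()
edges[0], edges[-1] = float("-inf"), float("inf")


def decile_shares(column):
    """Share of rows falling into each of the 10 decile bins (numbered 0 to 9)."""
    bin_index = pd.cut(column, bins=edges, labels=False)
    return bin_index.value_counts(normalize=True).reindex(range(10), fill_value=0.0)


sex_gap = (categorical_shares(original["sex"]) - categorical_shares(synthetic["sex"])).abs().max()
age_gap = (decile_shares(original["age"]) - decile_shares(synthetic["age"])).abs().max()
print(f"largest gap for sex: {sex_gap:.3f}, for age: {age_gap:.3f}")
```

Small gaps correspond to gray and green lines that sit close together. Large gaps would show up as lines that drift apart.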

Exploring the bivariate distributions

The second type of distribution in the report is the bivariate distribution. A bivariate distribution is a statistical model that describes the joint behavior of two random variables. This type of distribution allows us to analyze the relationship between the two variables and how they influence each other.

Understanding the bivariate distribution of a dataset can provide valuable insights into the relationship between the two variables. For example, a positive correlation between the variables would indicate that they tend to move in the same direction, while a negative correlation would indicate that they tend to move in opposite directions.

MOSTLY AI represents the bivariate distributions between the columns as matrices. By analyzing the joint behavior of two variables, data scientists can better understand the underlying patterns in the data and use this knowledge to make more accurate predictions. For our example, we want to ensure that the gray and green matrices look similar. As with the univariate data, categorical data uses its categories as bins while numerical data is binned into deciles. The images below show some of these 55 matrices.

Bivariate distributions of the synthetic census data
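A bivariate matrix of this kind can be approximated with `pd.crosstab`. In the sketch below, the data is made up, the helper `joint_shares` is our own name, and the choice of columns is only an example. It illustrates the idea rather than how MOSTLY AI builds its matrices.

```python
import pandas as pd

# Made-up stand-ins for the original (seed) and synthetic datasets.
original = pd.DataFrame({"sex": ["Male", "Female"] * 50, "age": list(range(20, 70)) * 2})
synthetic = pd.DataFrame({"sex": ["Female", "Male"] * 50, "age": list(range(20, 70)) * 2})

# Decile edges from the original data, with open outer edges.
edges = original["age"].quantile([i / 10 for i in range(11)]).tolist()
edges[0], edges[-1] = float("-inf"), float("inf")


def joint_shares(table):
    """Matrix of joint frequencies: one row per category, one column per age decile."""
    age_decile = pd.cut(table["age"], bins=edges, labels=False).rename("age_decile")
    matrix = pd.crosstab(table["sex"], age_decile, normalize=True)
    # Make sure both matrices share the same rows and columns before comparing.
    return matrix.reindex(index=["Female", "Male"], columns=range(10), fill_value=0.0)


gap = (joint_shares(original) - joint_shares(synthetic)).abs()
largest_gap = gap.to_numpy().max()
print(f"largest cell difference: {largest_gap:.3f}")
```

Each cell holds the share of rows that fall into one combination of bins, so two similar datasets produce two similar matrices.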

Understanding statistical features of synthetic data via MOSTLY AI

There are three main factors that make synthetic data useful as training data for machine learning. First, it helps preserve customer privacy. Second, it helps you generate much more data than you have. Third, and perhaps most critical, it is statistically similar to real world data. When we create synthetic data on MOSTLY AI’s synthetic data platform, they give us a QA report which shows how statistically similar the synthetic data is to the real world seed data. Let’s take a look at how we can understand the statistical similarity graphs.

The first set of graphs here are the distance to closest record (DCR) graphs and the nearest neighbor distance ratio (NNDR) graphs. We briefly touched on these earlier. Now we will further explore them and why they are important.

The DCR metric shows us how far each synthetic data point is from the closest datapoint in the original dataset. A distance of 0 is a duplicate. This metric gives us confidence that our synthetic dataset is not just the original dataset with some added noise. That would be bad for data privacy concerns. The graphs below show the distribution and cumulative distribution of DCRs.

DCR privacy metric
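To build some intuition for DCR, here is a minimal sketch in plain Python. It assumes purely numeric records that are already scaled to comparable ranges and uses Euclidean distance. Both are assumptions for illustration, since this article does not specify the distance measure MOSTLY AI uses, and the records are made up.

```python
import math

# Made-up numeric records, already scaled to comparable ranges.
original = [(0.10, 0.20), (0.40, 0.40), (0.90, 0.80)]
synthetic = [(0.10, 0.20), (0.50, 0.40), (0.70, 0.90)]


def distance_to_closest_record(record, reference):
    """Euclidean distance from one record to its nearest neighbor in `reference`."""
    return min(math.dist(record, other) for other in reference)


dcr = [distance_to_closest_record(s, original) for s in synthetic]
duplicates = sum(1 for d in dcr if d == 0)

print([round(d, 3) for d in dcr])
print("exact duplicates of original records:", duplicates)
```

The first synthetic record has a distance of 0 because it is an exact copy of an original record, which is the situation we want to keep rare.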

Next let’s look at NNDRs. An NNDR is the ratio of the distance to the closest record to the distance to a record that is farther away. MOSTLY AI’s synthetic data platform gives us the NNDR for the closest record compared to the fifth closest record. Just like the DCR graphs, the NNDR graphs show the distribution and the cumulative distribution for both the seed and synthetic data. The synthetic data preserves privacy well enough if its quantiles are not significantly below the target quantiles.
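The ratio itself is simple to compute once we have the distances. The sketch below again assumes numeric records and Euclidean distance, and the function name `nndr` and the records are made up for illustration.

```python
import math


def nndr(record, reference, k=5):
    """Ratio of the distance to the nearest record to the distance to the k-th nearest."""
    distances = sorted(math.dist(record, other) for other in reference)
    if distances[k - 1] == 0:
        return 0.0  # the record coincides with at least k reference records
    return distances[0] / distances[k - 1]


# Made-up reference records on a line, for easy mental arithmetic.
reference = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)]

close_ratio = nndr((0.5, 0), reference)  # 0.5 / 3.5
far_ratio = nndr((10, 0), reference)     # 5 / 9

print(round(close_ratio, 3), round(far_ratio, 3))
```

A ratio near 0 means a record sits much closer to one particular neighbor than to the rest. A ratio near 1 means its nearest neighbors are all about equally far away.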

The next statistical similarity to consider is how the variables are correlated. MOSTLY AI provides three graphs for us to look at. The leftmost graph shows how the different variables in the seed dataset are correlated. The middle graph shows how the variables in the synthetically generated dataset are correlated. Ideally, you want these two graphs to look almost the same. The rightmost graph shows us the differences between the synthetic data variable correlations and the seed data variable correlations. You want this graph to be pretty light, like the one shown in the image below.
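We can build the same three views ourselves with pandas. The sketch below uses Pearson correlation through `DataFrame.corr()`, which is an assumption on our part, since this article does not say which correlation measure the report uses. The two small tables are made up.

```python
import pandas as pd

# Made-up stand-ins for the seed and synthetic datasets.
original = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "hours": [40, 45, 38, 50, 40, 35],
    "education_years": [12, 16, 12, 18, 14, 10],
})
synthetic = pd.DataFrame({
    "age": [27, 30, 45, 53, 36, 31],
    "hours": [38, 44, 40, 48, 42, 36],
    "education_years": [12, 14, 12, 18, 16, 10],
})

corr_original = original.corr()    # "left" view: seed correlations
corr_synthetic = synthetic.corr()  # "middle" view: synthetic correlations
difference = (corr_synthetic - corr_original).abs()  # "right" view: the gaps

print(difference.round(2))
```

Values close to 0 in `difference` correspond to the light cells in the rightmost graph.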

The last statistical piece we need to understand about how to evaluate synthetic data for machine learning is accuracy. Earlier, we saw one number that measured overall accuracy. The image below gives us a matrix visualization for accuracy. The darker blues are more accurate. But how is accuracy measured when comparing synthetic data, which needs to be sufficiently different to preserve privacy, to the seed data? Accuracy is measured as 100% minus the total variation distance (TVD), expressed as a percentage. TVD is calculated as the normalized L1 (Manhattan) distance between the two binned distributions, that is, half the sum of the absolute differences between their bin frequencies. Equivalently, it is the largest possible difference between the probabilities that the two probability distributions can assign to the same event.

Synthetic data accuracy
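As a worked example, here is the accuracy calculation for a single categorical column, written as a minimal Python sketch. The frequencies are made up, and the function name is our own.

```python
def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions given as dicts."""
    bins = set(p) | set(q)
    return 0.5 * sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in bins)


# Made-up bin frequencies for one column in the seed and synthetic data.
original_shares = {"Male": 0.60, "Female": 0.40}
synthetic_shares = {"Male": 0.55, "Female": 0.45}

tvd = total_variation_distance(original_shares, synthetic_shares)
accuracy = 100 * (1 - tvd)

print(f"TVD = {tvd:.2f}, accuracy = {accuracy:.0f}%")
```

Here the two distributions differ by 5 percentage points in each bin, so the TVD is 0.05 and the accuracy for this column is 95%.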

Takeaways for gathering data for machine learning

Data and machine learning are tightly coupled. The results you get from your machine learning models are only as good as the quality of the data that you put into them. Neural networks learn from experience, and the only experience they have comes from the datasets you train them on. 

When it comes to gathering data for machine learning, you only have a few options. You can use your customer data - which needs to be scrubbed for privacy. Scrubbed data loses most of its value. You can buy your data. That could be exorbitantly expensive and take forever. You could scrape your data from the web. Most web data is noisy and messy, and we saw what happened with Microsoft’s Tay, which was trained on web data. The best option to preserve customer privacy and still have useful data is to use synthetic data.

Synthetic data is created from a real world seed dataset. It can be used for machine learning tasks because it is statistically similar to the original data. However, it is also different enough to be privacy preserving without having to scrub most of the useful information away. How can you get synthetic data?

MOSTLY AI provides a no-code platform to create synthetic data for machine learning. You simply upload your seed dataset and you’re ready to synthesize data. Once you have your synthesized dataset, it’s important to check out the statistics on it and make sure that it is similar to your real world dataset and preserves privacy. The important features to check for privacy are the distance to the closest record and the nearest neighbor distance ratio. For statistical similarity, check the univariate and bivariate distributions, the correlations, and the accuracy.