TL;DR

  • Synthetic data generation methods changed significantly with the advance of AI
  • AI-generated, sample-based synthetic data is an entirely different beast than random or mock data
  • Types of AI-generated synthetic data include synthetic images, synthetic text, synthetic geolocation data, categorical, numerical and time-series data.
  • Stochastic processes are still useful if you care about data structure but not content
  • Rule-based systems can be used for simple use cases with low, fixed requirements toward complexity
  • Use deep generative models to automatically retain structure as well as information of data at scale to unlock private data and reduce model-to-market time
  • What the most common synthetic data types and their most common use cases are.

An overview of data generation methods

Not all synthetic dataset is created equal and in particular, synthetic data generation today is very different from what it was 5 years ago. Let’s take a look at different methods of synthetic data generation from the most rudimental forms to the state-of-the-art methods to see how far the technology has advanced! In this post we will distinguish between three major methods:

  • The stochastic process: random data is generated, only mimicking the structure of real data.
  • Rule-based data generation: mock data is generated following specific rules defined by humans.
  • Deep generative models: rich and realistic synthetic data is generated by a machine learning model trained on real data, replicating its structure and the information it contains.

Generate synthetic data for software testing

MOSTLY AI's synthetic data platform uses AI to do all the heavy lifting for test engineers by learning business rules and synthesizing entire databases. Register your free forever account and synthesize up to 100K rows of production-like data daily!

Which data generation method should you choose? Evaluation metrics 101

The choice of method depends on the use case and should be evaluated - if possible - both by an expert on the data synthesis and by a domain expert, who is familiar with the data and its downstream usage.
In addition to use-case-specific criteria, several general aspects can be used to evaluate and compare the different methods available.

  • Computation: how much compute is needed to generate data or to build a model.
  • Human labor: how much human expertise and labor goes into the generation process.
  • System complexity: how difficult it is to build such a data generation system.
  • Information content: how much information is present in the synthetic data.

The stochastic process: when form matters more than content

If the structure of the desired synthetic data is known, and the data distribution is irrelevant -when random noise is all you need - the stochastic process is a perfect method.

An example would be where the synthetic dataset should take the form of a CSV file with a specific number of columns and rows. A random number generator can be used to fill the fields following a defined distribution.

The applicability of such a process is limited to cases where the content of the synthetic data is irrelevant and random noise is good enough in place of real data. Examples of such applications would be stress testing systems, where a huge amount of random data is generated on the fly to evaluate how systems behave under heavy use.

  • Computation: generating random data has very small computational needs and can be performed on the fly whenever random data is needed.
  • Human labor: the data structure of the synthetic data can be defined easily, or inferred from an existing dataset, reducing the human expertise and labor to a minimum.
  • System complexity: these systems are the easiest to build and challenges during implementation often concern computational efficiency to maximize the amount of data that can be generated with given resources.
  • Information content: the generated synthetic data contains no relevant information.

Rule-based data generation: the human-powered machine

The obvious downside to data generated with stochastic processes is their limited use-cases since the data is random and contains no real information.
Rule-based data generation systems improve on that by using hand-generated data following specific rules defined by humans. The complexity of those rules can vary from very simple, taking only the desired data type of a column into account (i.e. if a column contains numeric, categorical, or text data), to more sophisticated rules, that define relationships between various columns and events. The amount of human labor and expertise needed, as well as the information contained in the generated data, are therefore completely dependent on the defined rules.

Thus, rule-based methods come with three additional challenges:

  • Scalability: datasets containing many different interdependent columns in a multi-table configuration easily need hundreds of complex and intertwined rules. Adding additional rules gets more and more difficult, practically limiting the maximum complexity of the data that can be modelled.
  • Bias: since the rules are defined by human experts, the bias of those experts is reflected in the rules and is therefore present in the generated data. Some table columns might reflect a clearly defined business logic, where bias is part of the agreed policy, but others (e.g. customer behavior, or patient history) might be more susceptible to unconscious human bias.
  • Drift: real world data is continuously changing, so rules need to be altered to reflect that change. Complex rule-based systems need effective change management governing the conditions and deciding which rules have to be adapted to reflect changes in the context of their application.

Coping with these challenges can be very difficult, and in many cases, they prove to be deal-breakers. Specifically, Scalability and Drift prevent rule-based systems to be used in applications that require flexibility and support for changing data requirements, effectively limiting its applicability to use cases where the scope and the data requirements are exactly known and will not change. But if these challenges are successfully met, a rule-based system can be a good enough choice for testing in applications, ranging from the generation of tabular data to multimedia content.

However, in any case, no additional information can be extracted from the rule-based synthetic data, then what was already known beforehand and manually encoded into the rules. Thus, these datasets offer no value for analytics, nor for decision support, nor for training machine learning models.

Several web-based tools exist where one can manually define the structure and simple rules to generate tabular data. Data generated with such systems can then be used for testing purposes in software development, or integration tests, ranging from the most typical to the testing of specific edge cases.

  • Computation: the computational resources needed to run such a system are completely governed by the number and the complexity of the rules, but in general, the method’s computation needs can be classified as minor to moderate.
  • Human labor: the amount of human labor and expertise needed to use such a system is extensive and much higher than any of the other methods described.
  • System complexity: the system grows with the number and complexity of the rules supported. In addition, the format/language and interface used to describe rules can be a major contributor to system complexity. In general, the method’s system complexity is high.
  • Information content: the information contained in the generated data is limited only by the applied rules.

AI-generated synthetic data: learning by example

Generative AI models are a class of statistical models that learn the distribution of training
data and can be used to generate new data following that distribution.

Applying generative models from machine learning, it is possible to train a machine learning model (e.g. an artificial neural network) with real data so that it learns the structure and the information contained and is able to generate new synthetic data. Synthetic data generators can be open source, like MIT's synthetic data vault or proprietary, like MOSTLY AI's synthetic data platform. When comparing synthetic data quality, MOSTLY AI's robust commercial solution outperformed SDV in a research conducted by the Joint Reserach Centre of the European Commission.
The human guidance needed by such a system can be minimal. In the best case, no human interaction is needed and the machine learning model is trained automatically.

The complexity of the data that can be learned by such a model is, primarily, limited by the data available and the model capacity (i.e. model architecture and hyperparameters). If the data requirements change, no significant adjustments are needed, simply a new model needs to be trained on the actual data.

Due to the power of machine learning models mimicking the training data, three new challenges unique to this approach have to be addressed:

  • Data similarity: The success of replicating the information contained within the original data depends on data complexity and the capacity of the model one chooses to use. For best results, special attention must be paid to testing and documenting the similarity of the synthetic data compared to the original data used to train the model. The metric of similarity is application-specific, but methods from descriptive statistics can be used to analyze the univariate and multivariate distribution, as well as the correlation between features for the synthetic and training data.
  • Privacy: with great power comes great responsibility. With many machine learning models being prone to overfitting, one needs to take particular caution to prevent the memorization of training examples. This is particularly important, if the training data is privacy-sensitive, and must be protected.
  • Business rules: generative models learn the distribution of features in the training data, and don't yet have a human-like understanding of what a feature means, or what relational qualities exist between features. However, a well-trained generative model will be able to learn most of the rules and retains relationships contained within the training data. Those rules, which are not yet adhered to, can be enforced by simply filtering out the few invalid records as part of the post-processing.

Once these challenges have been met, the applications of AI-generated synthetic data are almost limitless, and even go beyond what is possible with real data.
Two unique opportunities arise with the use of generative models. One is the use of synthetic data in place of the original data that cannot be accessed because of legal and privacy reasons, and the second is the use of synthetic data within a company to reduce the development time of machine learning models.

One such example, where synthetic data is playing a key role in unlocking original data protected for privacy reasons, is in finance. Here, synthetic data is used for example to improve fraud detection, since it contains the statistical properties necessary for improving fraud detection systems, without exposing the privacy of individuals. Sharing data across departments and even country borders becomes a seamless process when using a high quality, accurate and privacy compliant deep generative tool, like our very own MOSTLY AI synthetic data platform. As shown in our benchmarking study, MOSTLY AI is the world’s most accurate deep generative tool, which makes the most of all deep generative model advantages, such as the highest levels of statistical utility.

Another example is the use of synthetic data by data science, machine learning and business intelligence units. In most working environments, data access is strictly regulated resulting in time-consuming processes. Working with synthetic data instead of the original makes it possible to build models much faster and to reduce model-to-market time. We have shown in a recent study that models trained on synthetic data achieve comparable results, and in some cases even outperform models trained on original data. AI model development is increasingly relying on synthetic training data due to the possibilities of data augmentation during the synthesization process, turning data into modelling clay for data scientists and machine learning engineers.

Test data generation can also massively benefit from the power of AI. Synthetic test data generators can pick up on business rules encoded in the production data and automatically recreate them. The resulting synthetic test data is highly realistic and covers many more business cases than manual data generation ever could.

  • Computation: the training of machine learning models is in general very compute-intensive, but the problem can be mitigated by the use of special hardware (e.g. graphics cards, TPU, ASIC as well as others) and by the use of cloud computing.
  • Human labor: an optimal machine learning system can be application-agnostic, limiting user interaction to the selection of the training data used to create the models and to some post-processing - depending on the qualities of the dataset and its application.
  • System complexity: Machine learning-enabled data synthesis is by far the most complex of the described systems, as it contains multiple complex sub-components (e.g. machine learning model, training infrastructure, and user applications to interact with the system). The architecture of generative models for tabular data and its power to solve previously untouched problems, such as human bias and imbalances is an open research topic full of future potentials.
  • Information content: a model capable of generating synthetic data with high similarity to the training data maximizes information content, surpassing rule-based systems by a huge margin and in some cases is even better than raw data.

For a more in-depth introduction to generative models, take a look at this blog post by OpenAI or Stanford’s free course on the subject.

Rundown

Here is a summary of the methods compared and their performance on the metrics used to evaluate them.

How to choose the right method?

After having discussed the capabilities and challenges of the various generation methods, how do you decide which one matches best the requirements and use cases at hand? Two simple questions shall guide your decision:

  • Do you have resources/expertise available to build your own system, and
  • Does the synthetic data need to be realistic as well as representative?

Solutions making use of stochastic processes and rule-based systems are highly dependent on their use case, and almost always require the development of new property software. Libraries like cayenne or PyRATA can support such efforts, but require expertise, resources, and will to be maintained.

If the synthesized data has to be realistic, stochastic processes are out of the question and rule-based systems only make sense if it is clear what the data should look like and that description can be written in code.

If in-house development is not an option and the synthetic data has to be as realistic and representative as possible, the use of ML-enabled systems as a service is the best course of action.

How is MOSTLY AI making the most out of generative models?

With MOSTLY AI's synthetic data platform we are addressing the unique challenges coming with generative models. Our synthetic data generation platform comes with built-in privacy guarantees, preventing overfitting, and eliminating the risk of re-identification. Synthetic data is exempt from data privacy regulations, freeing up your previously untouched data assets for sharing and utilization. What’s more, generated synthetic datasets come with automated quality assurance reports featuring privacy and accuracy metrics, which make the assessment of the quality quick and painless.

We at MOSTLY AI are proud to serve customers worldwide. Our clients use synthetic data for finance, insurance and telecommunications use cases. Curious? Head over to the browser version of our product and generate synthetic data yourself for free forever, up to 100K rows a day! Hopefully, this post has been useful to you and provided you with a better understanding of how synthetic data generation has evolved from simple stochastic processes to sophisticated deep generative models. Feel free to reach out and contact us, either with feedback, questions, or any other concerns.

Types of synthetic data and the best synthetic data use cases

Deep generative models can tackle many different types of data, learn what they look like and generate as much or as little statistically and visually indistinguishable synthetic data as needed. It's a fascinating field of study where new use cases are born almost every day. Let's go through the most common AI-generated synthetic data types and their use cases.

Synthetic images for training computer vision models

Synthetic images were the first kinds of synthetic data to see the light of day. Computer vision models are increasingly relying on synthetic images to learn scenarios for which no real image exists. Synthetic image data is unstructured data by type. Unstructured data sets are unstructured because they do not come as pre-defined data models. As a result, it is full of ambiguities and irregularities, which are challenging for traditional, rule-based systems. However, this unstructured quality makes it very suitable for AI and machine learning models, which are exceptionally good at picking up even the smallest of patterns invisible to the human eye. Training computer vision models with real pictures of humans also comes with a host of privacy and bias issues. Some of the most exciting use cases in synthetic image data include:

  • one of the most popular 3D game development engine, Unity, is now used for generating synthetic images,
  • open source synthetic image banks are published to accelerate AI and machine learning development in healthcare,
  • synthetic shopping carts are next in line for computer vision: a retail start-up is using synthetic images to train their computer vision software for self-checkouts,
  • synthetic images are used to train computer vision models for logistics and supply chain applications.

Synthetic text data

Another common type of unstructured synthetic data is text. Although synthetic text isn't always meaningful for humans, AI and machine learning algorithms can read and understand synthetic text well. Synthetic text is perfect for machine learning tasks like sentiment analysis. It's especially useful for analyzing privacy sensitive pieces of text, like annotations for bank transactions or voice commands given to ioT devices. MOSTLY AI's synthetic data platform can generate synthetic text up to 1000 characters long. Some of the most common use cases for synthetic text include:

Synthetic geolocation data or geodata

Geolocation data - essentially coordinates - are especially sensitive pieces of information. However, when there is a need to share data sets containing geolocation data, the need for accuracy is usually very high. Old school data anonymization techniques, like data masking, destroy the utility of geolocation data and carry a high risk of deanonymization or membership inference attacks. Synthetic data generators can provide both the privacy and accuracy necessary for synthesizing geodata assets. Using synthetic geodata, banks can figure out where to instal new ATMs or detect fraudulent patterns. Some synthetic geolocation use cases include:

Time-series or behavioral data

Data related to experience, such as customer behavior records, like transaction data, are notoriously hard to anonymize. Similarly to geodata, due to the chronological order of datapoints in behavioral data sets, individuals are easy to identify. This so-called curse of dimensionality is why a taxi ride or your credit card history are pretty much like fingerprints. That is also one of the reasons why behavioral data assets are one of the most valuable synthetic data types. They contain the insights and patterns related to customer behavior, the most valuable for any business to uncover. Since AI is particularly good at pattern recognition, time-series data is well suited for synthesization. Synthetic behavioral or time series data is one of the most common and most valuable datatypes, mission-critical for developing well-performing AI and machine learning algorithms. Some of the best use cases are:

Categorical and numerical synthetic data

The most common synthetic data type in structured data assets are categorical and numeric values. Categorical data include, for example, marital status, gender, segmenting information and other pre-defined classes. Numerical data types include income, transaction amounts, insurance premiums, frequency of claims and so on. The quality and realism of this data is especially critical in software testing and QA. AI-generated test data provides a high level of realism and automatically picks up business rules contained in the data. The best synthetic data use cases in software development include:

Generate synthetic data for your software testing needs

MOSTLY AI's synthetic data platform uses AI to do all the heavy lifting for test engineers by learning business rules and synthesizing entire databases. Download our ebook to learn how AI-generated test data helps!