
Synthetic training data for improving the performance of fraud and anomaly detection AI


Key results:
- Improvement in machine learning performance
- Fewer cases to investigate
- Savings from reduced false positives

Fraud detection is a complex problem with many cutting-edge AI/ML solutions. However, these algorithms are only as good as the data used to train them. Traditional rule-based systems produce a high number of false positives and a labor-intensive follow-up process: investigating a single customer for potential fraud can cost up to $24,000.1 AI/ML algorithms help reduce false positives and detect new types of fraud, but their performance depends heavily on the quality of the training data. Rare, high-value frauds are often missed, and the signals pointing to fraudulent activity can be misleading.


Upsampling fraud patterns with synthetic training data can boost machine learning performance. Synthetic training sets can outperform the raw data because balancing the dataset lets the model detect fraud cases more reliably, resulting in a consistently high AUC. The “Area Under the Curve” (AUC) metric evaluates how well a binary machine learning classifier (such as a fraud detection algorithm) performs. It is the area under the ROC curve, which plots the true-positive rate against the false-positive rate across classification thresholds. In other words, it expresses how well the model separates fraud from non-fraud cases. The aim is a consistently high AUC, so that any threshold selected yields a high true-positive rate and a low false-positive rate. Fresh, large batches of synthesized training data should be generated periodically from raw transactional datasets to recalibrate the algorithm and catch new patterns and signals.
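AUC has an intuitive equivalent reading: it is the probability that a randomly chosen fraud case receives a higher model score than a randomly chosen non-fraud case. A minimal pure-Python sketch of that rank-based computation, using made-up illustrative scores and labels (not data from this case study):

```python
def auc(scores, labels):
    """AUC as the probability that a random positive (fraud) case
    outranks a random negative (non-fraud) case; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical model scores and true fraud labels for eight transactions.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,    0,   0,   0]
print(auc(scores, labels))  # 14 of 15 positive/negative pairs ranked correctly
```

A perfect classifier scores every fraud case above every non-fraud case (AUC = 1.0); a random one averages 0.5, which is why a consistently high AUC is the goal.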


Across various case studies, MOSTLY AI’s platform has shown a consistent relative AUC improvement of 2–15% compared to training on raw, imbalanced data.2 An improvement of 10% could yield a 10% decrease in false positives. Consider a model with a false-positive rate of 1% that has flagged 100,000 fraud cases: roughly 1,000 of them are not actually fraud. Lowering the false-positive rate to 0.9% leaves only about 900 incorrectly flagged cases. Investigating ~100 fewer cases, at up to $24,000 each, could save $2.4 million.
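The savings arithmetic above can be reproduced as a back-of-envelope calculation, using the figures quoted in the text (~$24,000 per investigation, 100,000 flagged cases); the helper name is our own, not part of any product API:

```python
# Back-of-envelope estimate of savings from a lower false-positive rate.
COST_PER_INVESTIGATION = 24_000  # USD, per the cited Economist figure

def false_positives(total_flagged, fp_rate):
    """Expected number of flagged cases that are not actually fraud."""
    return round(total_flagged * fp_rate)

before = false_positives(100_000, 0.010)  # ~1,000 wasted investigations
after = false_positives(100_000, 0.009)   # ~900 wasted investigations
savings = (before - after) * COST_PER_INVESTIGATION
print(f"${savings:,}")  # $2,400,000
```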

1 How financial firms help catch crooks, The Economist
2 Boost your Machine Learning Accuracy with Synthetic Data, Michael Platzer