If you’d like to recreate this experiment yourself, follow along in the companion notebook.
Introduction
This blog post is the second in a three-part series assessing the synthetic data generation capabilities of two leading synthetic data libraries: Synthetic Data Vault (SDV) and the Synthetic Data SDK by MOSTLY AI.
This experiment expands on an earlier comparison of the two platforms, which used a smaller dataset. The goal of this iteration is to highlight how these tools handle a two-table dataset containing hundreds of thousands of rows and complex relationships in the underlying data model, with a special emphasis on maintaining coherence in sequential data.
This experiment was performed on a MacBook Pro with an Apple M4 processor and 32GB of memory; different equipment is likely to produce different results. Check out Part I of the series, focusing on a single-table scenario, here.
Experiment Setup
For this experiment, we’ll use two tables from the Berka dataset to assess synthetic data generation performance in cases where sequential coherence in generated data is critical. The two tables we’ll focus on for this comparison are the account and transaction tables.
In real-world financial datasets like Berka, values such as account balances must evolve in a logical sequence based on associated transactions. Randomly generating balance values, even within valid ranges, fails to produce useful data if sequential logic is not preserved. The utility of synthetic data depends on capturing these dynamic, evolving patterns.
Before training, we’ll create an 80/20 training and testing split based on the account table to ensure that the holdout data can be used to validate the generative performance of each framework on unseen data. We’ve chosen the account table rather than the transaction table because the former is the context for the latter. This approach allows us to measure how well the synthetic data replicates patterns not just from the training data, but from the overall distribution.
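As a minimal sketch (assuming the Berka tables have been loaded into pandas DataFrames named accounts and transactions; the exact split in the companion notebook may differ), splitting on the parent table looks like this:

from sklearn.model_selection import train_test_split

# Split on the parent table so each account, together with all of its
# transactions, lands entirely in either train or holdout.
accounts_train, accounts_holdout = train_test_split(
    accounts, test_size=0.2, random_state=42
)

# Child rows follow their parent account.
transactions_train = transactions[transactions['account_id'].isin(accounts_train['account_id'])]
transactions_holdout = transactions[transactions['account_id'].isin(accounts_holdout['account_id'])]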
Training and Generation
Now we’ll compare two approaches to generating synthetic data for this two-table dataset: SDV and the Synthetic Data SDK by MOSTLY AI. Both tools aim to produce realistic, privacy-safe synthetic data, but they differ significantly in architecture, capabilities, and limitations. Our focus is on how each handles sequential coherence in the generated synthetic data.
SDV supports single-table and multi-table generation and offers models like HMASynthesizer to preserve relationships. However, while SDV can automatically detect metadata and infer table relationships, we’ll observe that it struggles to generate sequentially coherent data, which is critical for our use case. This is a major limitation here because the entire goal of this synthetic data generation task is to create highly usable and realistic data.

MOSTLY AI maintains the open-source Synthetic Data SDK, which uses deep learning and the autoregressive model TabularARGN to generate realistic synthetic data. The Synthetic Data SDK supports two-table scenarios and can model complex relationships. For our use case, MOSTLY AI correctly maintains the sequential coherence we need in order to create truly usable synthetic data. The configuration below sets up the generator; training, generation, and SDV setup sketches follow it.
# Configure the generator for the sequential scenario
config = {
    'name': 'Berka Generator',
    'tables': [
        {
            'name': 'account',
            'data': accounts_train,
            'primary_key': 'account_id',
            'tabular_model_configuration': {
                'enable_model_report': False
            }
        },
        {
            'name': 'transaction',
            'data': transactions_train,
            'primary_key': 'trans_id',
            # 'is_context': True conditions each transaction sequence
            # on its parent account row
            'foreign_keys': [
                {'column': 'account_id', 'referenced_table': 'account', 'is_context': True}
            ],
            'tabular_model_configuration': {
                'enable_model_report': False
            }
        }
    ]
}
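With the configuration in place, training and generation follow the SDK’s train/generate pattern. A minimal sketch, assuming the SDK runs in local mode (exact call signatures may vary across SDK versions):

from mostlyai.sdk import MostlyAI

mostly = MostlyAI(local=True)    # train and generate locally
g = mostly.train(config=config)  # train on the two-table config above

# Sample a synthetic dataset; data() returns one DataFrame per table.
sd = mostly.generate(g, size=len(accounts_train))
synthetic = sd.data()
synthetic_transactions = synthetic['transaction']

For comparison, a minimal SDV setup for the same two tables might look like the following, with metadata and the account-transaction relationship detected automatically:

from sdv.metadata import Metadata
from sdv.multi_table import HMASynthesizer

tables = {'account': accounts_train, 'transaction': transactions_train}
metadata = Metadata.detect_from_dataframes(tables)  # infers keys and relationships

sdv_model = HMASynthesizer(metadata)
sdv_model.fit(tables)
sdv_tables = sdv_model.sample(scale=1.0)  # same relative table sizes as the input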
Quality Assessment and Comparison
After generating synthetic data with both SDV and MOSTLY AI, we can thoroughly evaluate the quality, privacy, and structural integrity of each dataset using the Synthetic Data Quality Assurance library from MOSTLY AI.
Our assessment focuses on statistical fidelity and sequential coherence. The library provides a suite of quantitative metrics and detailed reports that evaluate how closely the synthetic data matches the original data across multiple dimensions, including sequential coherence.
We’ll examine univariate (single variable), bivariate (two variable), and trivariate (three variable) distributions, correlation structures, and overall similarity metrics. In addition, we’ll evaluate privacy using the Distance to Closest Record (DCR) score, which helps identify potential overfitting or re-identification risks. These evaluations are performed on both the account and transaction tables, comparing the synthetic data against training and holdout sets to measure generalization.
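As a minimal sketch, a flat-table report for the transaction table can be produced as follows. Argument names follow the mostlyai-qa package; sequential evaluations with context tables take additional arguments, so treat this as illustrative rather than the exact notebook code:

from mostlyai import qa

# Compare synthetic vs. training data, with the holdout as a reference
# for generalization; returns an HTML report path and summary metrics.
report_path, metrics = qa.report(
    syn_tgt_data=synthetic_transactions,
    trn_tgt_data=transactions_train,
    hol_tgt_data=transactions_holdout,
)
print(metrics)  # accuracy, similarity, and distance (DCR) summaries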
Accuracy
MOSTLY AI consistently outperforms SDV in statistical fidelity, as indicated by a significantly higher accuracy score. For the transaction table, MOSTLY AI achieves an accuracy of 90.3% versus SDV’s 40.5%: a difference of 49.8 percentage points, or a relative improvement of over 120%. Both models perform well on DCR Share, a measure of the balance between privacy preservation and data utility.
Auto-correlations
The second part of our assessment addresses sequential coherence. We’ll check whether the distribution of relationships, such as the number of transactions per customer and the value of the account balance before and after each transaction, reflects realistic usage patterns and, more importantly, that the sequential integrity of all generated data is respected. This step is essential for validating that the synthetic data preserves not just isolated values but also the structural logic of the original dataset.
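One concrete check, sketched below under assumed Berka conventions (an unsigned amount column, a type column whose value PRIJEM marks credits, and balance holding the post-transaction balance), is whether each balance change matches the signed transaction amount:

import numpy as np

# Fraction of transactions whose balance change equals the signed amount.
def balance_coherence_rate(trans):
    df = trans.sort_values(['account_id', 'date'])
    sign = np.where(df['type'] == 'PRIJEM', 1, -1)  # PRIJEM = credit in Berka
    expected = sign * df['amount']
    delta = df.groupby('account_id')['balance'].diff()  # NaN on each account's first row
    mask = delta.notna()
    return np.isclose(delta[mask], expected[mask], atol=0.01).mean()

print('real:     ', balance_coherence_rate(transactions_train))
print('synthetic:', balance_coherence_rate(synthetic_transactions))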
Each analysis performed against the transaction table generated by MOSTLY AI showed strong coherence with the real trends observed in the Berka dataset.
While SDV maintained coherent correlations between certain observed variables, for sequential data such as date, balance, and amount its generated data broke completely from the trends observed in the Berka dataset.
In the context of synthetic data evaluation, coherence refers to how well a model preserves the internal structure of individual features over time or across a sequence; these within-feature patterns are known as auto-correlations.
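As an illustration of the idea (not the QA library’s exact metric), a per-account lag-1 auto-correlation can be computed directly with pandas and compared between real and synthetic transactions:

# Mean lag-1 auto-correlation of a column, averaged across accounts.
# Accounts with fewer than two transactions yield NaN and are dropped.
def mean_lag1_autocorr(trans, column='balance'):
    per_account = trans.sort_values('date').groupby('account_id')[column]
    return per_account.apply(lambda s: s.autocorr(lag=1)).dropna().mean()

print('real:     ', mean_lag1_autocorr(transactions_train))
print('synthetic:', mean_lag1_autocorr(synthetic_transactions))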
When comparing the results from the Synthetic Data SDK and SDV, we observe a significant difference in performance across all variables. SDV shows weak preservation of auto-correlations, with notably low scores for amount ~ amount (5.6%), balance ~ balance (10.8%), and date ~ date (18.3%), indicating that the synthetic data generated fails to replicate the sequential or repetitive patterns seen in the original dataset. By contrast, the Synthetic Data SDK demonstrates substantially stronger coherence across all variables, with scores consistently above 88%, including amount ~ amount (88.3%), balance ~ balance (89.6%), and date ~ date (94.5%).
This suggests that the Synthetic Data SDK is far more effective at capturing the intrinsic temporal and categorical structure of the original data. The visual heatmaps generated by the MOSTLY AI Synthetic Data Quality Assurance library confirm this difference: the synthetic panels closely mirror the original distributions, whereas SDV displays weaker and less structured patterns, indicating a degradation in feature-level consistency.
Sequences
Comparing sequences per distinct category between the Synthetic Data SDK and SDV further illustrates the performance gap in how well the two models capture sequential fidelity within categorical and numeric variables. With SDV, coherence scores are generally low: amount at 26.0%, balance at 16.9%, and k_symbol at 26.7%. The line plots below visualize the significant divergence between the synthetic and original distributions, particularly for numeric variables, where the synthetic data fails to match real-world value progressions. This suggests weak modeling of temporal or logical progressions within these features.
By contrast, the Synthetic Data SDK shows near-perfect alignment between synthetic and real data, with coherence scores like 97.3% for amount, 98.5% for balance, and 93.9% for k_symbol. In these plots, the synthetic and original lines closely track each other across both distribution and binned views, indicating that the Synthetic Data SDK captures category-specific sequences and transitions with high precision.
The high fidelity across all features in data generated by the Synthetic Data SDK reflects more effective learning of column dependencies, especially in the modeling of temporal and financial variables that are critical for dataset utility and value.
Conclusions
This experiment demonstrates the critical role of sequential coherence in generating high-quality synthetic data for multi-table datasets. While SDV offers flexibility and is widely used in research contexts, it struggled to model sequential and relational patterns in this setup. In contrast, MOSTLY AI’s Synthetic Data SDK consistently delivered high utility and privacy-preserving synthetic data with strong fidelity across all dimensions, making it a reliable solution for production use cases involving temporal and relational complexity.
If your use case involves analytics, compliance testing, or anything that relies on downstream joins, synthetic data generation by MOSTLY AI offers the kind of structural fidelity data professionals need to be successful. It’s a production-ready option for organizations that need synthetic data that is actually better than the real thing.
Try the Synthetic Data SDK by MOSTLY AI and experience state-of-the-art synthetic data generation that’s as realistic as it is secure.