If you’d like to recreate this experiment yourself, follow along in the companion notebook.
Introduction
This post revisits the third installment in a three-part series comparing the synthetic data generation features of two leading synthetic data libraries: Synthetic Data Vault (SDV) and the Synthetic Data SDK by MOSTLY AI.
This experiment expands on an earlier comparison of the two platforms that used a smaller dataset. The goal of this iteration is to show how these tools handle a two-table dataset containing millions of rows and complex relationships in the underlying data model, with special emphasis on the handling of multiple foreign keys in a single subject table.
When the original comparison was published in August, the engineers of Synthetic Data Vault provided feedback on the configuration we had used and shared their own notebook that delivered the expected results. Because our goal has always been an objective comparison between the two platforms, as in the first and second parts of this series, we removed the initial version of the third post, reviewed and executed the configuration provided by SDV, and now present this post as an exploration of the quality of the data generated by each platform.
This experiment was performed on a MacBook Pro with an Apple M4 processor and 32GB of memory; different equipment is likely to produce different results. Check out Part II of the series, focusing on a two-table scenario with sequential data, here.
Experiment Setup
For this experiment, we’ll use the Global Legal Entity Identifier Foundation (GLEIF) dataset to assess synthetic data generation performance in cases where referential integrity in generated data is critical. The GLEIF dataset contains two tables: organizations and relations.
Each organization is identified by a unique identifier and carries basic metadata about the company’s industry and location. Each relation is defined by a parent (START_ID) and child (END_ID) organization, along with metadata about the nature of the relationship: whether the parent is a partner or an investor with respect to the child, and the status of that relationship.
Each of these parent and child organization IDs corresponds to an organization ID found in the organization table. This data model, using two foreign keys in a single table, presents an interesting challenge for many synthetic data generation engines. Specifically, when tools are limited to a single foreign key object per entity, the second ID column must be treated as a simple integer or string sequence and is therefore unlikely to reflect the trends present in the underlying data.
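For reference, here is a minimal sketch of how the two tables might be loaded with pandas. The file paths and the organization primary key column name ("ID") are assumptions for illustration; START_ID and END_ID follow the GLEIF schema described above.

```python
import pandas as pd

# Hypothetical local CSV exports of the GLEIF dataset (paths are placeholders).
organizations = pd.read_csv("gleif_organizations.csv")  # one row per legal entity
relations = pd.read_csv("gleif_relations.csv")          # one row per parent-child relationship

# Both foreign keys in `relations` reference the organization identifier
# (column name assumed to be "ID" here), which is what makes this a
# two-foreign-key data model.
assert relations["START_ID"].isin(organizations["ID"]).all()
assert relations["END_ID"].isin(organizations["ID"]).all()
```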
Training and Generation
We’ll compare two approaches to synthetic data generation for this dataset with SDV and the Synthetic Data SDK by MOSTLY AI. Both tools aim to produce realistic, privacy-safe synthetic data, but they differ significantly in architecture, capabilities, and limitations. Our focus is on how closely the generated synthetic data for the START_ID and END_ID fields mimics the structure and shape of the original subject dataset. After all, one of the primary reasons to use synthetic data over approaches like homomorphic encryption is that it preserves not only statistical fidelity but also human-readable structure, enabling direct use in analytics and downstream workflows. In this instance, we’d expect that generated ID values would look and feel like those in the subject dataset (having the same length, structure, and other attributes as those found in the original data).
SDV supports single-table and multi-table generation and offers models like HMASynthesizer to preserve relationships. However, while SDV can automatically detect metadata and infer table relationships, we’ll observe that it struggles to generate realistic data that maintains the statistical properties of the underlying data.
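To make the SDV setup concrete, below is a minimal sketch of the multi-table workflow using metadata auto-detection and the HMASynthesizer. Exact method names can vary between SDV versions, and the commented-out relationship declaration (with the assumed primary key name "ID") shows how the second foreign key could be registered if auto-detection misses it.

```python
from sdv.metadata import MultiTableMetadata
from sdv.multi_table import HMASynthesizer

data = {"organizations": organizations, "relations": relations}

# Auto-detect column types and table relationships from the dataframes.
metadata = MultiTableMetadata()
metadata.detect_from_dataframes(data)

# If auto-detection only picked up one of the two foreign keys, the second
# relationship can be declared explicitly (primary key name "ID" is assumed):
# metadata.add_relationship(
#     parent_table_name="organizations",
#     child_table_name="relations",
#     parent_primary_key="ID",
#     child_foreign_key="END_ID",
# )

# Fit the hierarchical multi-table synthesizer and sample a same-sized dataset.
synthesizer = HMASynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample(scale=1.0)
```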
MOSTLY AI maintains the open-source Synthetic Data SDK that uses deep learning and the autoregressive model TabularARGN to generate realistic synthetic data. The Synthetic Data SDK can model complex relationships all while generating synthetic data that maintains the statistical properties of the underlying subject data.
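As a rough sketch of the corresponding setup with the Synthetic Data SDK, the configuration below marks START_ID as the context foreign key and END_ID as a second, non-context foreign key. The exact configuration keys should be checked against the SDK documentation, and the organization primary key name is again an assumption.

```python
from mostlyai.sdk import MostlyAI

# Run the SDK in local mode so training happens on the local machine.
mostly = MostlyAI(local=True)

generator = mostly.train(config={
    "name": "GLEIF two-table experiment",
    "tables": [
        {
            "name": "organizations",
            "data": organizations,
            "primary_key": "ID",  # assumed column name
        },
        {
            "name": "relations",
            "data": relations,
            "foreign_keys": [
                # START_ID is modeled as the context key ...
                {"column": "START_ID", "referenced_table": "organizations", "is_context": True},
                # ... while END_ID is a second, non-context foreign key.
                {"column": "END_ID", "referenced_table": "organizations", "is_context": False},
            ],
        },
    ],
})

# Generate a synthetic dataset of comparable size to the original.
synthetic_dataset = mostly.generate(generator)
synthetic_tables = synthetic_dataset.data()  # dict of pandas DataFrames
```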
Quality Assessment and Comparison
After generating synthetic data with both SDV and MOSTLY AI, we can conduct a thorough evaluation to assess the quality, privacy, and structural integrity of each dataset using the Synthetic Data Quality Assurance library from MOSTLY AI.
Our assessment focuses specifically on the quality of the data in the two foreign key columns, START_ID and END_ID. The Synthetic Data Quality Assurance library by MOSTLY AI provides a suite of quantitative metrics and detailed reports that evaluate how closely the synthetic data matches the original data across multiple dimensions, and we can use the tgt_context_key parameter to focus on each of the respective foreign keys in our experiment.
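Here is a minimal sketch of how such a report might be produced with the Synthetic Data Quality Assurance library, focusing on one foreign key at a time via tgt_context_key; argument names other than tgt_context_key are assumptions to be verified against the library’s documentation.

```python
from mostlyai import qa

# Evaluate the synthetic relations table against the original, treating
# START_ID as the context key; a second run with "END_ID" covers the other
# foreign key.
report_path, metrics = qa.report(
    syn_tgt_data=synthetic_tables["relations"],
    trn_tgt_data=relations,
    tgt_context_key="START_ID",
)
print(metrics)  # summary of accuracy and similarity metrics produced by the library
```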
The correlations and accuracy for the context and non-context foreign keys, respectively, for MOSTLY AI are shown below:
The correlations and accuracy for the context and non-context foreign keys, respectively, for SDV are shown below:
The main takeaway from the above is that the data generated by MOSTLY AI is far more statistically similar to the underlying data from the GLEIF dataset. We can dig further into the Model Report to understand exactly why that is.
The Univariate Correlations section of the Model Report reveals the primary challenge for both models: Sequence Length. Put simply, Sequence Length captures how many related records the model associates with a given entity. That means that if, in the underlying dataset, most organizations had 3-5 relationships, but certain kinds of organizations had more or fewer, a model that handles Sequence Length well would recognize those patterns and carry them over into the generated synthetic data.
MOSTLY AI’s Sequence Length performance for the START_ID and END_ID foreign keys, respectively, is shown here:
SDV’s Sequence Length performance for the START_ID and END_ID foreign keys, respectively, is shown here:
Perhaps a more readily interpretable metric is TYPE, which represents the nature of the relationship between the two organizations as defined in the GLEIF dataset. For a full list of expected values, refer to the companion notebook to this blog post.
Here are the results of MOSTLY AI for the TYPE field:
Here are the results of SDV for the TYPE field:
Conclusions
This experiment shows quite convincingly that, although both frameworks can generate synthetic data that maintains referential integrity (that is, every value in the generated foreign key columns corresponds to a value generated in the parent key column), the quality of the generated data differs significantly.
MOSTLY AI not only generates synthetic data that passes referential integrity checks, but also largely preserves the statistical patterns, relationships, and sequences found in the underlying dataset.
When you need to generate synthetic data that is indistinguishable from the real thing, for use cases like training ML models that can be used on your real-world data, the choice is clear. The choice is MOSTLY AI.



