If you’d like to recreate this experiment yourself, follow along in the companion notebook.

Introduction

This blog is the third in a three-part series of comparisons assessing the synthetic data generation functionalities and features of two leading synthetic data libraries: Synthetic Data Vault (SDV) and the Synthetic Data SDK by MOSTLY AI. 

This experiment builds on an earlier comparison of the two platforms that used a smaller dataset. The goal of this series is to highlight how these tools handle a two-table dataset containing millions of rows and complex relationships in the underlying data model, with a special emphasis on referential integrity.

This experiment was performed on a MacBook Pro with an Apple M4 processor and 32GB of memory; different equipment is likely to produce different results. Check out Part II of the series, focusing on a two-table scenario with sequential data, here.

Experiment Setup

For this experiment, we’ll use the Global Legal Entity Identifier Foundation (GLEIF) dataset to assess synthetic data generation performance in cases where referential integrity in the generated dataset is critical. The GLEIF dataset contains two tables: organizations and relations.

Organizations are identified by a unique ID and carry basic metadata about the company’s industry and location. Relations are defined by a parent (START_ID) and a child (END_ID) organization, along with information about the nature of their relationship — for example, whether the parent organization acts as a partner or an investor with respect to the child — and the status of that relationship.

Each parent and child organization ID in the relations table corresponds to an organization ID found in the organizations table. This data model, using two foreign keys in a single table, presents an interesting challenge for many synthetic data generation engines. Specifically, when tools are limited to a single foreign key per entity, the second ID column must be treated as a simple integer or string sequence and is therefore unlikely to reflect the trends present in the underlying data.
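To make the shape of this data model concrete, here is a minimal sketch in pandas. The ID, START_ID, and END_ID columns follow the GLEIF schema described above; the metadata columns and all values are invented for illustration:

```python
import pandas as pd

# Toy organizations table: one row per legal entity
organizations = pd.DataFrame({
    "ID": ["ORG1", "ORG2", "ORG3"],
    "INDUSTRY": ["Finance", "Energy", "Retail"],   # illustrative metadata
    "COUNTRY": ["US", "DE", "JP"],
})

# Toy relations table: BOTH columns are foreign keys into organizations.ID
relations = pd.DataFrame({
    "START_ID": ["ORG1", "ORG2"],    # parent organization
    "END_ID": ["ORG2", "ORG3"],      # child organization
    "TYPE": ["investor", "partner"],
    "STATUS": ["active", "active"],
})

# Referential integrity requires every endpoint to resolve to an organization
org_ids = set(organizations["ID"])
both_valid = (
    relations["START_ID"].isin(org_ids) & relations["END_ID"].isin(org_ids)
).all()
print(both_valid)  # True
```

A generator that only models START_ID as a true foreign key would leave END_ID unconstrained, and this check would fail on its output.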

As we’ve done before, we’ll create an 80/20 training and testing split based on the organizations table to ensure that the holdout data can be used to validate the generative performance of each framework on unseen data. We’ve chosen to split on the organizations table rather than the relations table, since relations depend on organizations and can only be interpreted within that context. Splitting the data like this allows us to measure how well the synthetic data replicates patterns not just from the training data, but from the overall distribution.
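A minimal sketch of one way to perform such a split with pandas. The toy table sizes are invented, and dropping relations that straddle the two splits is one possible design choice for keeping the training set referentially consistent, not necessarily what either tool does internally:

```python
import pandas as pd

# Toy stand-ins for the real GLEIF tables
organizations = pd.DataFrame({"ID": [f"ORG{i}" for i in range(10)]})
relations = pd.DataFrame({
    "START_ID": ["ORG0", "ORG1", "ORG2"],
    "END_ID": ["ORG3", "ORG4", "ORG5"],
})

# 80/20 split on the organizations table
train_orgs = organizations.sample(frac=0.8, random_state=42)
holdout_orgs = organizations.drop(train_orgs.index)

# Keep only relations whose endpoints both fall in the training split
train_ids = set(train_orgs["ID"])
train_rels = relations[
    relations["START_ID"].isin(train_ids) & relations["END_ID"].isin(train_ids)
]

print(len(train_orgs), len(holdout_orgs))  # 8 2
```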

Training and Generation

We’ll compare two approaches to synthetic data generation for this dataset with SDV and the Synthetic Data SDK by MOSTLY AI. Both tools aim to produce realistic, privacy-safe synthetic data, but they differ significantly in architecture, capabilities, and limitations. Our focus is on how each handles the referential integrity of a dataset which includes multiple foreign keys in a single table.

SDV supports single-table and multi-table generation and offers models like HMASynthesizer to preserve relationships. However, while SDV can automatically detect metadata and infer table relationships, we’ll observe that SDV struggles to map trends present in the underlying subject data when one of the source tables contains multiple foreign keys, which is critical for our use case. This presents a major limitation because the entire goal of this synthetic data generation task is to create highly usable and realistic data.

MOSTLY AI maintains the open-source Synthetic Data SDK, which uses deep learning and the autoregressive model TabularARGN to generate realistic synthetic data. The Synthetic Data SDK can model complex relationships, including entities with multiple foreign keys. For our use case, MOSTLY AI correctly maintains the referential integrity that we need in order to create truly usable synthetic data.

Quality Assessment and Comparison

After generating synthetic data with both SDV and MOSTLY AI, we can conduct a thorough evaluation to assess the quality, privacy, and structural integrity of each dataset using the Synthetic Data Quality Assurance library from MOSTLY AI.

Our assessment focuses on referential integrity and the overall accuracy of the synthetic data. The Synthetic Data Quality Assurance library by MOSTLY AI provides a suite of quantitative metrics and detailed reports that evaluate how closely the synthetic data matches the original data across multiple dimensions, including referential integrity. 

Overall Accuracy

As in any assessment of synthetic data, the most universally relatable metric is accuracy. The data generated by MOSTLY AI was assessed at 94% overall accuracy, with each of the multivariate assessments also scoring above 90%.

By contrast, the synthetic data from SDV scored just 37.6% overall accuracy, with the trivariate analysis returning just 19.1% accuracy — meaning the correlations between any three arbitrary columns had roughly the same chance of resembling the subject dataset as the roll of a die.

Referential Integrity

As mentioned above, datasets with entities that contain multiple foreign keys present a challenge that many synthetic data generation tools are not equipped to handle gracefully.

For this part of the assessment, we had to write a special script to assess the validity of the foreign key relationships in our synthetic datasets. Remember, the START_ID and END_ID columns in the relations table both reference the ID column in the organizations table.

In an ideal scenario, all of the IDs from START_ID and END_ID would reference synthetic IDs in the generated organizations table.

import pandas as pd


def verify_fk_integrity_orgs_relations(
    org_df: pd.DataFrame,
    rel_df: pd.DataFrame,
    provider_name: str,
    pk_col: str = "ID",
    fk_cols: tuple = ("START_ID", "END_ID"),
):
    """
    Comprehensive foreign key integrity verification for GLEIF-style graph (two FKs).

    Args:
        org_df: Synthetic organizations dataframe (parent/nodes)
        rel_df: Synthetic relations dataframe (child/edges)
        provider_name: Name of the synthetic data provider (e.g., 'SDV', 'MOSTLY AI')
        pk_col: Primary key column in org_df (default 'ID')
        fk_cols: Tuple of foreign key columns in rel_df (default ('START_ID','END_ID'))

    Returns:
        Dictionary with integrity and coverage metrics.
    """
    print(f"\n🏷️  {provider_name} Foreign Key Verification (relations → organizations on {fk_cols})")
    print("-" * 60)

    missing_org = [c for c in [pk_col] if c not in org_df.columns]
    missing_rel = [c for c in fk_cols if c not in rel_df.columns]
    if missing_org or missing_rel:
        raise KeyError(
            f"Missing required columns. organizations missing={missing_org}; relations missing={missing_rel}"
        )

    org = org_df.copy()
    rel = rel_df.copy()

    # Align FK dtypes to PK dtype (fallback to string if needed)
    pk_dtype = org[pk_col].dtype
    for fk in fk_cols:
        try:
            rel[fk] = rel[fk].astype(pk_dtype)
        except Exception:
            # fallback: cast all to string
            org[pk_col] = org[pk_col].astype(str)
            rel[fk] = rel[fk].astype(str)
            pk_dtype = org[pk_col].dtype  # update dtype

    # Build sets for checks
    org_ids = set(org[pk_col].dropna().unique())
    fk_sets = {fk: set(rel[fk].dropna().unique()) for fk in fk_cols}

    # Invalid FKs per column
    invalid_per_fk = {fk: fk_sets[fk] - org_ids for fk in fk_cols}
    total_invalid = sum(len(invalid_per_fk[fk]) for fk in fk_cols)

    # Edge-level validity masks
    valid_masks = {fk: rel[fk].isin(org_ids) for fk in fk_cols}
    both_valid_mask = valid_masks[fk_cols[0]] & valid_masks[fk_cols[1]]
    any_invalid_mask = ~(both_valid_mask)

    both_valid_edges = int(both_valid_mask.sum())
    total_edges = int(len(rel))
    any_invalid_edges = int(any_invalid_mask.sum())
    pct_both_valid = (both_valid_edges / total_edges * 100) if total_edges else 0.0
    pct_any_invalid = (any_invalid_edges / total_edges * 100) if total_edges else 0.0

    # Coverage: how many organizations are referenced by at least one endpoint
    referenced_orgs = set(rel.loc[valid_masks[fk_cols[0]], fk_cols[0]].dropna()) | set(
        rel.loc[valid_masks[fk_cols[1]], fk_cols[1]].dropna()
    )
    coverage_count = len(referenced_orgs)
    total_orgs = len(org_ids)
    coverage_pct = (coverage_count / total_orgs * 100) if total_orgs else 0.0

    # Degree distribution (count edges touching each org across BOTH endpoints)
    # Build a single Series of all endpoints (valid or not), then count
    endpoints = pd.concat([rel[fk_cols[0]], rel[fk_cols[1]]], ignore_index=True)
    degree_counts = endpoints.value_counts(dropna=True)
    avg_degree = float(degree_counts.mean()) if not degree_counts.empty else 0.0
    median_degree = float(degree_counts.median()) if not degree_counts.empty else 0.0
    max_degree = int(degree_counts.max()) if not degree_counts.empty else 0
    min_degree = int(degree_counts.min()) if not degree_counts.empty else 0

    referential_integrity_all = (total_invalid == 0) and (any_invalid_edges == 0)

    print(f"Edges with both endpoints valid: {both_valid_edges}/{total_edges} ({pct_both_valid:.1f}%)")
    print(f"Organizations referenced by at least one edge: {coverage_count}/{total_orgs} ({coverage_pct:.1f}%)")
    print(f"Referential integrity intact: {referential_integrity_all}")

    return {
        "provider": provider_name,
        "total_edges": total_edges,
        "both_valid_edges": both_valid_edges,
        "pct_both_valid": pct_both_valid,
        "any_invalid_edges": any_invalid_edges,
        "pct_any_invalid": pct_any_invalid,
        "invalid_values_per_fk": {fk: len(invalid_per_fk[fk]) for fk in fk_cols},
        "coverage_count": coverage_count,
        "coverage_pct": coverage_pct,
        "avg_degree": avg_degree,
        "median_degree": median_degree,
        "max_degree": max_degree,
        "min_degree": min_degree,
        "referential_integrity_all": referential_integrity_all,
    }

Upon inspection, though, we see that while MOSTLY AI generated coherent data in both columns — with 100% of the generated START_ID and END_ID values in the synthetic relations table corresponding to a generated value in the ID column of the synthetic organizations table — SDV did not: none of the generated END_ID values corresponded to a synthetic ID value in the generated organizations table.
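The core of this check reduces to a few lines of pandas. The toy values below merely illustrate the failure mode described above — an END_ID column that never resolves against the synthetic organizations — and are not the actual generated data:

```python
import pandas as pd

syn_orgs = pd.DataFrame({"ID": ["A", "B", "C"]})
syn_rels = pd.DataFrame({
    "START_ID": ["A", "B"],      # resolves against syn_orgs.ID
    "END_ID": ["X9", "Y7"],      # sequence-like values that never resolve
})

# Percentage of each foreign key column that resolves to a synthetic org ID
org_ids = set(syn_orgs["ID"])
pct_valid = {
    fk: syn_rels[fk].isin(org_ids).mean() * 100
    for fk in ("START_ID", "END_ID")
}
print(pct_valid)  # {'START_ID': 100.0, 'END_ID': 0.0}
```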

In a real-world scenario, this would prevent us from modeling relational trends present in the underlying subject dataset and limit the utility of our synthetic dataset.

Conclusions

This experiment demonstrates the critical role of referential integrity in generating high-quality synthetic data for multi-table datasets. While SDV offers flexibility and is widely used in research contexts, it struggled to preserve the trends in the underlying subject dataset with multiple foreign keys in this setup. In contrast, MOSTLY AI’s Synthetic Data SDK consistently delivered high utility and privacy-preserving synthetic data with strong referential integrity across all dimensions, making it a reliable solution for production use cases involving temporal and relational complexity.

If your use case involves analytics, compliance testing, or anything that relies on downstream joins, synthetic data generation by MOSTLY AI offers the kind of structural fidelity data professionals need to be successful. It’s a production-ready option for organizations that need synthetic data that is actually better than the real thing. Try the Synthetic Data SDK by MOSTLY AI and experience state-of-the-art synthetic data generation that’s as realistic as it is secure.