I wanted to generate realistic mock data: structured, relational, and coherent enough to actually build with.

That’s where I started: not as someone at MOSTLY AI, but in the shoes of a data scientist or engineer who just needs believable data to get started. Something that looks and behaves like production data, without exposing any of it.

So I did what most people would do today. I opened an LLM.


Starting with Claude

I went to Claude, one of the most advanced language models available, running Sonnet 4.5. My prompt was simple:

“Generate a realistic healthcare insurance dataset with 5 related tables: Patients, Policies, Claims, Providers, and Payments.”

And, to its credit, Claude delivered something that looked impressive. There were tables, numbers, validation summaries, even claims of referential integrity and cross-table logic.
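
To make that concrete: a prompt like this implies a relational schema along the lines of the sketch below. The five tables come from the prompt itself, but the key names are my assumptions, not Claude’s actual output.

```python
# Hypothetical schema implied by the prompt: five tables tied together
# by foreign keys. Key names are illustrative assumptions.
SCHEMA = {
    "patients":  {"pk": "patient_id",  "fks": {}},
    "providers": {"pk": "provider_id", "fks": {}},
    "policies":  {"pk": "policy_id",   "fks": {"patient_id": "patients"}},
    "claims":    {"pk": "claim_id",    "fks": {"policy_id": "policies",
                                               "provider_id": "providers"}},
    "payments":  {"pk": "payment_id",  "fks": {"claim_id": "claims"}},
}
```

In a schema like this, referential integrity simply means that every foreign-key value in a child table exists as a primary key in the parent table it points to.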

For a moment, it seemed perfect.

But when I asked it to show me the same claims records again, just to recheck the exact values it had produced before, the illusion broke.
The data was regenerated from scratch: IDs changed, payment amounts shifted. What appeared to be structured was, in reality, language dressed up as data. Each new request rewrote the story instead of preserving the truth.
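
The breakage is easy to demonstrate. If you save each regenerated table as a CSV, a few lines of pandas surface the orphaned keys. This is a minimal sketch, assuming the file and column names from the schema sketch above:

```python
import pandas as pd

# File and column names are assumptions from the schema sketch above.
claims = pd.read_csv("claims.csv")
payments = pd.read_csv("payments.csv")

# Every payment should reference an existing claim. When the tables are
# regenerated independently, claim IDs change, so this set of orphaned
# references is typically non-empty.
orphaned = set(payments["claim_id"]) - set(claims["claim_id"])
print(f"{len(orphaned)} payments reference claims that no longer exist")
```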

That’s when the difference between text generation and data generation became clear.


Turning to MOSTLY AI

Next, I ran the exact same prompt inside the MOSTLY AI Assistant, which also uses Sonnet 4.5 under the hood.

But here’s the difference:

The Assistant doesn’t just generate text that looks like data—it applies structured prompting and validation logic designed to preserve relationships, maintain consistency, and automatically correct errors across tables.

Within minutes, it generated all five tables: Patients, Policies, Claims, Providers, and Payments. Every key linked correctly. Every relationship was validated. Any inconsistency was automatically corrected.

I ran both experiences side by side. In Claude, I could ask for corrections, but every new request rewrote the dataset entirely. In MOSTLY AI, when I validated the same claims, the engine preserved IDs, linked records correctly, and repaired mismatches automatically. What was disposable text in one became persistent, consistent data in the other.
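
Conceptually, that validate-and-repair step looks something like the sketch below. To be clear, this is my illustration of the idea, not MOSTLY AI’s actual engine: the simplest consistent repair is to drop orphaned rows, where a real engine might remap or regenerate them instead.

```python
import pandas as pd

def repair_foreign_keys(child: pd.DataFrame, fk: str,
                        parent: pd.DataFrame, pk: str) -> pd.DataFrame:
    """Keep only child rows whose foreign key matches a parent row."""
    valid = child[fk].isin(parent[pk])
    print(f"repaired {(~valid).sum()} orphaned rows")
    return child[valid].reset_index(drop=True)

# Usage: payments must point at existing claims (file names assumed).
claims = pd.read_csv("claims.csv")
payments = repair_foreign_keys(pd.read_csv("payments.csv"),
                               "claim_id", claims, "claim_id")
```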

Then I exported the results as downloadable CSVs—actual, persistent mock data ready for analysis, testing, or exploration.
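
From there, the files behave like any other dataset. For example, joining payments back to their claims for a quick reconciliation; the column names here are assumptions, as before:

```python
import pandas as pd

claims = pd.read_csv("claims.csv")
payments = pd.read_csv("payments.csv")

# With intact keys, every payment finds its claim on the left join.
merged = payments.merge(claims, on="claim_id", how="left")
print(merged[["claim_id", "claim_amount", "payment_amount"]].head())
```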

And that’s the moment I realized something simple but profound:
Claude generates words that look like data.
MOSTLY AI generates data that behaves like reality.


Why This Matters

Mock data might sound trivial, but it’s foundational.

For many teams, mock data is the first step toward broader data transformation—the bridge between what’s theoretical and what’s real. It’s how you build trust, test architectures, and prove value before production access.

If you can create mock data that behaves like reality, you earn the confidence to take the next step: introducing real data safely, and eventually generating synthetic data that can be shared and scaled without privacy risk.

That’s the natural progression. Not a strict sequence of mock, real, and synthetic data, but an evolution of trust.

This experiment—starting with a simple healthcare dataset—showed just how powerful that foundation can be.


What’s Next

This marks the first chapter in a broader story about how teams can responsibly evolve their data practice.

Next, I’ll explore how we move from realistic mock data to privacy-safe synthetic data, and what happens when AI can learn from data that no longer belongs to anyone, yet still represents everyone.

Because the future of data isn’t about making up numbers. It’s about making meaning possible—safely and intelligently.

🎥 Watch the companion video: