Synthetic data workflows have traditionally required a mix of engineering effort, platform knowledge, and careful coordination across tools. You upload a dataset, write code to explore it, adjust preprocessing steps, configure a generator, train it, generate data, build a model, and eventually compare results. It is powerful, but rarely simple.

This time, I tried something different.

I uploaded the Census Income dataset into the MOSTLY AI Assistant and let the Assistant guide the entire process. What normally takes multiple steps and manual effort unfolded through a single conversational interface. From exploratory analysis to a full Train Synthetic, Test Real (TSTR) evaluation, the workflow was handled end to end.

The result was not just a good synthetic data experiment. It was a different way of working with data altogether.

1. Starting with a Simple Request

I began by asking:

“Analyze this data.”

The Assistant immediately:

  • inspected the schema
  • profiled numeric and categorical variables
  • calculated distributions
  • generated visualizations
  • highlighted which features most influenced income
  • produced summary statistics
  • saved shareable files and reports

It delivered the type of analysis that usually requires a notebook session and a fair amount of Python. Here, it arrived instantly and in an organized, readable format that set the foundation for the rest of the workflow.
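
To make that concrete, here is a minimal sketch of what the equivalent manual analysis might look like in pandas. It is illustrative only: the file name is a placeholder, and the column names assume the standard Census Income (Adult) schema.

```python
import pandas as pd

# Load the dataset; the file name is a placeholder.
df = pd.read_csv("census_income.csv")

# Schema inspection and summary statistics
print(df.dtypes)
print(df.describe(include="all").T)

# Distributions of the categorical variables
for col in df.select_dtypes(include="object").columns:
    print(df[col].value_counts(normalize=True).head())

# A quick proxy for feature influence: income rates per education level
# (column names assume the standard Adult/Census Income schema)
print(df.groupby("education")["income"].value_counts(normalize=True))
```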

2. Automatically Configuring a Synthetic Data Generator

Next, I asked:

“Can you configure an optimal synthetic data generator for this dataset?”

Using what it learned during the analysis, the Assistant:

  • selected an appropriate model size
  • enabled value protection
  • activated flexible generation
  • created a full configuration document
  • created the generator itself

This is typically a manual configuration step, but in this case the Assistant handled it with a single sentence.
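
To give a sense of what such a configuration document captures, here is a hypothetical sketch written as a plain Python dictionary. The field names are invented for readability and are not the actual MOSTLY AI configuration schema; the values simply mirror the choices listed above.

```python
# Hypothetical sketch only: these field names are invented for readability
# and are NOT the actual MOSTLY AI configuration schema.
generator_config = {
    "name": "census-income-generator",
    "model_size": "medium",       # sized to the dataset's scale and complexity
    "value_protection": True,     # shields rare values that could identify individuals
    "flexible_generation": True,  # enables conditional/partial generation later on
}
```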

3. Launching a Complete TSTR Experiment Through One Prompt

To validate predictive utility, I asked:

“Create an 80/20 split, train a generator on the 80 percent, generate synthetic data, train a model on both datasets, and compare the model performance on the real test set.”

The Assistant took over and sequentially handled every step:

  • created a stratified train/test split
  • trained the generator only on the training data
  • generated a synthetic dataset
  • trained an XGBoost model on the real training split
  • trained an XGBoost model on the synthetic data
  • evaluated both models against the same real 20 percent holdout

Along the way, the Assistant automatically managed the preprocessing required to prepare both datasets for modeling, including aligning column order and data types.

No notebook or custom code was needed; the Assistant handled everything.
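
For readers curious about what was abstracted away, here is a rough sketch of the equivalent manual TSTR loop using scikit-learn and XGBoost. It is illustrative only: file names are placeholders, the label encoding assumes the common ">50K"/"<=50K" values of the Adult dataset, and the generator training and sampling steps run on the platform rather than in this script.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
from xgboost import XGBClassifier

# Load real data; file names are placeholders, and the label encoding
# assumes the common ">50K" / "<=50K" values of the Adult dataset.
df = pd.read_csv("census_income.csv")
X = pd.get_dummies(df.drop(columns=["income"]), dtype=int)
y = (df["income"] == ">50K").astype(int)

# 1. Stratified 80/20 split; only the 80% goes to the generator.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2. Generator training and sampling happen on the platform; assume the
#    synthetic output has been exported to a CSV file.
syn = pd.read_csv("synthetic_census_income.csv")
X_syn = pd.get_dummies(syn.drop(columns=["income"]), dtype=int)
y_syn = (syn["income"] == ">50K").astype(int)

# 3. Align column order and dtypes, as the Assistant did automatically.
X_syn = X_syn.reindex(columns=X_tr.columns, fill_value=0)

# 4. Train one model per training source; evaluate both on the real holdout.
for name, (Xf, yf) in {"real": (X_tr, y_tr), "synthetic": (X_syn, y_syn)}.items():
    model = XGBClassifier(n_estimators=300, max_depth=6, eval_metric="logloss")
    model.fit(Xf, yf)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: AUC = {auc:.3f}, accuracy = {acc:.3f}")
```

Even compressed like this, the manual version spans loading, encoding, alignment, two training runs, and evaluation; that coordination is precisely what the Assistant absorbed into a single prompt.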

4. Comparing Real and Synthetic Model Performance

The results were strong and in line with what you would expect from high-fidelity synthetic data.

The synthetic-trained model performed within roughly one percentage point of the real-data baseline. This confirms that the generator learned the underlying relationships needed for predictive modeling.

5. What Stood Out Most

TSTR itself is not new. It has long been a standard way to assess synthetic data quality.

What is new is how the experiment ran.

The Assistant did not just generate data. It coordinated the entire workflow:

  • analysis
  • generator configuration
  • generator training
  • synthetic data production
  • ML training
  • evaluation
  • reporting

All through natural language.

This eliminates the friction that usually exists between wanting to evaluate synthetic data and having the time, tooling, and expertise to run a full experiment. Anyone on a team can now run a validated synthetic workflow without switching tools or writing code.

6. Why This Matters for Data Teams

This approach changes how teams interact with data:

  • Experimentation becomes faster because setup work disappears
  • Analysis is available instantly without notebook overhead
  • Synthetic data can be validated early in a project rather than late
  • Privacy-safe data becomes easier to share and collaborate on
  • Non-technical users can participate directly in the process

Instead of building and maintaining pipeline steps by hand, teams can focus on evaluating outcomes and exploring new ideas.

7. Final Thoughts

Running a TSTR experiment on its own is straightforward. Running the entire process (analysis, generator configuration, training, synthetic data generation, modeling, evaluation, and reporting) through conversational interaction is something different.

It shows what happens when synthetic data capabilities, machine learning workflows, and a natural-language interface are all connected in one environment.

This demonstrates how the MOSTLY AI Assistant turns a complex workflow into a simple dialogue while still producing rigorous, meaningful results.