In this blog post, we present the results of a benchmark study comparing the performance of regression models trained on synthetic and original data. Our aim is to evaluate how well models trained on synthetic text generated with the MOSTLY AI platform perform compared with models trained on the original data and with models trained on synthetic data generated by GPT-4o-mini. We use a dataset of wine reviews, each consisting of a free-text description and a corresponding rating between 85 and 100 points. The task is a regression problem: predict the wine rating from the review text.

Methodology and Experimental Setup

Original Data: The dataset consists of two columns:

  • Text column [notes]: Contains free-text descriptions/notes of the wine provided by reviewers, which may include tasting notes, aromatic qualities, or comments on the wine’s body and flavor profile.
  • Rating column [score]: A numerical rating between 85 and 100 points that reflects the overall quality of the wine.

Synthetic Data Generation with MOSTLY AI: Using the MOSTLY AI platform, we generated synthetic versions of the wine review dataset by:

  1. Fine-tuning microsoft/phi-1_5 and Locutusque/TinyMistral-248M pretrained models on the original data.
  2. Training a custom LSTM language model from scratch on the original data.

These models generated synthetic wine reviews while maintaining the structure and statistical properties of the original data, all in a privacy-preserving manner. The score column is synthesized by MOSTLY AI’s generative models for structured data, which retain its correlation with the notes column.

Synthetic Data Generation with GPT-4o-mini: We explored two approaches using GPT-4o-mini:

  1. Zero-shot approach: We prompted GPT-4o-mini to generate synthetic wine reviews and ratings without providing any examples.
  2. Few-shot approach: We supplied GPT-4o-mini with around 50 example review-rating pairs to guide the generation of additional synthetic data.
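
To make the few-shot approach more concrete, here is a minimal sketch of how such a prompt could be assembled from review-rating pairs. The function name `build_few_shot_prompt`, the instruction wording, and the two placeholder reviews are all assumptions made for illustration; the actual prompt used in the study is not shown in this post, and the call to GPT-4o-mini itself is left out.

```python
import random


def build_few_shot_prompt(examples, n_new, n_shots=50, seed=0):
    """Assemble a few-shot prompt from (notes, score) pairs.

    The instruction wording is hypothetical, not the prompt used in the study.
    """
    rng = random.Random(seed)
    # Sample up to n_shots example pairs (the post mentions around 50).
    shots = rng.sample(examples, min(n_shots, len(examples)))
    lines = [
        f"Generate {n_new} new wine reviews, each with a rating between 85 and 100.",
        "Follow the style of these examples:",
        "",
    ]
    for notes, score in shots:
        lines.append(f"Review: {notes}")
        lines.append(f"Rating: {score}")
        lines.append("")
    lines.append("New reviews:")
    return "\n".join(lines)


# Placeholder pairs invented for this sketch, not taken from the dataset.
examples = [
    ("Ripe cherry and cedar on the nose, with firm tannins.", 91),
    ("Light and crisp, showing citrus and green apple.", 87),
]
prompt = build_few_shot_prompt(examples, n_new=5)
print(prompt)
```

In the zero-shot variant, the same instruction would be sent without the example block.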

Train synthetic, test real: We hold out a fraction of the original data for evaluation. This holdout is used neither for training or fine-tuning the generative models on the MOSTLY AI platform nor for training the regression models. This setup gives a fair evaluation of how well models trained on synthetic data generalize to real data.
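
The holdout step can be sketched as follows. This is an illustration under assumptions: the post does not state the holdout fraction or the random seed, so the values `0.2` and `42` are placeholders, and `split_holdout` is a hypothetical helper name. The rows are dummy records that only mimic the two columns `notes` and `score`.

```python
import random


def split_holdout(rows, holdout_fraction=0.2, seed=42):
    """Randomly split rows into a training part and a holdout part.

    holdout_fraction and seed are assumed values, not those of the study.
    """
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_holdout = int(round(len(rows) * holdout_fraction))
    holdout_idx = set(indices[:n_holdout])
    train = [row for i, row in enumerate(rows) if i not in holdout_idx]
    holdout = [row for i, row in enumerate(rows) if i in holdout_idx]
    return train, holdout


# Dummy records standing in for the original wine reviews.
rows = [
    {"notes": f"placeholder review {i}", "score": 85 + i % 16}
    for i in range(100)
]
train_rows, holdout_rows = split_holdout(rows)
print(len(train_rows), len(holdout_rows))
```

Only `train_rows` would be passed on to the generative models; `holdout_rows` stays untouched until the final evaluation.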

Regression Model: To evaluate performance, we used AutoGluon, an automated machine learning (AutoML) tool that automatically selects and tunes the best regression model based on the synthetic or original training data. The models were then tested on the real holdout data.
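
A minimal sketch of this evaluation loop is shown below. The post does not say which AutoGluon module or settings were used, so the use of `TabularPredictor` with default settings is an assumption, as are the helper names `rmse` and `train_synthetic_test_real`. The data frames are expected to be pandas DataFrames with the columns `notes` and `score`.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def train_synthetic_test_real(train_df, holdout_df, label="score"):
    """Fit AutoGluon on train_df and return the RMSE on the real holdout.

    Assumes pandas DataFrames and AutoGluon's TabularPredictor; the study's
    actual configuration is not given in the post.
    """
    # Imported lazily so that rmse() can be used without AutoGluon installed.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(
        label=label,
        problem_type="regression",
        eval_metric="root_mean_squared_error",
    ).fit(train_df)
    predictions = predictor.predict(holdout_df.drop(columns=[label]))
    return rmse(holdout_df[label].tolist(), predictions.tolist())


# Tiny check of the metric on made-up numbers.
print(rmse([90, 92], [91, 91]))
```

The function would be called once per training set (original data and each synthetic variant), always with the same real holdout, so that the resulting scores are directly comparable.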

Results: Comparing the Accuracy of a Regression Model Trained on Synthetic Data

Here are the root-mean-squared error (RMSE) scores on the real holdout dataset for the models trained on original and on synthetic data:

Data set for training the regression model         RMSE (lower is better)
original data                                      1.87
synthetic data based on fine-tuned MS-Phi          1.99
synthetic data based on fine-tuned tiny Mistral    2.07
synthetic data based on trained LSTM               2.09
few-shot synthetic data from GPT-4o-mini           2.12
zero-shot synthetic data from GPT-4o-mini          4.02
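
To put the scores listed above into relative terms, the following short sketch computes how much each RMSE exceeds the original-data baseline. The numbers are copied from the results; the variable names are chosen for illustration only.

```python
# RMSE values as reported in the results above.
rmse_by_training_set = {
    "original data": 1.87,
    "fine-tuned MS-Phi": 1.99,
    "fine-tuned tiny Mistral": 2.07,
    "trained LSTM": 2.09,
    "few-shot GPT-4o-mini": 2.12,
    "zero-shot GPT-4o-mini": 4.02,
}

baseline = rmse_by_training_set["original data"]

# Relative increase in error compared with training on original data.
relative_increase = {
    name: (score - baseline) / baseline
    for name, score in rmse_by_training_set.items()
}

for name, increase in relative_increase.items():
    print(f"{name}: +{increase:.1%}")
```

This gives an increase of roughly 6.4% for MS-Phi, 10.7% for tiny Mistral, 11.8% for the LSTM, 13.4% for few-shot GPT-4o-mini, and 115.0% for zero-shot GPT-4o-mini.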

Examples of Synthetic Wine Reviews

To illustrate the quality of the synthetic data generated by the MOSTLY AI platform, here are two examples of synthetic wine reviews produced by the microsoft/phi-1_5 and LSTM models:

  1. Synthetic Review Example 1 (phi-1_5)
    • Text: "Dark ruby red color. Nose very complex floral with notes of violet, rose, thyme and wild berries. Nice, silky, well-structured palate, quite open. Very fat, full, long lasting finish."
    • Predicted Rating: 91
  2. Synthetic Review Example 2 (LSTM)
    • Text: "Aromas of dried flowers, red fruits and hints of orange rind. Greatly rich and aromatic, showing great balance. Rich, complex fruity and complex with notes of peaches and honey. Already aptly in its youth with an excellent aging potential."
    • Predicted Rating: 89

These synthetic reviews demonstrate the models' ability to generate coherent text that reads like a genuine wine review and describes plausible wine attributes, while preserving privacy.

Key Insights

  1. Synthetic data from MOSTLY AI rivals the performance of original data: Models trained on synthetic data generated by MOSTLY AI reached an RMSE between 1.99 and 2.09, compared with 1.87 for the model trained on original data, a small loss in prediction performance.
  2. Few-shot GPT-4o-mini performs moderately well: The few-shot approach (RMSE 2.12) is far more accurate than the zero-shot method (RMSE 4.02), but it still trails all three MOSTLY AI variants.
  3. Zero-shot GPT-4o-mini generates the weakest results: Without task-specific training or examples, the zero-shot approach produces the highest error (RMSE 4.02), more than twice that of the model trained on original data, highlighting the limitations of generalist models without domain-specific fine-tuning.

Conclusion: Synthetic Data with MOSTLY AI Outperforms Zero/Few-Shot Approaches

This study demonstrates that when original data is not available due to privacy constraints, synthetic data generated by MOSTLY AI provides a highly effective alternative. The regression models trained on MOSTLY AI’s synthetic data achieve nearly the same accuracy as those trained on original data, making it a powerful tool for privacy-preserving data analysis.

In contrast, while GPT-4o-mini can generate synthetic data without fine-tuning, the performance gap is evident, especially in the zero-shot scenario. For simple tasks, the few-shot approach with GPT-4o-mini offers moderate performance, but it still falls short compared to synthetic data produced by more specialized models.

For enterprises looking to balance privacy with performance, MOSTLY AI's platform offers the most reliable solution for generating synthetic data that can match the accuracy of models trained on real-world data.