Introducing the MOSTLY AI PRIZE

TL;DR: The MOSTLY AI PRIZE is a global competition inviting data scientists, researchers, developers, and anyone else who is interested to create the most accurate and privacy-safe synthetic tabular data. With a grand prize of $100,000, the challenge runs from May 14 to July 3, 2025. Participants will work with real-world datasets to develop synthetic versions that maintain statistical fidelity while ensuring data privacy. The competition aims to advance the field of synthetic data, promoting open data accessibility and innovation in AI.

Motivation

Open Source is essential for a healthy AI ecosystem – and Open Data is just as vital. Yet today, scraped web data dominates the field. The intelligence of tomorrow won't be built on tweets and cat pictures alone. The next frontier lies in observational and behavioral data – granular datasets about demographics, mobility, housing, health, education, transactions, media consumption, and more. These are the building blocks for understanding our world and building truly inclusive AI systems. We must therefore promote access to and sharing of these valuable information sources to ensure AI development benefits everyone.

High-fidelity, privacy-safe Synthetic Data generators allow data owners to share the knowledge contained in these assets while protecting individual-level information. Data assets that were previously locked up can be shared across teams, borders, and organizations. True collaboration on otherwise inaccessible data becomes possible again!

We at MOSTLY AI are on a mission to make this vision a reality. Our open-source Synthetic Data SDK helps anyone generate privacy-safe data at scale. This competition is a call to action - help us push synthetic data further and make open data truly accessible! 🌍

Competition

Remember the legendary Netflix Prize? It shaped an entire industry - but privacy issues forced its sequel to be cancelled. Why? Because anonymization doesn't work on rich, high-dimensional data. It's time for a fix: Synthetic data to the rescue!

This competition features two independent synthetic data challenges that you can join separately:

The FLAT DATA Challenge
The SEQUENTIAL DATA Challenge

For each challenge, your task is to generate a new dataset that matches the size and structure of the original while preserving its statistical patterns. At the same time, the synthetic records must not be significantly closer to the (released) original samples than to the (unreleased) holdout samples.
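To make the "closer to the original than to the holdout" idea concrete, here is a minimal sketch of one common way to check it: for every synthetic record, compare its distance to the closest training record with its distance to the closest holdout record. This is an illustration only, not the competition's official scoring. The function names, the Euclidean distance, the numeric-only toy data, and the "ties count as half" rule are all assumptions made for this example.

```python
import math
import random


def nearest_distance(record, reference):
    """Euclidean distance from one record to its closest row in `reference`."""
    return min(math.dist(record, row) for row in reference)


def share_closer_to_training(synthetic, training, holdout):
    """Fraction of synthetic records that sit closer to training than to holdout.

    Ties count as half. With equally sized training and holdout sets, a
    generator that has not memorized training rows should land near 0.5.
    """
    score = 0.0
    for record in synthetic:
        d_train = nearest_distance(record, training)
        d_hold = nearest_distance(record, holdout)
        if d_train < d_hold:
            score += 1.0
        elif d_train == d_hold:
            score += 0.5
    return score / len(synthetic)


# Hypothetical toy data: two numeric columns, all drawn from one distribution.
rng = random.Random(0)


def draw(n):
    return [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]


training, holdout = draw(200), draw(200)

# A "generator" that produces fresh draws generalizes; one that copies
# the training rows verbatim leaks them.
fresh = draw(200)
copied = list(training)

fresh_share = share_closer_to_training(fresh, training, holdout)
copied_share = share_closer_to_training(copied, training, holdout)

print(f"fresh draws:   {fresh_share:.2f}")
print(f"copied rows:   {copied_share:.2f}")
```

In this toy setup the copied dataset scores far above the fresh draws, which is the signal an evaluation of this kind looks for. Real tabular data mixes numeric and categorical columns, so a practical check would need a suitable encoding or distance measure.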

To succeed, you'll typically train a generative model on the training data, ensuring it generalizes well without overfitting. You're free to build upon any existing open-source library (Synthetic Data SDK, synthcity, reprosyn, etc.), or start building your own solution from scratch. Just make sure your submission is end-to-end open-source, reproducible, and can run in under 6 hours on a standard machine.