In order to generate high-quality synthetic data, the MOSTLY AI Platform uses advanced deep learning models. Our models learn everything there is to know about your data! This raises the question: can the models also infer values for unseen data? The answer is yes. In this short blog post we'll show how MOSTLY AI Generators can be used as flexible and accurate Universal Prediction Models. Granted, it's a slightly unusual use of our Platform, but an exciting and quite powerful one nevertheless!

Imagine training a Generator on a dataset with 100 columns. When we then generate synthetic data from scratch, we typically produce 100 values per row, all correlated with each other just as the Generator learned during model training. However, if you already have 99 of the values and are missing just one, you can present the Generator with your 99 and ask it to generate only the missing one.

Indeed, you can flexibly seed the Generator with any number of columns and expect it to fill in the rest. The only requirement is that the values you present lie within the range of the training data. For example, if the training data contained a column with 17 distinct values, you should not present an unseen 18th value. Similarly, if column X contained values between 0 and 1,000 in the training data, you should not ask for a prediction based on a negative value. The model may not know what to do!
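As a minimal sketch of that sanity check (assuming your training data and seed rows live in pandas DataFrames; all column names and values below are made up for illustration), you could verify seed values against the training ranges before requesting a prediction:

```python
import pandas as pd

def validate_seed(train: pd.DataFrame, seed: pd.DataFrame) -> list[str]:
    """Return warnings for seed values that fall outside the training data's range."""
    issues = []
    for col in seed.columns:
        if col not in train.columns:
            issues.append(f"unknown column: {col}")
            continue
        if pd.api.types.is_numeric_dtype(train[col]):
            # numeric column: flag values outside the observed min/max
            lo, hi = train[col].min(), train[col].max()
            bad = seed[(seed[col] < lo) | (seed[col] > hi)]
            if not bad.empty:
                issues.append(f"{col}: {len(bad)} value(s) outside [{lo}, {hi}]")
        else:
            # categorical column: flag categories never seen during training
            unseen = set(seed[col].dropna()) - set(train[col].dropna())
            if unseen:
                issues.append(f"{col}: unseen categories {sorted(unseen)}")
    return issues

# Toy data: 'age' spans 0..100, 'color' has two known categories
train = pd.DataFrame({"age": [0, 50, 100], "color": ["red", "blue", "red"]})
seed = pd.DataFrame({"age": [-5], "color": ["green"]})
print(validate_seed(train, seed))
```

Running the example flags both the negative age and the unseen category, which are exactly the kinds of inputs the Generator was never trained on.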

It is this flexibility in generating synthetic data that makes the MOSTLY AI Platform a powerful tool in a data scientist's stack. At one end of the spectrum you are creating fully synthetic data; at the other end you are effectively performing inference. Everything in between is a hybrid.

Many data scientists will know exactly how to build a machine learning model to infer a specific column, but the universality of a Generator makes it attractive for all data professionals. You can infer not just one variable at a time, but as many as you like. And this can be done very easily using the Seeded or Conditional Generation feature of the MOSTLY AI Platform.
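To make the mechanics tangible, here is a toy stand-in for seeded generation — it is emphatically not the Platform's deep learning model, just a crude empirical sampler over the training rows (all data and names invented): it draws the missing columns conditionally on whatever values you seed.

```python
import pandas as pd

def seeded_generate(train: pd.DataFrame, seed: pd.DataFrame) -> pd.DataFrame:
    """For each seed row, sample the missing columns from training rows
    that match the seeded values (a crude empirical conditional sampler)."""
    missing = [c for c in train.columns if c not in seed.columns]
    out = []
    for _, row in seed.iterrows():
        # keep only training rows consistent with the seeded values
        mask = pd.Series(True, index=train.index)
        for col in seed.columns:
            mask &= train[col] == row[col]
        pool = train[mask] if mask.any() else train  # fall back to the marginal
        sampled = pool.sample(1, random_state=0)[missing].iloc[0]
        out.append({**row.to_dict(), **sampled.to_dict()})
    return pd.DataFrame(out, columns=list(train.columns))

# Seed only 'education'; 'income' gets filled in conditionally
train = pd.DataFrame({
    "education": ["HS", "HS", "BSc", "BSc"],
    "income": ["low", "low", "high", "high"],
})
seed = pd.DataFrame({"education": ["BSc"]})
print(seeded_generate(train, seed))
```

The real Generator does the same thing in spirit, but samples from a learned deep-learning model rather than from matching training rows, so it generalizes far beyond exact matches.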

To make this concept as concrete as possible, we have prepared a code notebook for you. There, we use a 50k-row US Census dataset for the experiment and ask the Generator to infer “race” and “marital status” for 10k rows (the test split). We achieve about 80% accuracy when inferring both variables at the same time. In comparison, dedicated models trained for each variable individually achieve slightly higher accuracies of 83% and 87%, respectively.
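The joint-versus-individual accuracy comparison in that experiment can be computed in a few lines. A hedged sketch (toy labels below are invented; in the notebook the true values come from the held-out test split and the predictions from the Generator):

```python
import pandas as pd

def accuracies(true: pd.DataFrame, pred: pd.DataFrame, targets: list[str]) -> dict:
    """Per-column accuracy, plus joint accuracy (all targets correct at once)."""
    scores = {c: (true[c] == pred[c]).mean() for c in targets}
    scores["joint"] = (true[targets] == pred[targets]).all(axis=1).mean()
    return scores

true = pd.DataFrame({"race": ["A", "B", "A", "B"],
                     "marital_status": ["m", "s", "s", "m"]})
pred = pd.DataFrame({"race": ["A", "B", "B", "B"],
                     "marital_status": ["m", "s", "s", "s"]})
print(accuracies(true, pred, ["race", "marital_status"]))
```

Note that joint accuracy is necessarily no higher than either per-column accuracy, which is why inferring both variables at once scores somewhat below the dedicated single-variable models.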

Check out how everything works in this short video.

We want to hear from you: what unusual use cases are you using Generators for?