Synthetic data is data that's artificially generated rather than collected from real-world events. It is typically created with the help of algorithms or simulations and often used in settings where real-world data is hard to collect or where privacy concerns exist.
The term has been around for many years. In the past, synthetic data was most often understood as “rule-based” synthetic data: a user would explicitly define the rules by which the data was generated. For example: create a numerical variable without decimals, ranging from 100 to 1,000, that follows a normal distribution.
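To make the rule-based approach concrete, here is a minimal sketch of the example rule in Python. The function name `generate_rule_based` and the mean and standard deviation (550 and 150) are assumptions chosen for illustration, since the rule itself only gives the range and the distribution family. Keeping values inside the range by redrawing out-of-range samples is also just one possible choice, and it means the result is a truncated normal distribution rather than an exact one.

```python
import random


def generate_rule_based(n, low=100, high=1000, mean=550.0, std=150.0, seed=0):
    """Generate n whole numbers in [low, high] from a (truncated) normal distribution.

    mean, std and seed are illustrative assumptions, not part of the rule.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    values = []
    while len(values) < n:
        # Draw from a normal distribution and round to drop the decimals.
        x = round(rng.gauss(mean, std))
        # Enforce the range rule by discarding out-of-range draws.
        if low <= x <= high:
            values.append(x)
    return values


sample = generate_rule_based(1000)
```

Every property of the output has to be spelled out by hand in this style, which is why rule-based generation becomes impractical once a dataset has many interdependent columns.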
When we talk about synthetic data, we mean synthetic data generated by machine learning. For this kind of synthetic data, generative AI is used to create data that can be highly complex - far beyond what a user could describe with simple rules. The result is data that looks and feels just like real-world data and that preserves its statistical information but contains no Personally Identifiable Information (PII).