December 11, 2023
3m 30s

MOSTLY AI: Synthetic Data Generation with Built-in Privacy

Join us in this tutorial as we explain how MOSTLY AI's synthetic data platform generates privacy-safe synthetic data.

➡️ Discover how the platform utilizes state-of-the-art privacy mechanisms to anonymize and safeguard your data while maintaining its utility and integrity.
➡️ Learn about the innovative techniques employed by MOSTLY AI to create highly realistic, synthetic data that ensures privacy without compromising on accuracy.
➡️ Understand the training process and explore the built-in privacy safeguards, such as rare category protection and extreme value protection.

🖇️ Sign up for your free synthetic data generation account here: https://bit.ly/3GyDSHC

🖇️ Learn more about data anonymization tools and how to choose the best method for your needs in this blog post: https://bit.ly/3TjF4GJ

Key Moments with Time Stamps:
00:01 - Introduction to Synthetic Data and MOSTLY AI
00:08 - Overview of Privacy Mechanisms in MOSTLY AI
00:16 - Balancing Data Utility and Privacy
00:27 - Training Generative AI Models with Original Data
00:46 - Ensuring No Direct Relationship with Source Data
00:59 - Behind the Scenes: Synthetic Data Generation Process
01:07 - Data Distribution Replication in Synthetic Data
01:34 - Built-in Safeguards for Data Privacy
01:49 - Overfitting Prevention and Individual Record Protection
02:02 - Handling Rare Category Values and Extreme Values
02:23 - Special Measures for Time Series Data
02:38 - Default Privacy Settings and Data Retention Policy
02:51 - How to Get Started with MOSTLY AI

Transcript

[00:00:01] In this short video, we're going to talk about how synthetic data generated in the MOSTLY AI platform is anonymized and made privacy-safe thanks to state-of-the-art privacy mechanisms. With MOSTLY AI, you can unlock the full utility of your original data and, at the same time, protect the privacy of your subjects.

[00:00:22] Unlike legacy anonymization techniques, MOSTLY AI only uses your original source data to train generative AI models. During the training process, the model learns the patterns and characteristics of your original data, and the platform then uses this AI model to generate brand-new, highly realistic synthetic data from scratch.

[00:00:46] As a result, the synthetic data bears no one-to-one relationship with the source data and ensures that it's not possible to re-identify any sensitive information directly.

[00:00:59] What's happening behind the scenes? MOSTLY AI generates synthetic data that follows the distributions of data from the source. A gender column from the original data with 45% male subjects would mean that a synthetic male subject would appear in our results about 45% of the time.

[00:01:19] Of course, this is a basic example, with the actual process taking into account the relationships and properties of all the columns in the data, making each synthetic record much more accurate and ensuring overall privacy.
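
The 45% example above can be sketched in a few lines of Python. This is only an illustration of sampling one column from learned frequencies, not MOSTLY AI's actual generative model, which, as the transcript notes, also accounts for relationships across all columns. The function names `fit_marginal` and `sample_marginal` and the toy data are assumptions made for this sketch.

```python
import random
from collections import Counter


def fit_marginal(values):
    """Estimate category frequencies from the original column."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def sample_marginal(frequencies, n, seed=0):
    """Draw n brand-new values from the learned frequencies."""
    rng = random.Random(seed)
    categories = list(frequencies)
    weights = [frequencies[c] for c in categories]
    return rng.choices(categories, weights=weights, k=n)


# Toy "original" column: 45% male subjects, as in the transcript's example.
original = ["male"] * 45 + ["female"] * 55
frequencies = fit_marginal(original)

# The synthetic values are sampled from the frequencies, not copied from rows.
synthetic = sample_marginal(frequencies, n=10_000)
male_share = synthetic.count("male") / len(synthetic)
print(f"synthetic male share: {male_share:.3f}")
```

The printed share should land close to 0.45, even though no synthetic value is tied to a specific original record.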

[00:01:34] Now, there are several safeguards built into this process. First, the model takes care not to overfit, ensuring that it learns only the general patterns of the data and not the individual records themselves. Also, protection is applied for rare category values, where an attribute may inadvertently leak the presence of individuals in a dataset.

[00:01:58] To handle this, MOSTLY AI can generate synthetic data with a rare category value to ensure these attributes are privacy-safe. Likewise, extreme value protection removes extreme values from the data distribution of columns and ensures

[00:02:15] that the synthetic data does not reveal exceptional cases or outliers that could be used in a privacy attack.
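
A minimal sketch of the two safeguards just described, under stated assumptions: the transcript does not give thresholds or the exact mechanism, so the `min_count` threshold, the `_RARE_` placeholder, and clipping as the way to remove extremes are illustrative choices here, not the platform's documented behavior.

```python
from collections import Counter


def protect_rare_categories(values, min_count=5, placeholder="_RARE_"):
    """Replace categories seen fewer than min_count times with a placeholder.

    min_count and placeholder are hypothetical values for this sketch.
    """
    counts = Counter(values)
    return [v if counts[v] >= min_count else placeholder for v in values]


def clip_extremes(values, k=1):
    """Clip values to bounds that ignore the k smallest and k largest entries.

    Requires more than 2 * k values. Clipping is one possible way to keep
    outliers out of the learned range; it is an assumption of this sketch.
    """
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(v, low), high) for v in values]


# A single resident of a small town could reveal that person's presence.
cities = ["Vienna"] * 6 + ["Graz"] * 5 + ["Hallstatt"]
protected_cities = protect_rare_categories(cities)

# One exceptional income stands out from the rest of the column.
incomes = [30, 32, 35, 38, 40, 41, 45, 900]
protected_incomes = clip_extremes(incomes, k=1)

print(protected_cities)
print(protected_incomes)
```

Applying such steps before training means the model never sees the rare value or the outlier, so it cannot reproduce them in the synthetic data.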

[00:02:23] For time series datasets, there's also protection against extreme sequences of data to protect against re-identification of subjects and their behaviours.
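
The transcript does not say what counts as an extreme sequence. One plausible reading is a subject with an unusually long event history, and the sketch below caps sequence length under that assumption. The function name `cap_sequence_lengths` and the `max_len` value are hypothetical.

```python
def cap_sequence_lengths(sequences, max_len):
    """Keep at most max_len events per subject (assumed protection rule)."""
    return {subject: events[:max_len] for subject, events in sequences.items()}


# Toy purchase histories: subject_b has an unusually long sequence.
purchases = {
    "subject_a": [12.0, 8.5, 20.0],
    "subject_b": [5.0] * 250,
}
capped = cap_sequence_lengths(purchases, max_len=50)
print({subject: len(events) for subject, events in capped.items()})
```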

[00:02:34] With MOSTLY AI, these and additional privacy settings are on by default, and the original data used for training is never retained. Any uploaded files are deleted immediately after your AI models finish training.

[00:02:49] Privacy is central to MOSTLY AI, with our platform evaluated and stress tested by researchers against privacy attacks without ever being compromised in any way. Our models learn data patterns without direct re-identification risk.

[00:03:05] We prevent overfitting and safeguard against outliers, with privacy being our default priority for your synthetic data.

[00:03:13] Visit mostly.ai to sign up for a free account to get started on your privacy-safe synthetic data journey.

[00:03:21] [music]
