The Gold Standard for Data Governance: Databricks Unity Catalog & MOSTLY AI’s Synthetic Data

Written by

TL;DR Databricks Unity Catalog provides a unified data governance framework for data assets, offering granular access controls and traditional data masking capabilities. MOSTLY AI’s Synthetic Data complements this by generating fully anonymous, synthetic datasets via Generative AI, thus offering a solution to the limitations of traditional data anonymization methods.

Databricks Unity Catalog

Databricks Unity Catalog offers a unified governance solution for assets on Databricks. Administrators can grant permissions for all assets including files, tables, and ML models. For tabular data, Unity Catalog allows you to control on the row and column level who gets access to what.

Unity Catalog also comes with a simple form of data anonymization: data masking. Data masking is an approach to obfuscate sensitive data by replacing it with neutral data. Unity Catalog lets the user define “Column masks” that apply a masking function to a column. The masking logic can be very simplistic (e.g. just outputting stars instead of the actual data), or more sophisticated taking in additional columns as input parameters.

The following example from the Databricks documentation shows how the ssn column of a users table can be configured to only return ***-**-**** instead of the actual Social Security number, except for members of the user group 'HumanResourceDept' who will see the actual ssn.

CREATE FUNCTION ssn_mask(ssn STRING)
  RETURN CASE WHEN is_member('HumanResourceDept') THEN ssn ELSE '***-**-****' END;

--Create the `users` table and apply the column mask in a single step:

CREATE TABLE users (
  name STRING,
  ssn STRING MASK ssn_mask);

So, to sum it up - Unity Catalog allows you to define granular access and data privacy controls within Databricks. In addition, it provides data masking of tabular data. While this will get you quite far in order to comply with data privacy regulations, there are situations where this will not be sufficient. And the reason for that are the limitations of traditional data anonymization approaches.

Limitations of Traditional Data Anonymization

Traditional data anonymization methods such as masking, generalization, obfuscation, adding noise, etc. face key limitations, particularly as data privacy requirements become more complex and demanding. Here’s a detailed look at the primary downsides:

Risk of Re-identification: Traditional methods primarily focus on direct identifiers like names, or social security numbers that can uniquely identify individuals. However, these methods overlook quasi-identifiers—attributes that don't uniquely identify individuals but can be combined with other quasi-identifiers to re-identify individuals. Examples of quasi-identifiers include age, gender, zip code, and profession. Even when direct identifiers are masked, datasets containing these quasi-identifiers can still be linked with other public datasets to re-identify individuals, posing significant privacy risks.
Reduced Data Utility: By altering or removing data to prevent identification, traditional data anonymization methods can degrade the quality of the data, making it less useful. For instance, masking specific attributes in a dataset might prevent the use of that data in predictive models where those attributes are essential, thus limiting the potential insights from the data.
Manual Effort to Define Rules: Traditional data anonymization requires substantial manual effort to define and apply rules on a column by column level. Each dataset might require a unique set of rules based on the sensitivity of the data, the context of its use, and the applicable regulations. This process is both time-consuming and prone to human error. The need to continuously update and maintain these rules as new data is collected or as legal standards evolve adds an additional layer of complexity and effort.

These limitations highlight why simply masking direct identifiers is not enough to ensure data privacy for highly sensitive data. Now that we are aware of the limitations of traditional anonymization techniques, let’s turn our attention to the alternative: GenAI created synthetic data.

MOSTLY AI’s Synthetic Data

Synthetic data - as we refer to it - is data that is created using Generative AI that was trained on original data to learn from. A Generative AI model learns all the patterns and characteristics of the original data and then can be used to generate a new version of that data: an artificial version - a synthetic dataset. And because synthetic data is artificially created from scratch there is no 1:1 link to the original data anymore. Executed well, together with a bunch of sophisticated privacy preserving measures, synthetic data is fully anonymous data that doesn’t contain any Personal Identifiable Information (PII). Synthetic data is used in industries where data is sensitive and privacy of utmost importance, industries such as banking, insurance, healthcare, and telecommunications. Of course these are industries that Databricks serves as well - and this is where we can bring everything together.

Bringing it all together

If you are using Databricks, leveraging Unity Catalog will be your first step to securely manage datasets and implement your data governance approach. You can make sure that the right individuals will get access to the right datasets in your organization. And if you are not dealing with very sensitive data, you will probably rely on the in-built data masking capabilities to de-identify tabular data. But once you have the need for fully anonymous data (e.g. when you want to share highly sensitive datasets) you will start looking for an alternative approach and that is where the MOSTLY AI Synthetic Data Platform will come into play.

Because of our partnership with Databricks, our platform natively integrates with Databricks and via Unity Catalog you can grant MOSTLY AI access to directly read highly sensitive data in order to train a Generative AI model. While the MOSTLY AI Platform (securely installed in your compute environment) will have access to the sensitive data, even the users of the MOSTLY AI Platform will not. The training of the Generative AI model with that data, and the subsequent creation of a Synthetic Dataset is something that you’ll be able to conveniently control natively in Databricks leveraging the MOSTLY AI Python Client. No need to configure any masking rules, no need to worry about the risky disclosure of combinations of un-masked attributes. The Synthetic Data will be private and fully anonymous as every attribute of the data will be considered potentially sensitive and synthetically generated. The Synthetic Data can then be stored again directly into Databricks Unity Catalog where you can grant many more users in your organization access to that data.

That’s what we call true data democratization!

If you’re curious to learn more about our partnership with Databricks click here.