Key features

  • Stores technical metadata, such as the names and number of tables and columns in a data source, data types, and relationships across tables.

  • Enables MOSTLY AI to synthesize entire Oracle, MS SQL, and PostgreSQL databases.

  • Automatically determines the best synthetization approach to maintain the referential integrity of the original database, while preserving the security of your sensitive data assets.

  • Doesn’t retain any static data in memory.

  • Allows admins to manage a database’s synthetization settings.

  • Enables automation of synthetic data delivery pipelines.

  • Simplifies manual synthetization tasks, as users don’t need to set up jobs from scratch.

Data catalogs connect to your data sources, extract information about the data inside, and set them up for repeatable synthetic data generation. They simplify the management and reuse of synthetization settings for CSV and Parquet files.

What is a data catalog?

In simple terms, a data catalog is a document that describes what’s available in a data source, together with the requirements to create a privacy-secure version of its contents. It gives you the certainty of knowing what you’re going to get, without needing detailed insight into the actual data.

Better yet, data catalogs allow for the contents to change, providing you with the opportunity to share the latest, up-to-date synthetic copies of that data source.

Let’s expand on this with the analogy of a library. In a library’s catalog, you’ll find records about the books on offer. These records tell you whether a book is available and where you can find it, and they provide a short description of the book’s contents.

A data catalog in MOSTLY AI is very similar to these records. But to understand the capabilities of this data catalog, let’s envision having records for newspapers rather than books in the library. These records couldn’t describe the contents of a newspaper, as they change all the time. Instead, they would describe the themes the newspaper covers and how it is structured (front page articles, feature articles, international news, editorials, etc.), allowing you to easily find a particular point of view on the latest events.

Similarly, a data catalog describes the structure of the contents of a data source: how many tables there are, how they are related, which columns they have, and how they should be synthesized.

In this way, a data catalog allows the original data to change as long as its structure stays the same, enabling you to share the latest, up-to-date synthetic copy across your company and partnerships.
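To make the idea concrete, the structural description a catalog holds can be pictured as a small data structure. The following Python sketch is only an illustration and not MOSTLY AI’s actual catalog format: the class names (`Column`, `Table`, `Catalog`), the `same_structure` helper, the `Customer ID` and `Order ID` columns, and the data type labels are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str  # hypothetical type label, e.g. "integer" or "date"


@dataclass(frozen=True)
class Table:
    name: str
    columns: tuple[Column, ...]
    # (foreign key column, referenced table) pairs
    foreign_keys: tuple[tuple[str, str], ...] = ()


@dataclass(frozen=True)
class Catalog:
    tables: tuple[Table, ...]

    def same_structure(self, other: "Catalog") -> bool:
        """True if both catalogs describe the same tables, columns, and links."""
        return set(self.tables) == set(other.tables)


customers = Table(
    "Customers",
    (Column("Customer ID", "integer"), Column("Birthdate", "date")),
)
orders = Table(
    "Orders",
    (
        Column("Order ID", "integer"),
        Column("Customer ID", "integer"),
        Column("Order date", "date"),
    ),
    foreign_keys=(("Customer ID", "Customers"),),
)

catalog = Catalog((customers, orders))
# A later snapshot holds different rows but the same structure,
# so the same catalog still describes it.
later_snapshot = Catalog((orders, customers))
# Dropping a column changes the structure.
changed = Catalog((Table("Customers", (Column("Customer ID", "integer"),)), orders))

print(catalog.same_structure(later_snapshot))  # True
print(catalog.same_structure(changed))         # False
```

Note that the sketch stores no row data at all: only names, types, and relationships, which is why the rows in the source can change freely between runs.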

MOSTLY AI’s database synthetization features

Databases often have complex relationships between various types of data. In order to create a synthetic version of such a database that is usable, realistic, and privacy secure, the referential integrity of the original database needs to be maintained while preserving the security of your sensitive data assets.

To realize this outcome, our engineers and data scientists came up with a set of features that can reliably create a synthetic version that meets the criteria mentioned above. Before we dig into these features, let’s first take a bird’s-eye view of the synthetization process and discuss the challenges these features seek to address.

Introducing Context and Smart Select foreign keys

The synthetic data generation process works by learning the statistical patterns, distributions, and correlations across the tables that describe your data subjects (e.g., users, customers, and business partners) and the other tables in your database (e.g., products, orders, and payments). It will then use this information to create an entirely new set of fictional characters that faithfully represent the original population in the synthetic version of this database.

One of the key challenges here is that sensitive, personal data may be linked across multiple tables, and that these referring tables can also link to each other. MOSTLY AI solves the issue of privacy security and referential integrity by categorizing the relationships of complex databases into two types: Context foreign keys and Smart Select foreign keys.

When the synthetization process recreates the original data subjects as a new set of fictional characters, the Context foreign keys ensure that the data that belongs to these subjects is synthesized together. This preserves both the accuracy and the privacy security of the synthetic data.

However, once all tables have been synthesized, a number of relationships remain that are necessary for maintaining referential integrity, but which are completely broken if left unprocessed. The reason is that the synthetic versions now consist entirely of fictional characters and events that, on an individual level, bear no resemblance to the original data subjects.

MOSTLY AI solves this problem in the following way:
Prior to synthetization, it categorizes these relationships as Smart Select foreign keys. During the generation of the synthetic tables, it maps the characteristics of these relationships and then finds the appropriate matches between the entries in the synthetic referenced and referring tables. As a result, the referential integrity of the synthetic version of your database remains fully intact, and the data stays realistic and usable.
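What “broken” means here can be checked mechanically: a relationship is broken when a foreign key value in the referring table has no matching primary key in the referenced table. The sketch below is a generic referential integrity check and not part of MOSTLY AI; the function name `dangling_references`, the column names, and the sample rows are made up for illustration.

```python
def dangling_references(referring_rows, fk_column, referenced_rows, pk_column):
    """Return the referring rows whose foreign key matches no primary key."""
    primary_keys = {row[pk_column] for row in referenced_rows}
    return [row for row in referring_rows if row[fk_column] not in primary_keys]


# Synthetic orders received new, fictional IDs.
synthetic_orders = [{"order_id": 101}, {"order_id": 102}]

# Left unprocessed, payments carry IDs that do not exist among the synthetic orders.
unprocessed_payments = [
    {"payment_id": 1, "order_id": 7},
    {"payment_id": 2, "order_id": 9},
]
# After key assignment, every payment points at an existing synthetic order.
processed_payments = [
    {"payment_id": 1, "order_id": 102},
    {"payment_id": 2, "order_id": 101},
]

broken = dangling_references(unprocessed_payments, "order_id", synthetic_orders, "order_id")
intact = dangling_references(processed_payments, "order_id", synthetic_orders, "order_id")

print(len(broken))  # 2
print(len(intact))  # 0
```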

Introducing Subject tables, Linked tables, and Reference tables

The other challenge in synthesizing databases is to have a way to tell MOSTLY AI which tables contain the data subjects and the sensitive, personal data that relates to them. This is where Subject tables, Linked tables, and Reference tables come in.

When configuring a data catalog for a database, MOSTLY AI will only ask which of the database’s tables are subject tables. It will then automatically identify the remaining types by analyzing the database schema.

Subject tables contain the profiles of your data subjects — a set of attributes that say something about them. Here you can think of names, places of residence, email addresses, birthdates, and other types of privacy-sensitive information. With large and complex databases, these details can be scattered across multiple tables. These can include address change histories, personal interests, income, or average spend over time.

A subject table is the parent of these tables, with the key characteristic that each row describes a single, unique profile that refers to an entity whose privacy you explicitly want to protect.

Linked tables, on the other hand, refer to these subject tables. However, they’re not limited to a single row of information per subject. The details relating to a subject can span multiple rows and can describe events, such as website visits, orders, and payments.

MOSTLY AI will process these entries as properties of subjects. This means that they cannot exist without subjects, but subjects can have zero entries in a linked table. As such, each entry in a linked table needs to have a foreign key referencing its parent table.

And lastly, there are also Reference tables. These describe real-world entities that do not need privacy protection, such as product inventories or business partners. If you don’t classify them as subject tables and MOSTLY AI determines that they serve as lookup tables for other tables, then they won’t be synthesized but copied to the resulting synthetic database.
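The three table types can be pictured as the outcome of a simple rule applied to the schema once the subject tables are known. The following Python sketch is a deliberately simplified assumption about how such a rule could look, not MOSTLY AI’s actual schema analysis: the function `classify_tables`, the `"unclassified"` fallback, and the four-table schema are hypothetical.

```python
def classify_tables(foreign_keys, subject_tables):
    """Assign a role to each table from its foreign keys (simplified rule)."""
    referenced_somewhere = {
        target for fks in foreign_keys.values() for _, target in fks
    }
    roles = {}
    for table, fks in foreign_keys.items():
        targets = {target for _, target in fks}
        if table in subject_tables:
            roles[table] = "subject"       # chosen by the user
        elif targets & subject_tables:
            roles[table] = "linked"        # refers to a subject table
        elif table in referenced_somewhere:
            roles[table] = "reference"     # lookup table, copied as-is
        else:
            roles[table] = "unclassified"  # outside this simplified rule
    return roles


# Hypothetical schema: table -> list of (foreign key column, referenced table).
schema = {
    "Customers": [],
    "Orders": [("Customer ID", "Customers"), ("Product ID", "Products")],
    "Payments": [("Customer ID", "Customers"), ("Order ID", "Orders")],
    "Products": [],
}

roles = classify_tables(schema, {"Customers"})
print(roles)
# {'Customers': 'subject', 'Orders': 'linked', 'Payments': 'linked', 'Products': 'reference'}

# Classifying Products as a subject table as well changes only its own role.
roles_with_products = classify_tables(schema, {"Customers", "Products"})
print(roles_with_products["Products"])  # subject
```

The only input the user supplies is the set of subject tables; everything else follows from the foreign keys, which mirrors the configuration step described above.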

Applying these features to an example database

Now that we’ve touched on the features that make database synthetization possible, let’s apply them to a small, three-table database to show how they work in practice. The schema of this database is shown below.

Customers - Orders - Payments database schema

Identifying the subject tables

For MOSTLY AI to make the synthetic version of your database privacy secure, it needs to know which tables are the subject tables. It will then automatically identify the linked tables and reference tables and configure the relationships accordingly.

In our example Customers - Orders - Payments database, it’s fairly straightforward to see which table contains profile information of the data subjects. Here, we would classify the Customers table as the subject table. The Name, Surname, Gender, Country, and Birthdate fields clearly describe the real-world entities whose privacy we want to protect, and there are no other tables that contain profile information.

Of course, the Orders and Payments tables also contain sensitive information. But as they contain details about the entities in the subject table, MOSTLY AI will privacy-protect them in the context of being linked to the Customers table.


Classifying the right tables as subject tables can become trickier if you need to synthesize a database that contains more than one table with profile information. How do you know which of these are subject tables and which aren’t?

Let’s explore this scenario using a hypothetical Patients - Doctors - Treatments database. This database contains two tables with profiles that refer to real-world entities, namely Patients and Doctors. Here, you would then classify the Patients table as a subject table so that MOSTLY AI will protect their privacy by synthesizing them into fictional characters.

For the Doctors table, you would have two options. As it contains profile information, such as the doctors' names, specializations, activities, working hours, and places of work, you could consider classifying it as a subject table. The resulting synthetic database would have fictional patients visiting fictional doctors. Or, if it is deemed that the information about the doctors is public knowledge, you could choose not to. MOSTLY AI would then create a synthetic database where fictional patients visit real-world doctors.

Configuring the Smart Select foreign keys

Once we have classified the Customers table as a subject table, MOSTLY AI determines that the database contains two Context foreign key relationships, Customers - Orders and Customers - Payments, and one Smart Select foreign key relationship, Orders - Payments. The Context foreign key relationships are synthesized in the following way:

  1. First, the Customers table will be synthesized.

  2. Then, the Orders table will be synthesized with the synthetic Customers table as its context.

  3. Lastly, the Payments table will be synthesized, also with the synthetic Customers table as its context.

The resulting synthetic tables contain a fictional set of customers who placed fictional orders and made fictional payments, where the Customers - Orders and Customers - Payments relationships retained the original data’s statistical patterns, distributions, and correlations with the highest possible accuracy.
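The categorization and the three synthesis steps above can be sketched in a few lines of Python. This is an illustrative assumption about the logic, not MOSTLY AI’s implementation: the function `plan_synthesis`, the `Customer ID` column name, and the rule that each linked table has exactly one context are hypothetical simplifications.

```python
def plan_synthesis(foreign_keys, subject_tables):
    """Label each foreign key and list the tables in synthesis order.

    Simplification: each linked table is assumed to reference exactly one
    subject table, which becomes its context.
    """
    linked = {
        table
        for table, fks in foreign_keys.items()
        if table not in subject_tables
        and any(target in subject_tables for _, target in fks)
    }
    key_types = {}
    # Subject tables come first and have no context.
    order = [(table, None) for table in foreign_keys if table in subject_tables]
    for table, fks in foreign_keys.items():
        if table not in linked:
            continue
        for column, target in fks:
            if target in subject_tables:
                key_types[(table, column)] = "context"
                order.append((table, target))
            elif target in linked:
                key_types[(table, column)] = "smart select"
    return key_types, order


# The example database: table -> list of (foreign key column, referenced table).
schema = {
    "Customers": [],
    "Orders": [("Customer ID", "Customers")],
    "Payments": [("Customer ID", "Customers"), ("Order ID", "Orders")],
}

key_types, order = plan_synthesis(schema, {"Customers"})

for (table, column), kind in key_types.items():
    print(f"{table}.{column}: {kind}")
for table, context in order:
    print(f"synthesize {table} with context {context}")
```

In this sketch the Orders - Payments key is labeled but plays no part in the synthesis order, which matches the description: it is resolved only after all tables have been generated.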


But what about the Orders - Payments Smart Select foreign key relationship? To maintain the referential integrity of the original database, MOSTLY AI must somehow link the new, fictional entries in the Orders table to the new, fictional entries in the Payments table.


The underlying algorithm that generates this relationship achieves this by mapping the characteristics of the relationship between the original referenced and referring tables. Using those characteristics, it will then find the appropriate matches between the entries in the synthetic referenced and referring tables.

A good analogy for the Smart Select algorithm is the job of a hiring manager. Let’s say that you have a list of all vacant positions in your company and a list of potential candidates for these jobs. A hiring manager would look at the requirements and expectations for each of these jobs and match the appropriate candidates based on their abilities, skills, work experience, and interests. For example, someone with a computer science degree and ten years of experience in software development would be a suitable candidate for a position in engineering. Similarly, someone with an extensive blogging portfolio, SEO skills, and web analytics knowledge could make a great contributor to a marketing team.

The Smart Select algorithm evaluates the referenced and referring tables of the synthetic database in a similar way:

  1. First, you’ll need to help the algorithm a little and tell it where to find the "requirements and expectations" for matching the entries in the synthetic referenced and referring tables. Your original referenced and referring tables, however, won’t state these things explicitly. In the UI, you can specify which columns the algorithm can look at to learn the correlations between the attributes in the original referenced and referring tables. We recommend selecting attributes that are suitable for the purpose. Work experience, abilities, and skills would be very relevant when matching candidates to vacant positions, while place of birth, age, and gender wouldn’t be at all.

  2. Next, during the training of the synthetic data generation models, it will learn these correlations and use them to sketch an outline of how possible entries in the synthetic referenced and referring tables can be matched.

  3. Once the synthetic versions of your tables have been generated, this information is then used to select the appropriate entry in the synthetic referenced table for each entry in the synthetic referring table.

  4. And lastly, the Smart Select algorithm populates the referring table’s foreign key column with the appropriate primary keys.

To configure our Orders - Payments Smart Select foreign key relationship, let’s follow the four-step process above:

  1. First, we need to specify the Smart Select columns that the algorithm will use to learn the correlations between the attributes in the referenced and referring tables. In the case of the Orders - Payments relationship, these would be the Order date and Order status columns of the Orders table. Another possibility would be to include the Product ID column. Even though it’s a foreign key to a reference table, there’s a strong correlation between these IDs and the values in the Payments table’s Amount column.

  2. Next, during the training of the AI models that result in the synthetic versions of the Customers, Orders, and Payments tables, it will learn the correlations for which we specified the Smart Select columns.

  3. Once the synthetic versions of your tables have been generated, it will use the resulting mapping of the correlations to select a matching entry in the synthetic Orders table for each entry in the synthetic Payments table.

  4. Lastly, the result is then written to the Order ID foreign key column of the Payments table.

The result is a synthetic database that is privacy secure and accurate, and that maintains the referential integrity of the original.
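The four-step process can be illustrated with a toy stand-in for the matching logic. The Python sketch below is not the Smart Select algorithm itself, whose internals are not described here. It assumes Product ID is the chosen Smart Select column, learns only the mean payment amount per product from the original data, and, as a further assumption, restricts candidate orders to those of the same synthetic customer. All function names, column names, and sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def learn_amount_by_product(orig_orders, orig_payments):
    """Step 2 (toy): learn the mean payment amount per product."""
    product_of_order = {o["order_id"]: o["product_id"] for o in orig_orders}
    amounts = defaultdict(list)
    for payment in orig_payments:
        amounts[product_of_order[payment["order_id"]]].append(payment["amount"])
    return {product: mean(values) for product, values in amounts.items()}


def smart_select(syn_orders, syn_payments, amount_by_product):
    """Steps 3 and 4 (toy): pick a matching order and write its key."""
    orders_by_customer = defaultdict(list)
    for order in syn_orders:
        orders_by_customer[order["customer_id"]].append(order)

    def distance(order, payment):
        # Products without learned amounts are treated as the worst match.
        expected = amount_by_product.get(order["product_id"], float("inf"))
        return abs(expected - payment["amount"])

    linked = []
    for payment in syn_payments:
        candidates = orders_by_customer[payment["customer_id"]]
        best = min(candidates, key=lambda o: distance(o, payment), default=None)
        order_id = best["order_id"] if best is not None else None
        linked.append({**payment, "order_id": order_id})
    return linked


# Step 1: Product ID is the Smart Select column (encoded in the functions above).
orig_orders = [{"order_id": 1, "product_id": "A"}, {"order_id": 2, "product_id": "B"}]
orig_payments = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 1, "amount": 12.0},
    {"order_id": 2, "amount": 100.0},
]
syn_orders = [
    {"order_id": 101, "customer_id": 1, "product_id": "A"},
    {"order_id": 102, "customer_id": 1, "product_id": "B"},
    {"order_id": 103, "customer_id": 2, "product_id": "B"},
]
syn_payments = [
    {"payment_id": 1, "customer_id": 1, "amount": 95.0},
    {"payment_id": 2, "customer_id": 1, "amount": 9.5},
    {"payment_id": 3, "customer_id": 2, "amount": 80.0},
]

amount_by_product = learn_amount_by_product(orig_orders, orig_payments)
linked_payments = smart_select(syn_orders, syn_payments, amount_by_product)
print([p["order_id"] for p in linked_payments])  # [102, 101, 103]
```

Every synthetic payment ends up with a foreign key that exists in the synthetic Orders table, and the large payment is matched to the order for the expensive product, which is the kind of realistic pairing the selected columns are meant to enable.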