Once MOSTLY AI has analyzed the database’s schema, it will ask you to classify which of its tables are subject tables and rank them in order of importance. This is an important step, as it determines which tables contain privacy-sensitive information and how accurately their relationships can be rendered with respect to referential integrity.
With large and complex databases, privacy-sensitive information can be scattered across multiple tables. Therefore, MOSTLY AI needs to protect not only the tables that contain the profiles of data subjects, but also the tables that reference them. These are referred to as subject tables and linked tables, respectively.
As MOSTLY AI cannot infer from the database schema which tables are subject tables, you’ll need to manually classify them in this step. MOSTLY AI can then automatically find and privacy-protect the linked tables, as they contain references to the entities in the subject tables.
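As a rough illustration of the relationship between subject tables and linked tables, the sketch below finds the tables that hold a reference to a manually classified subject table. The schema mapping, the table names, and the `find_linked_tables` helper are hypothetical and are not part of MOSTLY AI. Only direct references are considered here, which is a simplifying assumption.

```python
from typing import Dict, Set


def find_linked_tables(
    foreign_keys: Dict[str, Set[str]], subject_tables: Set[str]
) -> Set[str]:
    """Return the non-subject tables that directly reference a subject table."""
    linked = set()
    for table, referenced in foreign_keys.items():
        if table in subject_tables:
            continue  # subject tables are classified by hand, not inferred
        if referenced & subject_tables:
            linked.add(table)
    return linked


# Hypothetical schema: table name -> tables its foreign keys point to.
schema = {
    "customers": {"countries"},
    "orders": {"customers"},
    "reviews": {"customers"},
    "countries": set(),
}

linked = find_linked_tables(schema, subject_tables={"customers"})
print(sorted(linked))
```

In this made-up schema, `orders` and `reviews` reference `customers` and would therefore need protection, while `countries` is only referenced by the subject table and holds no reference to it.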
You can identify subject tables by the type of attributes they have: these describe profile information about the data subjects in your database, such as names, places of residence, email addresses, birthdates, and other types of privacy-sensitive information.
Subject tables have the key characteristic that each row describes a single, unique profile that refers to an entity whose privacy you explicitly want to protect. To ensure that MOSTLY AI’s privacy protection will work reliably, these tables also need to comply with the following requirements:
We recommend that subject tables have more than 5000 subjects.
The minimum size is 100 subjects. However, the more subjects there are available, the better the training algorithm can generalize their features, which results in a decreased privacy risk.
The maximum number of subjects and columns is determined by your license.
Each subject must refer to a distinct real world entity.
Each row describes one subject.
Each row can be treated independently.
The rows' order carries no information, and the contents of one row do not affect other rows.
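Some of the requirements above can be checked before you upload a table. The following is a minimal sketch, assuming the table has been loaded as a list of dictionaries and that a single column (here the hypothetical `customer_id`) identifies a subject. The `check_subject_table` helper is illustrative only. It cannot verify that each subject is a distinct real-world entity or that rows are independent, so those points still need a manual review.

```python
from typing import Dict, List, Sequence

MIN_SUBJECTS = 100           # minimum size stated in the requirements
RECOMMENDED_SUBJECTS = 5000  # more than this is recommended


def check_subject_table(rows: Sequence[Dict[str, object]], key: str) -> List[str]:
    """Return human-readable findings; an empty list means no issue was found."""
    findings = []
    n = len(rows)
    if n < MIN_SUBJECTS:
        findings.append(f"only {n} subjects; the minimum is {MIN_SUBJECTS}")
    elif n <= RECOMMENDED_SUBJECTS:
        findings.append(f"{n} subjects; more than {RECOMMENDED_SUBJECTS} is recommended")
    # Duplicate key values suggest that one subject is spread over several rows.
    if len({row[key] for row in rows}) != n:
        findings.append(f"column '{key}' has duplicate values")
    return findings


# Hypothetical data: 150 customers plus one duplicated subject.
customers = [{"customer_id": i, "city": "Vienna"} for i in range(150)]
customers.append({"customer_id": 0, "city": "Graz"})

for finding in check_subject_table(customers, key="customer_id"):
    print(finding)
```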
Please ensure that the column names do not contain any privacy-sensitive information.
Avoid column names such as vendor_b_purchases, etc. Not only would vendor names already appear in the metadata, but they could also slip through rare category protection (e.g., there’s a vendor_a column, but this vendor only appeared five times in the whole dataset). You can solve this problem by simply having a vendor column with the vendor names in it.
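One way to move vendor names out of the column names is to reshape the data before uploading it. The sketch below is an assumption-laden example: the `vendor_<name>_purchases` naming pattern, the `customer_id` column, and the `to_vendor_rows` helper are hypothetical, and your own column layout may call for a different transformation.

```python
from typing import Dict, List

PREFIX, SUFFIX = "vendor_", "_purchases"  # assumed naming pattern


def is_vendor_column(name: str) -> bool:
    """True for columns such as vendor_a_purchases that embed a vendor name."""
    return (
        name.startswith(PREFIX)
        and name.endswith(SUFFIX)
        and len(name) > len(PREFIX) + len(SUFFIX)
    )


def to_vendor_rows(rows: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Replace vendor_<name>_purchases columns with vendor and purchases columns."""
    long_rows = []
    for row in rows:
        # Columns that do not embed a vendor name are carried over unchanged.
        shared = {c: v for c, v in row.items() if not is_vendor_column(c)}
        for column, value in row.items():
            if is_vendor_column(column):
                vendor = column[len(PREFIX):-len(SUFFIX)]
                long_rows.append({**shared, "vendor": vendor, "purchases": value})
    return long_rows


wide = [
    {"customer_id": 1, "vendor_a_purchases": 3, "vendor_b_purchases": 0},
    {"customer_id": 2, "vendor_a_purchases": 0, "vendor_b_purchases": 7},
]
long_rows = to_vendor_rows(wide)
print(long_rows[0])
```

After the reshape, vendor names are ordinary values in the vendor column rather than part of the metadata. Note that the result has several rows per customer, so it would fit a table that references the subject table rather than the subject table itself, which requires one row per subject.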
Please follow the steps below to classify and rank your database’s subject tables.
Classify tables that contain profile information about data subjects as subject tables by dragging them from Database contents to the Subject tables pane. This will privacy-protect them and their linked tables, which are tables that contain references to these subjects.
When classifying subject tables, please keep in mind that personal or otherwise sensitive information can be scattered across referenced and referring tables. If the table you want to classify as a subject table has a parent that more immediately describes the entity whose privacy you want to protect, then choose that table instead.
Next, rank the subject tables by their order of importance in the database.
Drag the tables in the Subject tables pane up or down to do so.
MOSTLY AI will use this ranking to determine how referential integrity will be maintained when a linked table references multiple subject tables in the database. The relationship to the subject table that is ranked higher will take precedence over those further down the list, resulting in a more accurate rendition of that relationship.
Lastly, select the remaining tables you want to include in the synthetic version of the database.
In the next step, you will see which of these tables have been classified as linked tables or reference tables.
Use the Tables checkbox to select all tables.
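The precedence rule from the ranking step can be pictured with a small sketch. It only illustrates the rule as described above, namely that the higher-ranked subject table wins when a linked table references several of them. The table names and the `primary_subject` helper are hypothetical, and the sketch makes no claim about how MOSTLY AI implements this internally.

```python
from typing import List, Optional, Set


def primary_subject(references: Set[str], ranking: List[str]) -> Optional[str]:
    """Pick the highest-ranked subject table among those a linked table references."""
    for table in ranking:  # ranking[0] is the most important subject table
        if table in references:
            return table
    return None  # the table references no subject table


# Hypothetical ranking, top to bottom as in the Subject tables pane.
ranking = ["customers", "merchants"]

# Hypothetical linked tables and the subject tables they reference.
linked_tables = {
    "transactions": {"customers", "merchants"},
    "stores": {"merchants"},
}

precedence = {
    name: primary_subject(refs, ranking) for name, refs in linked_tables.items()
}
print(precedence)
```

Here `transactions` references both subject tables, so its relationship to `customers` would take precedence. Swapping the two entries in `ranking` would favor `merchants` instead.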