Tabular synthetic data
MOSTLY AI generates tabular synthetic data that protects the privacy of data subjects and is highly accurate. It maintains referential integrity and retains the correlations between columns and between tables, to the extent that the table relationship schema allows.
Scenarios
MOSTLY AI supports single-table, two-table, and multi-table synthetic data generation. In all scenarios, MOSTLY AI provides privacy protection and high accuracy.
In multi-table scenarios, MOSTLY AI also maintains the referential integrity of the original data and aims to retain the correlations between table columns and among tables.
For a review of the supported table generation scenarios, see the following sections.
Single-table synthetic data
MOSTLY AI generates single-table synthetic data that retains all correlations between the columns in the original data.
Two-table synthetic data
In a two-table scenario, you have a subject table and a linked table.
To generate synthetic data in such a scenario, you define a Context foreign key relationship from the linked table to the subject table. MOSTLY AI can then generate two-table synthetic data that maintains the referential integrity of the original two tables and retains the correlations within each table and between the two tables.
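Referential integrity here means that every foreign key value in the linked table points to an existing record in the subject table. The following is a minimal sketch of such a check in plain Python. It is not part of MOSTLY AI; the `customers` and `orders` tables, their columns, and the `orphaned_rows` helper are hypothetical names chosen for illustration.

```python
# Hypothetical toy data: `customers` is the subject table, `orders` is the linked table.
customers = [
    {"id": 1, "country": "AT"},
    {"id": 2, "country": "DE"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 1, "amount": 40.0},
    {"order_id": 12, "customer_id": 2, "amount": 15.5},
]


def orphaned_rows(linked, subject, fk_column, pk_column):
    """Return the linked rows whose foreign key matches no subject primary key."""
    subject_keys = {row[pk_column] for row in subject}
    return [row for row in linked if row[fk_column] not in subject_keys]


# An empty list means that referential integrity holds for this pair of tables.
orphans = orphaned_rows(orders, customers, fk_column="customer_id", pk_column="id")
```

You could run the same check on both the original and the generated tables to confirm that no linked record refers to a missing subject.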
Multi-table synthetic data
In a multi-table scenario, you can generate data that spans three or more tables with relationships between them.
In such a scenario, the relationships between the tables can be configured in several different ways, referred to as relationship schemas. In all cases, MOSTLY AI maintains the referential integrity. Depending on the exact relationship schema, the correlations between specific tables can be fully retained, partially retained, or not retained at all.
For a single table, a two-table schema with a subject table and a linked table, and star schemas with three or more tables, MOSTLY AI can generate synthetic data that fully retains the correlations from your original data.
For nested schemas, graph schemas with non-hierarchical relationships, or graph schemas with self-referencing relationships, MOSTLY AI synthetic data can retain some of the correlations from your original data on a "best-effort" basis, while others cannot be retained.
Before you learn about the details of retaining the correlations in different database schemas, learn about the main concepts of synthetic data when you use MOSTLY AI.
Table types
MOSTLY AI categorizes tables into types to aid the configuration of table relationships: subject tables, linked tables, and, in catalogs created before v110, reference tables.
When you add tables to a new synthetic dataset in MOSTLY AI, all tables are added as subject tables. You can then configure the table relationships and MOSTLY AI distinguishes between subject and linked tables.
Subject tables
Subject tables contain one record per subject. It is the privacy of those data subjects (people, companies, or any other entities) that you aim to protect when you generate synthetic data.
Examples of subject tables can be users, customers, business_partners, providers, and so on.
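Because a subject table is expected to contain one record per subject, you can check a candidate table for repeated subject identifiers before you configure relationships. The following is a minimal sketch in plain Python, not a MOSTLY AI feature; the `users` rows, the `id` column, and the `duplicate_subject_ids` helper are hypothetical.

```python
from collections import Counter

# Hypothetical candidate subject table. The subject with id 2 appears twice.
users = [
    {"id": 1, "name": "Ana"},
    {"id": 2, "name": "Ben"},
    {"id": 2, "name": "Ben"},
    {"id": 3, "name": "Caro"},
]


def duplicate_subject_ids(rows, pk_column):
    """Return the identifiers that occur in more than one row, sorted."""
    counts = Counter(row[pk_column] for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


duplicates = duplicate_subject_ids(users, "id")
```

A non-empty result suggests that the table holds several records per subject and might fit better as a linked table.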
Linked tables
You can think of linked tables as events and time-series data or tables in which each record represents a specific event or a point in time. You define a table as a linked table when you set a Context foreign key from it to a subject table.
Examples of such tables can be order, purchase, events, and so on.
Reference tables
Before v110, MOSTLY AI categorized tables that contained entities with non-private information as reference tables.
Examples of such tables were country, region, district, and so on.
From v110, MOSTLY AI no longer recognizes such tables as reference tables. Any tables that you add to a new synthetic dataset always begin as subject tables.
If you created catalogs with MOSTLY AI v109 or earlier, you might still have reference tables.
However, you can no longer delete reference tables or edit their settings. If you still have reference tables in older catalogs, bear in mind that MOSTLY AI does not copy reference tables to the generated synthetic dataset.
Relationship types
MOSTLY AI supports two types of foreign keys.
Context foreign key
You add a Context foreign key to a table to define the table as linked to the subject table to which the Context foreign key points. With a Context foreign key, MOSTLY AI can fully retain the correlations between the subject and linked tables. This means that any correlations that exist within each table and between the columns of both tables are fully retained in the synthetic data.
Also, Context foreign keys guarantee the referential integrity between the linked and subject tables.
You can have only one Context foreign key per linked table. If you need multiple foreign keys per linked table, use a Context foreign key for the subject table whose correlations you want to fully retain in the synthetic data, and use Smart Select foreign keys for all other tables.
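The one-Context-key rule can be expressed as a simple check over a list of relationships. The following sketch uses a hypothetical in-memory description of foreign keys; the dictionary layout, the `type` values, and the `tables_with_multiple_context_keys` helper are assumptions made for illustration and are not MOSTLY AI's configuration format.

```python
from collections import Counter

# Hypothetical relationship list: `logins` has one Context foreign key to
# `customers` and one Smart Select foreign key to `accounts`.
foreign_keys = [
    {"table": "logins", "column": "customer_id",
     "references": "customers", "type": "context"},
    {"table": "logins", "column": "account_id",
     "references": "accounts", "type": "smart_select"},
]


def tables_with_multiple_context_keys(fks):
    """Return the tables that declare more than one Context foreign key."""
    counts = Counter(fk["table"] for fk in fks if fk["type"] == "context")
    return sorted(table for table, n in counts.items() if n > 1)


violations = tables_with_multiple_context_keys(foreign_keys)
```

In this sketch, an empty `violations` list means that every linked table has at most one Context foreign key.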
Smart Select foreign key
Use Smart Select foreign keys to configure Smart Select relationships between two tables. A Smart Select relationship aims to retain the correlations between the two tables on a "best-effort" basis, but it cannot achieve full retention.
To give Smart Select a more meaningful way to retain the correlations between two tables, you can define columns from the referenced table as Smart Select columns and rank them by importance.
A Smart Select relationship also helps to fully retain the referential integrity between the two tables.
Reference foreign key
Before v110, reference foreign keys established the relationship between reference tables and another table in a new synthetic dataset.
From v110, reference tables and reference foreign keys are no longer available.
Relationship schemas
The sections below review the table setups that you might have in your original data and how MOSTLY AI creates synthetic data in each setup. They also describe how privacy, correlations, and referential integrity are retained in each scenario.
Each setup is illustrated with a diagram. The legend below shows the elements that are part of each diagram.
Single subject table
In this scenario, you have a single subject table. The table might have the attributes listed in the table below.
In this scenario, MOSTLY AI fully retains the correlations between columns in the subject table.
Two tables
Referential integrity is preserved
Correlations between all columns are retained
Correlations between events of the same customer are retained
Privacy of customers is protected
Three tables - Star schema
Referential integrity is preserved
Correlations between customers and clicks are retained
Correlations between customers and visits are retained
Correlations between clicks and visits are retained
Correlations between all clicks of the same customer are retained
Correlations between all visits of the same customer are retained
Correlations between events of the same customer are retained
Privacy of customers is protected
Three tables - Nested schema
Referential integrity is preserved
Correlations between customers and accounts and between accounts and transactions are retained
Correlations between customers and transactions are retained
Correlations between all accounts of the same customer are retained
Correlations between all transactions of the same customer are retained
Correlations between transactions across accounts are NOT retained
Privacy of all entities is protected
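In a nested schema, a transaction relates to a customer only indirectly, through the account that it belongs to. The following plain-Python sketch shows that chain on hypothetical toy data; the table contents, column names, and the `transactions_by_customer` helper are assumptions for illustration and do not reflect how MOSTLY AI processes the tables internally.

```python
# Hypothetical nested schema: customers -> accounts -> transactions.
customers = [{"id": 1}, {"id": 2}]
accounts = [
    {"id": 100, "customer_id": 1},
    {"id": 101, "customer_id": 1},
    {"id": 102, "customer_id": 2},
]
transactions = [
    {"id": 1000, "account_id": 100, "amount": 20.0},
    {"id": 1001, "account_id": 101, "amount": 35.0},
    {"id": 1002, "account_id": 102, "amount": 12.5},
]


def transactions_by_customer(accounts, transactions):
    """Group transaction ids by the customer who owns the transaction's account."""
    account_owner = {a["id"]: a["customer_id"] for a in accounts}
    grouped = {}
    for t in transactions:
        owner = account_owner[t["account_id"]]
        grouped.setdefault(owner, []).append(t["id"])
    return grouped


by_customer = transactions_by_customer(accounts, transactions)
```

Here customer 1 owns transactions in two different accounts (100 and 101), which makes it a concrete case of transactions that are spread across accounts.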
Graph schemas - Non-hierarchical relations
Referential integrity is preserved
Correlations between customers and logins are retained
Correlations between all logins of the same customer are retained
Correlations between logins and accounts are NOT retained. Smart Select columns can help to improve the correlations.
Correlations between logins to the same account are NOT retained. In particular, the cardinality of the relationship is NOT retained.
Privacy of all entities is protected
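The cardinality of a relationship describes how many child records refer to each parent record, for example how many logins exist per account. The following is a minimal sketch of how you might measure it in plain Python so that you can compare original and synthetic data yourself. The `logins` rows and the `cardinality_distribution` helper are hypothetical, and parents without any child records are not counted.

```python
from collections import Counter

# Hypothetical logins table with a foreign key to customers and one to accounts.
logins = [
    {"id": 1, "customer_id": 1, "account_id": 100},
    {"id": 2, "customer_id": 1, "account_id": 100},
    {"id": 3, "customer_id": 2, "account_id": 100},
    {"id": 4, "customer_id": 2, "account_id": 101},
]


def cardinality_distribution(rows, fk_column):
    """Map 'child rows per parent' to 'number of parents with that many rows'."""
    per_parent = Counter(row[fk_column] for row in rows)
    return dict(Counter(per_parent.values()))


logins_per_account = cardinality_distribution(logins, "account_id")
logins_per_customer = cardinality_distribution(logins, "customer_id")
```

In this toy data, one account has three logins and one account has one login, while both customers have two logins each.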
Graph schemas - Self-relation
Referential integrity is preserved
Correlations between the columns of the customers table are retained
Correlations between customers and their mothers and fathers are NOT retained. Smart Select columns can help to improve the correlations.
Correlations between customers that belong to the same mother or to the same father are NOT retained.
Privacy of customers is protected
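In a self-referencing table, referential integrity means that every non-empty parent reference points to a record in the same table. The following plain-Python sketch checks that property on hypothetical data; the `mother_id` and `father_id` columns and the `dangling_self_references` helper are assumptions for illustration only.

```python
# Hypothetical self-referencing table: mother_id and father_id point back to
# the id column of the same table. None means that no parent is recorded.
customers = [
    {"id": 1, "mother_id": None, "father_id": None},
    {"id": 2, "mother_id": None, "father_id": None},
    {"id": 3, "mother_id": 1, "father_id": 2},
]


def dangling_self_references(rows, pk_column, fk_columns):
    """Return (row id, column) pairs whose reference matches no row in the table."""
    keys = {row[pk_column] for row in rows}
    return [
        (row[pk_column], column)
        for row in rows
        for column in fk_columns
        if row[column] is not None and row[column] not in keys
    ]


dangling = dangling_self_references(customers, "id", ["mother_id", "father_id"])
```

An empty `dangling` list means that all mother and father references resolve to existing customers.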