Ad hoc job |
A type of synthetization job that processes Parquet and CSV files directly uploaded by the user. |
Catalog |
A document that describes the structure of a data source’s contents: how many tables there are, how they are related, which columns they have, and how they should be synthesized. When you create a catalog, MOSTLY AI connects to the data source you specified, extracts information about the data inside, and sets it up for repeatable synthetic data generation. Using catalogs for synthetic data generation has the following benefits:
|
Connector |
An object that connects to data sources and contains the authentication details needed to do so. |
Context relationship |
When synthesizing data, MOSTLY AI must consider privacy security, synthetic data accuracy, and referential integrity. To do so, MOSTLY AI classifies your data’s relationships into Context and Smart Select relationships. Context relationships are those that are critical for privacy security and synthetic data accuracy. Their primary and foreign keys are generated together with the subject and linked tables. Relationships that are only necessary for maintaining referential integrity are classified as Smart Select relationships, and MOSTLY AI generates them after the subject and linked tables are generated. |
Data augmentation |
The process of creating a synthetic version of a dataset that contains more entries than the original. |
Data subjects |
The entities whose privacy you want to protect. |
Extreme values |
Extreme values are the smallest and largest values in a distribution. |
Generate more data |
A feature that lets you generate more data from a completed synthetization job. You can either specify how many data subjects you want to generate or, if the job synthesized a subject table and a linked table, upload another subject table to generate its linked table counterpart, as long as the uploaded table’s columns and data types are identical to those of the original subject table. |
Generate with seed |
A job type that reuses an already trained AI model, using a subject table as a seed. |
Generate with subject count |
A job type that reuses an already trained AI model to generate a specified number of subjects. |
Generated subjects |
The subjects that will appear in the synthetic data. |
Linked tables |
A table with one or more foreign keys that refer to subject tables or other linked tables. It commonly contains lists, sequences, or time-series properties of the subjects whose privacy you want to protect. In data warehouses, linked tables can be analogous to facts, while subject tables can be analogous to dimensions or slowly changing dimensions. |
Mock data |
MOSTLY AI allows you to generate mock data instead of AI-powered synthetic data. The difference between the two is that mock data is not modeled after the original data. Instead, it’s random data that is generated within the constraints of a data type and format you specify. You can use it to generate test data, particularly if you need strings with a consistent pattern, such as phone numbers, license plate numbers, company IDs, transaction IDs, and social security IDs. |
Original data |
The input or source data used for synthetization. |
Programmable data |
Data that you can manipulate and control when generating synthetic data from original data. During this process, the statistical features of the data can be rebalanced, imputed, or have a stricter or looser adherence to the detected distributions and correlations. This allows you to improve downstream ML model performance, simulate what-if scenarios, or generate test data that has an improved ability to reveal defects in your software. It can also be useful for analytical purposes and data exploration. |
Rare category |
A category that rarely occurs in a column with the categorical encoding type and therefore poses a re-identification risk for data subjects. |
Reference table |
A table that does not contain data subjects (that must be privacy-protected) or information about them. MOSTLY AI processes data from Reference tables based on relationships with other tables. Reference tables are not copied to the synthetic dataset. This helps to prevent any potential data leaks. |
Reference relationship |
A relationship that references a reference table. |
Smart Select relationship |
When synthesizing databases, MOSTLY AI must consider privacy security, synthetic data accuracy, and referential integrity. To do so, MOSTLY AI analyzes the schema and classifies the relationships into Context and Smart Select relationships. Smart Select relationships are necessary for maintaining the database’s referential integrity. These relationships will be generated after the subject and linked tables are generated. By default, the entries in the referenced and referring tables of a Smart Select relationship will be randomly linked. That is to say, the foreign key column will be populated with randomly drawn IDs from the primary key. You can change this behavior by designating one or more columns of the relationship’s referenced table as Smart Select columns. MOSTLY AI can then use these attributes to find appropriate matches with the entries in the referring table of a relationship. This will result in a more accurate rendering of these relationships in the resulting synthetic database. |
Subject table |
A table that contains profile information of the subjects whose privacy you want to protect. Its attributes can describe a subject’s name, gender, height, place of residence, income, etc. In data warehouses, subject tables can be analogous to dimensions or slowly changing dimensions, while linked tables can be analogous to facts. |
Synthetic data |
Artificially generated information that can be used in place of real historical data. |
Training subjects |
The data subjects and their linked data that will be used for training the AI model used for synthetic data generation. |
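
The default behavior described under Smart Select relationship, where the foreign key column is populated with IDs drawn at random from the primary key, can be illustrated with a minimal sketch. This is not MOSTLY AI's implementation: the function `link_randomly`, its parameters, and the sample IDs are hypothetical names chosen for illustration only.

```python
import random


def link_randomly(referenced_ids, n_referring_rows, seed=None):
    """Fill a foreign key column with IDs drawn at random from the primary key.

    Hypothetical helper: it only illustrates the idea of random linking and
    ignores any Smart Select columns.
    """
    rng = random.Random(seed)
    return [rng.choice(referenced_ids) for _ in range(n_referring_rows)]


# Hypothetical primary keys of the referenced table.
store_ids = [101, 102, 103]

# Foreign key values for five rows of the referring table.
foreign_keys = link_randomly(store_ids, n_referring_rows=5, seed=0)
print(foreign_keys)
```

Every generated foreign key points at an existing primary key, so referential integrity holds, but the pairing carries no information about which entries belong together. Designating Smart Select columns is what allows more appropriate matches to be found.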