Ad hoc job

A type of synthetization job that processes Parquet and CSV files directly uploaded by the user.


Catalog

A document that describes the structure of a data source’s contents: how many tables there are, how they are related, which columns they have, and how they should be synthesized.

When you create a catalog, MOSTLY AI connects to the data source you specified, extracts information about the data inside, and sets it up for repeatable synthetic data generation.

Using catalogs for synthetic data generation has the following benefits:

  • They simplify the management and reuse of synthetization settings.

  • They allow the contents of the data source to change, so you can share the latest, up-to-date synthetic copies of that data source.

  • They can be fully automated and integrated into your data pipelines using MOSTLY AI’s API.


Connector

An object that connects to data sources and contains the authentication details needed to do so.


Context relationship

When synthesizing data, MOSTLY AI must consider privacy security, synthetic data accuracy, and referential integrity. To do so, MOSTLY AI classifies your data’s relationships into Context and Smart Select relationships.

Context relationships are those that are critical for privacy security and synthetic data accuracy. The relationship’s primary and foreign keys are generated during the generation of subject and linked tables.

There are also relationships that are necessary for maintaining referential integrity. MOSTLY AI classifies these as Smart Select relationships and generates them after the subject and linked tables are generated.


Data augmentation

The process of creating a synthetic version of a dataset that contains more entries than the original.


Data subjects

The entities whose privacy you want to protect.


Extreme values

Extreme values are the smallest and largest values in a distribution. They can occur in columns with the Numeric, Datetime, ITT, Latitude, and Longitude encoding types, and also in the number of linked records belonging to each data subject, that is, the length of their lists, sequences, or time-series data.
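
As a rough illustration of the two places extreme values can appear, the sketch below finds the smallest and largest values in a numeric column and in the number of linked records per subject. The data and the names `incomes` and `purchase_subject_ids` are hypothetical, and this is not how MOSTLY AI detects or handles extreme values internally.

```python
from collections import Counter


def extreme_values(values):
    """Return the smallest and largest value of a non-empty list."""
    return min(values), max(values)


# Hypothetical toy data: a numeric column from a subject table and the
# foreign key column of a linked table (one entry per linked record).
incomes = [32000, 41000, 38500, 950000, 29000]
purchase_subject_ids = [1, 1, 2, 3, 3, 3, 3, 3, 3, 5]

# Extreme values of the numeric column.
income_extremes = extreme_values(incomes)

# Extreme values of the sequence length, counted only for subjects
# that have at least one linked record.
sequence_lengths = Counter(purchase_subject_ids)
length_extremes = extreme_values(list(sequence_lengths.values()))
```

Here the income of 950000 and the subject with six linked records are the values that stand out from the rest of the data.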


Generate more data

This is a feature that lets you generate more data from a completed synthetization job. You can do so by specifying how many data subjects you want to generate. Alternatively, if the job synthesized a subject table and a linked table, you can upload another subject table to generate its linked table counterpart, as long as the uploaded table’s columns and data types are identical to those of the original subject table.


Generate with seed

A job type that reuses an already trained AI model using a subject table as a seed.


Generate with subject count

A job type that reuses an already trained AI model to generate a specified number of subjects.


Generated subjects

The subjects that will appear in the synthetic data.


Linked tables

A table with one or more foreign keys that refer to subject tables or other linked tables. Linked tables commonly contain lists, sequences, or time-series properties of the subjects whose privacy you want to protect.

In data warehouses, they can be analogous to the concept of facts while subject tables can be analogous to dimensions or slowly-changing dimensions.
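
To make the relationship between the two table types concrete, here is a minimal sketch in which a linked table refers to a subject table through a foreign key, and the linked rows are grouped into one sequence per subject. The table contents and the column name `subject_id` are assumptions made for illustration only.

```python
# Hypothetical subject table: one row per data subject.
subjects = [
    {"id": 1, "name": "Ana"},
    {"id": 2, "name": "Ben"},
]

# Hypothetical linked table: each row points to a subject via `subject_id`.
transactions = [
    {"id": 10, "subject_id": 1, "amount": 25.0},
    {"id": 11, "subject_id": 1, "amount": 7.5},
    {"id": 12, "subject_id": 2, "amount": 40.0},
]


def group_by_subject(subject_rows, linked_rows, foreign_key):
    """Collect the linked rows of each subject, preserving row order.

    Raises KeyError if a linked row refers to a subject that does not exist.
    """
    sequences = {row["id"]: [] for row in subject_rows}
    for row in linked_rows:
        sequences[row[foreign_key]].append(row)
    return sequences


sequences = group_by_subject(subjects, transactions, "subject_id")
amounts_for_ana = [t["amount"] for t in sequences[1]]
```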


Mock data

MOSTLY AI allows you to generate mock data instead of AI-powered synthetic data. The difference between the two is that mock data is not modeled after the original data. Instead, it’s random data that is generated within the constraints of a data type and format you specify.

You can use it to generate test data, particularly if you need strings with a consistent pattern, such as phone numbers, license plate numbers, company IDs, transaction IDs, and social security IDs.
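
The idea of random data constrained by a format can be sketched as follows. The pattern syntax used here (`#` for a digit, `A` for an uppercase letter, everything else copied literally) is a hypothetical convention chosen for this example and is not MOSTLY AI’s format specification.

```python
import random
import string


def mock_value(pattern, rng):
    """Fill a pattern with random characters.

    '#' becomes a random digit, 'A' a random uppercase letter;
    any other character is copied unchanged.
    """
    out = []
    for ch in pattern:
        if ch == "#":
            out.append(rng.choice(string.digits))
        elif ch == "A":
            out.append(rng.choice(string.ascii_uppercase))
        else:
            out.append(ch)
    return "".join(out)


rng = random.Random(0)  # seeded so repeated runs give the same values
phone_numbers = [mock_value("+1-###-###-####", rng) for _ in range(3)]
license_plates = [mock_value("AA-####", rng) for _ in range(3)]
```

Every generated value follows the requested pattern, but none of it is derived from the original data.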


Original data

The input or source data used for synthetization.


Programmable data

Data that you can manipulate and control when generating synthetic data from original data. During this process, the statistical features of the data can be rebalanced, imputed, or have a stricter or looser adherence to the detected distributions and correlations. This allows you to improve downstream ML model performance, simulate what-if scenarios, or generate test data that has an improved ability to reveal defects in your software. It can also be useful for analytical purposes and data exploration.


Rare category

A category that rarely occurs in a column with the Categorical encoding type and therefore poses a re-identification risk for data subjects.
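
A simple way to picture rare categories is to count how often each category occurs and flag those below a threshold. The threshold `min_count` and the sample data are assumptions for this sketch. The criteria MOSTLY AI actually applies are not described here.

```python
from collections import Counter


def rare_categories(values, min_count):
    """Return the categories that occur fewer than `min_count` times."""
    counts = Counter(values)
    return sorted(cat for cat, n in counts.items() if n < min_count)


# Hypothetical categorical column.
occupations = ["nurse"] * 6 + ["teacher"] * 5 + ["astronaut"] + ["falconer"] * 2

rare = rare_categories(occupations, min_count=3)
```

With this data, `astronaut` and `falconer` are flagged: so few subjects share these values that seeing one in the output could point back to an individual.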


Reference table

A table that does not contain data subjects (that must be privacy-protected) or information about them.

MOSTLY AI processes data from reference tables based on their relationships with other tables. Reference tables are not copied to the synthetic dataset, which helps to prevent potential data leaks.


Reference relationship

A relationship that references a reference table.


Smart Select relationship

When synthesizing databases, MOSTLY AI must consider privacy security, synthetic data accuracy, and referential integrity. To do so, MOSTLY AI analyzes the schema and classifies the relationships into Context and Smart Select relationships.

Smart Select relationships are necessary for maintaining the database’s referential integrity. These relationships will be generated after the subject and linked tables are generated.

By default, the entries in the referenced and referring tables of a Smart Select relationship will be randomly linked. That is to say, the foreign key column will be populated with randomly drawn IDs from the primary key.

You can change this behavior by designating one or more columns of the relationship’s referenced table as Smart Select columns. MOSTLY AI can then use these attributes to find appropriate matches with the entries in the referring table of a relationship. This will result in a more accurate rendering of these relationships in the resulting synthetic database.
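
The default behavior described above, where foreign keys are filled with randomly drawn primary key values, can be sketched as follows. The function and variable names are hypothetical, and the sketch does not cover matching on Smart Select columns.

```python
import random


def link_randomly(referenced_ids, n_referring_rows, rng):
    """Draw one referenced primary key at random for each referring row."""
    return [rng.choice(referenced_ids) for _ in range(n_referring_rows)]


rng = random.Random(42)  # seeded for reproducibility
store_ids = [101, 102, 103]  # primary keys of the referenced table
order_store_fk = link_randomly(store_ids, 8, rng)  # foreign key column
```

Every foreign key value points to an existing primary key, so referential integrity holds, but the pairing carries no information from the original data.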


Subject table

A table that contains profile information of the subjects whose privacy you want to protect. Its attributes can describe their name, gender, height, place of residence, income, etc.

In data warehouses, subject tables can be analogous to the concept of dimensions or slowly-changing dimensions while linked tables can be analogous to facts.


Synthetic data

Artificially generated information that can be used in place of real historic data.


Training subjects

The data subjects and their linked data that will be used for training the AI model used for synthetic data generation.