A type of synthesization job that processes Parquet and CSV files directly uploaded by the user.
A document that describes the structure of the contents of a data source: how many tables there are, how they are related, which columns they have, and how they should be synthesized.
When you create a catalog, MOSTLY AI connects to the data source you specified, extracts information about the data inside, and sets it up for repeatable synthetic data generation.
Using catalogs for synthetic data generation has the following benefits:
They simplify the management and reuse of synthesization settings.
They allow the contents of the data source to change, so you can always share up-to-date synthetic copies of that data source.
They can be fully automated and integrated into your data pipelines using MOSTLY AI’s API.
The Coherence tab in the Model QA report and Data QA report is available only for time-series data in linked tables. The Coherence tab contains bivariate plots that show the autocorrelations in time-series data and compare them between the original and the synthetic data.
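As a rough illustration of what comparing autocorrelations between original and synthetic data means, the sketch below computes a plain lag-k autocorrelation for two short sequences. It is not how the QA reports compute their plots; the function name `autocorrelation` and the sample values are assumptions made for this example.

```python
from statistics import mean


def autocorrelation(series, lag):
    """Return the lag-k autocorrelation of a numeric sequence (hypothetical helper)."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mu = mean(series)
    denom = sum((x - mu) ** 2 for x in series)
    if denom == 0:
        # A constant series has no variance; report zero correlation.
        return 0.0
    num = sum((series[t] - mu) * (series[t + lag] - mu) for t in range(n - lag))
    return num / denom


# Made-up example sequences, standing in for one subject's time series.
original = [1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4]
synthetic = [1, 3, 2, 4, 5, 3, 4, 2, 1, 3, 2, 4]

for lag in (1, 2, 3):
    print(lag, round(autocorrelation(original, lag), 3),
          round(autocorrelation(synthetic, lag), 3))
```

The closer the two values are at each lag, the better the synthetic sequence preserves the temporal structure of the original.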
A connector is an object in MOSTLY AI that you create to define connections to databases and cloud storage buckets. You can use any created connector as a data source or data destination where MOSTLY AI can automatically deliver the generated synthetic data.
When synthesizing data, MOSTLY AI must ensure the privacy, accuracy, and referential integrity of your data. To do so, MOSTLY AI classifies the relationships between your tables into Context and Smart Select relationships.
Context relationships are critical for the privacy and accuracy of the synthetic data.
In a two-table scenario with one subject table and one linked table, the linked table commonly has a foreign key that points to the primary key of the subject table. In such cases, MOSTLY AI identifies this as the context relationship. The term "context" refers to the fact that during synthesis the synthetic data in the linked table is generated in the context of the subject table.
In a multi-table scenario, if you have a linked table that has foreign keys to two subject tables, you can define only one of these relationships as the context. This ensures that the synthetic data fully retains the referential integrity and the correlations between the linked table and the context subject table. For the second subject table, you can define a Smart Select relationship, which retains the referential integrity between the two tables but retains their correlations only on a best-effort basis.
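Referential integrity, as used above, means that every foreign key value in the linked table matches a primary key in the subject table. The following minimal sketch checks this for two small in-memory tables. The table contents, the column names `id` and `customer_id`, and the helper `orphaned_foreign_keys` are all hypothetical and are not part of MOSTLY AI.

```python
def orphaned_foreign_keys(subject_rows, linked_rows, pk, fk):
    """Return foreign key values in linked_rows that match no primary key."""
    primary_keys = {row[pk] for row in subject_rows}
    return [row[fk] for row in linked_rows if row[fk] not in primary_keys]


# Hypothetical subject table (customers) and linked table (orders).
customers = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 3},  # points to a customer that does not exist
]

print(orphaned_foreign_keys(customers, orders, pk="id", fk="customer_id"))
```

An empty result would indicate that the linked table retains referential integrity with respect to the subject table.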
The process of creating a synthetic version that contains more entries than the original.
The entities whose privacy you want to protect.
Extreme values are the smallest and largest values in a distribution. They can occur in columns with the Numeric, Datetime, ITT, Latitude, and Longitude encoding types, and also in the number of linked records belonging to each data subject, that is, the length of their lists, sequences, or time-series data.
A feature that lets you generate more data from a completed synthesization job. You can generate more data by specifying how many data subjects you want to generate. Alternatively, if the job synthesized a subject table and a linked table, you can upload another subject table to generate its linked table counterpart, as long as the table’s columns and data types are identical.
A job type that reuses an already trained AI model using a subject table as a seed.
A job type that reuses an already trained AI model to generate a specified number of subjects.
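To make the two kinds of extreme values concrete, the sketch below finds the smallest and largest value in a numeric column and the shortest and longest sequence of linked records per subject. It only locates the extremes and does not show how MOSTLY AI handles them; the helper names and sample rows are assumptions for this example.

```python
from collections import Counter


def column_extremes(values):
    """Return the (smallest, largest) value of a column."""
    return min(values), max(values)


def sequence_lengths(linked_rows, fk):
    """Count how many linked records belong to each subject."""
    return Counter(row[fk] for row in linked_rows)


# Hypothetical linked table: transactions keyed by customer_id.
transactions = [
    {"customer_id": 1, "amount": 12.5},
    {"customer_id": 1, "amount": 9800.0},
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 2, "amount": 0.01},
]

amounts = [row["amount"] for row in transactions]
lengths = sequence_lengths(transactions, fk="customer_id")

print(column_extremes(amounts))
print(column_extremes(list(lengths.values())))
```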
The subjects that will appear in the synthetic data.
A table with one or more foreign keys that refer to subject tables or other linked tables. It commonly contains lists, sequences, or time-series properties of the subjects whose privacy you want to protect.
In data warehouses, they can be analogous to the concept of facts while subject tables can be analogous to dimensions or slowly-changing dimensions.
MOSTLY AI allows you to generate mock data instead of AI-powered synthetic data. The difference between the two is that mock data is not modeled after the original data. Instead, it is random data generated within the constraints of a data type and format you specify.
You can use it to generate test data, particularly if you need strings with a consistent pattern, such as phone numbers, license plate numbers, company IDs, transaction IDs, and social security IDs.
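The idea of random data constrained to a format can be sketched as follows. The pattern syntax used here (`#` for a digit, `A` for an uppercase letter, anything else copied literally) is an assumption made for this example and is not MOSTLY AI's own format notation.

```python
import random
import string


def mock_string(pattern, rng):
    """Fill a pattern with random characters (hypothetical pattern syntax)."""
    out = []
    for ch in pattern:
        if ch == "#":
            out.append(rng.choice(string.digits))           # random digit
        elif ch == "A":
            out.append(rng.choice(string.ascii_uppercase))  # random uppercase letter
        else:
            out.append(ch)                                  # literal character
    return "".join(out)


rng = random.Random(42)  # seeded so repeated runs give the same values
phone_numbers = [mock_string("+1-###-###-####", rng) for _ in range(3)]
license_plates = [mock_string("AA-####", rng) for _ in range(3)]

print(phone_numbers)
print(license_plates)
```

Because the values depend only on the pattern and the random generator, nothing about any original dataset is needed or reflected in the output.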
The input or source data used for synthesization.
Data that you can manipulate and control when generating synthetic data from original data. During this process, the statistical features of the data can be rebalanced, imputed, or have a stricter or looser adherence to the detected distributions and correlations. This allows you to improve downstream ML model performance, simulate what-if scenarios, or generate test data that has an improved ability to reveal defects in your software. It can also be useful for analytical purposes and data exploration.
A category that rarely occurs in columns with the categorical encoding type, thereby posing a re-identification risk for data subjects.
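A simple way to see which categories are rare is to count occurrences and flag those below a threshold. The sketch below does only that; the threshold `min_count`, the helper name, and the sample column are hypothetical, and the code does not reflect the rule MOSTLY AI actually applies.

```python
from collections import Counter


def rare_categories(values, min_count):
    """Return the categories that occur fewer than min_count times, sorted."""
    counts = Counter(values)
    return sorted(cat for cat, n in counts.items() if n < min_count)


# Hypothetical categorical column: one profession appears only once.
professions = ["nurse"] * 6 + ["teacher"] * 5 + ["astronaut"]

print(rare_categories(professions, min_count=5))
```

A subject with a category that almost nobody else shares is easier to single out, which is why such categories are treated as a re-identification risk.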
A table that does not contain data subjects (that must be privacy-protected) or information about them.
MOSTLY AI processes data from Reference tables based on relationships with other tables. Reference tables are not copied to the synthetic dataset. This helps to prevent any potential data leaks.
A relationship that references a reference table.
When synthesizing databases, MOSTLY AI must ensure the privacy, accuracy, and referential integrity of the synthetic data. To do so, MOSTLY AI analyzes the schema and classifies the relationships into Context and Smart Select relationships.
In a multi-table scenario, if you have a linked table that has foreign keys to two subject tables, you can define one of these relationships as the Context. This ensures that the synthetic data fully retains the referential integrity and the correlations between the linked table and the context subject table. For the second subject table (or a third, fourth, and so on), you can define a Smart Select relationship, which fully retains the referential integrity between the two tables but retains their correlations only on a best-effort basis.
You can specify one or more columns as Smart Select columns in the second subject table. MOSTLY AI then uses the Smart Select columns to find appropriate matches with the entries in the linked table of that relationship. The goal is to achieve a more accurate rendering of the relationships in the generated synthetic dataset.
A table that contains profile information of the subjects whose privacy you want to protect. Its attributes can describe their name, gender, height, place of residence, income, etc.
In data warehouses, they can be analogous to the concept of dimensions or slowly-changing dimensions while linked tables can be analogous to facts.
Artificially generated information that can be used in place of real historic data.
The data subjects and their linked data that will be used for training the AI model used for synthetic data generation.