Using the Column details tab, you can optionally configure how MOSTLY AI will process your database’s columns during synthetization. Here you can include or exclude columns, tailor their processing to your use case, and replace the contents of certain types of columns with mock data.
Each column in your database has a generation method assigned. This method refers to how a column will be rendered to the synthetic dataset and indicates the available configuration options.
MOSTLY AI preconfigured these generation methods on the basis of the subject table classification that you specified in Step 5. The table below sets out which generation methods there are, the table roles they appear in, and the configuration options that are available. To learn more about a configuration option, click on its name to jump to the corresponding section on this page.
All generation methods are fixed, except for AI-powered generation, which you can change to Mock data.
Generation method | Behavior | Roles | Configuration options |
---|---|---|---|
AI-powered generation | Uses the column for AI-powered synthetic data generation. | | |
Context foreign key | Links the entries in this table to their corresponding entries in the indicated parent table. | | |
Copy | Copies the column from the original database to the synthetic database. | | |
Mock data | Generates random data within the constraints of the configured data type and format. | | |
Primary key ID | Generates new primary key IDs for the synthetic version of the table. | | |
Smart Select foreign key | Links the entries in this table to entries in the indicated parent table. | | |
Browsing your database’s columns
Navigating to a specific column in your database is very straightforward. The Column details tab is divided into two panes. The left pane lists the tables you selected in Step 5, and the right pane lists the columns of these tables.
Each row in this list shows the column name and generation method. Here, you can also include or exclude the column from appearing in the synthetic database, and by clicking on the gear icon, you can open the Column parameters.
Configuring the encoding types of AI-powered generation columns

To configure or change an encoding type, click on the gear icon of an AI-powered generation column to open its column parameters. Having these encoding types configured correctly is essential for accurately training the synthetic data generation model.
If you want to generate random data instead of synthetic data, you can switch the generation method to Mock data. To learn more about the available mock data types and configuration options, click here to jump to the corresponding section on this page.
Below, you’ll find an overview of available encoding types and how to configure them.
Numeric
Use the Numeric encoding type to synthesize numerical values that may vary, such as weight and height.
The Precision field appears when you select this type. Here, you can optionally limit the maximum number of digits after the decimal point used during training. The values will be rounded to the number of digits that you specified.
Specifying the Precision option for the Numeric encoding type can save computational resources. By default, MOSTLY AI considers all digits to be relevant for synthetic data generation. Reducing the precision speeds up the synthesization process when the original data suggests a high precision, but where those digits are not particularly relevant. Fractional numbers, such as 0.333333333…, are an example of such high-precision numbers.
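The effect of the Precision option can be pictured with a short Python sketch. This is only an illustration of rounding to a fixed number of digits after the decimal point. The helper name `limit_precision` is hypothetical, and the half-up rounding mode is an assumption, since the rounding rule MOSTLY AI applies is not stated here.

```python
from decimal import Decimal, ROUND_HALF_UP


def limit_precision(value: str, digits: int) -> Decimal:
    """Round a numeric string to `digits` digits after the decimal point.

    Hypothetical helper for illustration; the rounding mode is an assumption.
    """
    # digits=2 gives a quantum of Decimal("0.01"), digits=0 gives Decimal("1").
    quantum = Decimal(1).scaleb(-digits)
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)


# Values with more digits than are relevant for the use case.
weights = ["0.333333333", "72.456789", "80"]
rounded = [limit_precision(w, 2) for w in weights]
print(rounded)
```

With a precision of 2, the nine decimal digits of 0.333333333 collapse to two, which is the kind of reduction that leaves fewer distinct values to learn.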
Categorical
A categorical variable has a fixed set of possible values that are already present in the input data. An example of such a variable is T-shirt size, which consists of the following categories: 'XS, S, M, L, XL'. Categorical variables prevent random values (for instance, 'XM, A, B') from appearing in your synthetic dataset.
The Rare category protection settings appear below the Encoding type section when selecting this type. These settings help with protecting rare categories. Such categories may cause re-identification of outliers among your data subjects if they’re present in the resulting synthetic data.
There are two rare category protection methods available with which you can mask these categories:
Constant | Replaces the rare categories with a constant value. The default constant value is |
Sample data | Replaces the rare categories with the categories that will appear in the synthetic version |
With the Threshold parameter, you can specify when a category is considered rare.
A Threshold of 20 means that if a category is present in 20 or fewer subjects, this category will be masked using the method you specified.
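The thresholding rule can be sketched in a few lines of Python. This is not MOSTLY AI's implementation. The function `protect_rare_categories` is hypothetical, the sketch assumes one row per subject, the `*` constant is taken from the label used in the baseball example on this page, and the "sample" branch is one possible reading of the Sample data method (drawing replacements from the values of the non-rare categories).

```python
import random
from collections import Counter


def protect_rare_categories(values, threshold=20, method="constant",
                            constant="*", seed=None):
    """Mask categories that occur in `threshold` or fewer rows.

    Hypothetical sketch: assumes each row belongs to exactly one subject.
    """
    counts = Counter(values)
    rare = {category for category, n in counts.items() if n <= threshold}

    if method == "constant":
        # Replace every rare category with the same constant label.
        return [constant if v in rare else v for v in values]

    if method == "sample":
        # Assumed interpretation: draw replacements from the non-rare values,
        # so frequent categories are proportionally more likely to be chosen.
        common = [v for v in values if v not in rare]
        if not common:
            raise ValueError("every category is rare; nothing to sample from")
        rng = random.Random(seed)
        return [rng.choice(common) if v in rare else v for v in values]

    raise ValueError(f"unknown method: {method!r}")


countries = ["USA"] * 25 + ["Belgium"] * 2 + ["Austria"]
masked = protect_rare_categories(countries, threshold=20)
print(Counter(masked))
```

In this toy input, Belgium and Austria each occur in 20 or fewer rows, so both are folded into the `*` category while USA is kept.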
The two charts below demonstrate how this works using the Baseball dataset. This dataset contains the records of 19,000 professional baseball players from 57 different countries, describing their country of origin, name, weight, height, etc.
Baseball is a popular sport in the U.S.A. and some other countries in that region of the world. It’s therefore rare to find professional baseball players in European or Asian countries. The chart below, depicting the original dataset, reflects this distribution of baseball players. Over 16,000 baseball players come from the U.S.A., whereas there are only a few players in Belgium, Austria, or the Philippines.

You can prevent these baseball players from being identified by their country of origin by masking these categories. The chart below, depicting the synthetic dataset, shows the result of replacing them with the * label. Of the 58 countries in the original dataset, only 16 are visible in the synthetic dataset. The subjects of the remaining 43 countries are in the new * category, preventing the re-identification of that sole baseball player in Greece, Indonesia, or Singapore.

Another use case for the categorical encoding type is postal codes (or ZIP codes). Specifying them as categorical rather than as a numeric column makes a big difference in the synthetic data generation process.
Most countries use numeric postal code systems; only a few use alphanumeric ones. MOSTLY AI automatically detects alphanumeric postal codes as categories, but it is likely to assign the Numeric encoding type to numeric ones.
We highly recommend verifying the encoding type for your postal code column. If the synthesization process uses the Numeric encoding type, unique postal codes may appear that aren’t present in the original dataset.
Setting the threshold lower than 20 may introduce privacy risks.
Aside from privacy protection, setting this value reduces computational resources for high-cardinality columns.
Datetime
Datetime refers to values that contain a date part and a time part. This encoding type enables MOSTLY AI to synthesize them and generate valid and statistically representative dates and times.
The following formats are supported:
Format | Example |
---|---|
Date | 2020-02-08 |
Datetime with hours | 2020-02-08 09 |
Datetime with minutes | 2020-02-08 09:30 |
Datetime with seconds | 2020-02-08 09:30:26 |
Datetime with milliseconds | 2020-02-08 09:30:26.123 |
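Before uploading data, it can help to check that a datetime column matches one of the formats in the table above. The following Python sketch does this with the standard library. The `strptime` patterns are assumptions derived from the example values, not MOSTLY AI's own format strings, and `detect_format` is a hypothetical helper.

```python
from datetime import datetime

# Assumed strptime patterns, inferred from the example values in the table.
SUPPORTED_FORMATS = {
    "Date": "%Y-%m-%d",
    "Datetime with hours": "%Y-%m-%d %H",
    "Datetime with minutes": "%Y-%m-%d %H:%M",
    "Datetime with seconds": "%Y-%m-%d %H:%M:%S",
    "Datetime with milliseconds": "%Y-%m-%d %H:%M:%S.%f",
}


def detect_format(value: str):
    """Return the name of the first matching format, or None if none match."""
    for name, pattern in SUPPORTED_FORMATS.items():
        try:
            datetime.strptime(value, pattern)
        except ValueError:
            continue  # Not this format; try the next one.
        return name
    return None


for sample in ["2020-02-08", "2020-02-08 09:30", "08/02/2020"]:
    print(sample, "->", detect_format(sample))
```

This check is only an approximation: Python's `%f` accepts one to six fractional digits and `strptime` tolerates values that are not zero-padded, so it is somewhat more lenient than the table suggests.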
You will receive an error message during the encoding stage if your format doesn’t meet the criteria specified above. MOSTLY AI does not support the following formats:
|
Text
Use the Text encoding type to synthesize unstructured natural language texts up to 1000 characters long and for a maximum of 1 column per table.
You can use this encoding type to generate realistic, representative, and anonymous financial transaction texts, short user feedback, medical assessments, PII fields, etc. As the resulting synthetic texts are representative of the terms, tokens, and their co-occurrence in the original data, they can be confidently used in analytics and machine learning use cases, such as sentiment analysis and named-entity recognition. Even though they might look noisy and not very human-readable, they will work perfectly for these use cases.
Our privacy and accuracy tests cannot detect potential leakages of protected rare categories or measure how representative the resulting synthetic texts are.
Our text synthetization model is language-agnostic and doesn’t contain the biases of some pre-trained models—any content is solely learned from the original training data. This means that it can process any language, vernacular, and slang present in the original data.
The amount of data required to produce usable results depends on the diversity of the original texts' vocabulary, categories, names, etc. As a rule of thumb, the more structure there is, the fewer samples are needed.
The synthetic texts are generated in a context-aware manner: the messages from a teenager are different from those of an 85-year-old grandmother, for instance. By considering the other attributes of a synthetic subject’s profile, MOSTLY AI is capable of synthesizing appropriate natural language texts for each of them.
Below, you can find two examples. The first example demonstrates MOSTLY AI’s ability to synthesize entirely new names from a multilingual dataset. And the second example shows the result of synthesizing Tripadvisor reviews. Here you can see that the resulting texts accurately retain the context of the establishment they discuss (Restaurant or Hotel) and the synthesized rating.
Multilingual names dataset
Original | Synthetic |
---|---|
|
|
Tripadvisor reviews
Original |
---|
|
Synthetic |
---|
|
ITT
ITT, or Inter-Transaction Time, is an encoding type that models the time interval between two subsequent events in the synthetic dataset. This encoding type causes the time between events to become very accurate, but the dates become less accurate.
You can select this encoding type for only one column with date and time information in your linked tables.
You will receive an error message during the encoding stage if you select `ITT` for a subject table column, multiple columns, or a column that doesn’t contain date or time information in the supported datetime formats.
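To make the idea of an inter-transaction time concrete, the sketch below computes the intervals between consecutive events of a single subject. It only illustrates the quantity that ITT models, not how MOSTLY AI encodes it. The function name and the sample timestamps are made up for the example.

```python
from datetime import datetime, timedelta
from typing import List


def inter_transaction_times(timestamps: List[str]) -> List[timedelta]:
    """Return the time gaps between consecutive events, in chronological order."""
    ordered = sorted(
        datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps
    )
    # Pair each event with its successor and take the difference.
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Made-up events of one subject in a linked table.
events = [
    "2020-02-08 09:30:26",
    "2020-02-08 09:45:26",
    "2020-02-09 09:45:26",
]
gaps = inter_transaction_times(events)
print(gaps)
```

A sequence of n events yields n - 1 intervals. Modeling these intervals rather than the absolute timestamps is what makes the spacing between events accurate at the expense of the exact dates.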
Latitude, Longitude
Use the Latitude, Longitude encoding type to synthesize geolocation coordinates.
MOSTLY AI requires a geolocation coordinate to be encoded in a single field with the latitude and longitude as comma-separated values.
The values must be in decimal degrees format and range from -90 to 90 for latitude and -180 to 180 for longitude. Their precision cannot be larger than five digits after the decimal point. Any additional digits will be ignored.
The table below shows a use case with three geolocation columns.
Start location | End location | Some other location |
---|---|---|
"70.31311, 150.1" | "-90.0, 180.0" | "37.311, 173.8998" |
"-39.0, 120.33114" | "78.31112, -100.031" | "-10.10, -80.901" |
When formatting the geolocation coordinates, please keep in mind that they must be enclosed in double quotes. If you want to learn more about content rules, please visit the Preparing your data section.
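The rules above (single quoted field, comma-separated decimal degrees, range limits, at most five digits after the decimal point) can be checked with a small Python sketch before uploading data. `parse_coordinate` is a hypothetical helper, and truncating the extra digits is an assumed reading of "will be ignored".

```python
from decimal import Decimal, InvalidOperation, ROUND_DOWN

FIVE_DIGITS = Decimal("0.00001")


def parse_coordinate(field: str):
    """Parse a '"lat, lon"' field into a (latitude, longitude) pair of Decimals."""
    # Drop surrounding whitespace and the enclosing double quotes.
    parts = [part.strip() for part in field.strip().strip('"').split(",")]
    if len(parts) != 2:
        raise ValueError(f"expected 'latitude, longitude', got {field!r}")
    try:
        latitude, longitude = (Decimal(part) for part in parts)
    except InvalidOperation as exc:
        raise ValueError(f"not a decimal number in {field!r}") from exc
    if not (latitude.is_finite() and longitude.is_finite()):
        raise ValueError(f"non-finite value in {field!r}")
    if not -90 <= latitude <= 90:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180 <= longitude <= 180:
        raise ValueError(f"longitude out of range: {longitude}")
    # Assumption: digits beyond the fifth are cut off rather than rounded.
    return (
        latitude.quantize(FIVE_DIGITS, rounding=ROUND_DOWN),
        longitude.quantize(FIVE_DIGITS, rounding=ROUND_DOWN),
    )


print(parse_coordinate('"70.31311, 150.1"'))
print(parse_coordinate('"-10.1234567, -80.901"'))
```

Using `Decimal` instead of `float` keeps the digits exactly as they appear in the file, which makes the five-digit cut-off easy to reason about.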
Generating mock data instead of synthetic data
Instead of synthetic data, you can also choose to generate mock data: random data that is generated within the constraints of a configured data type and format. To do so, click on the gear icon of an AI-powered generation column, and change the generation method to Mock data.
MOSTLY AI can produce random numbers, names, addresses, and other personal details. If you want more precise control over the output, then you can select the Custom string option.
By creating a string pattern and providing a character range, you can generate random, but true-to-life phone numbers, transaction IDs, license plates, or any other type of information that is structured as a series of digits and letters.
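The general idea of pattern-based string generation can be sketched as follows. The pattern syntax used here is entirely hypothetical (`#` for a random digit, `@` for a random letter, every other character copied literally) and is not MOSTLY AI's Custom string syntax, which is not described in this section. The character ranges are passed in as parameters to mirror the idea of providing a character range.

```python
import random
import string


def random_from_pattern(pattern: str, rng: random.Random,
                        digits: str = string.digits,
                        letters: str = string.ascii_uppercase) -> str:
    """Fill a pattern with random characters (hypothetical syntax).

    '#' becomes a random character from `digits`,
    '@' becomes a random character from `letters`,
    anything else is copied unchanged.
    """
    out = []
    for ch in pattern:
        if ch == "#":
            out.append(rng.choice(digits))
        elif ch == "@":
            out.append(rng.choice(letters))
        else:
            out.append(ch)
    return "".join(out)


rng = random.Random(42)  # Seeded only to make the example reproducible.
license_plate = random_from_pattern("@@-###", rng)
transaction_id = random_from_pattern("TX-########", rng)
print(license_plate, transaction_id)
```

Because the literal characters are preserved, the output keeps the shape of a license plate or transaction ID while the variable positions are random.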
Below, you’ll find an overview of available mock data types and formats.
Data type | Format | Description |
---|---|---|
Person | Full name | Generates random full names from the English language. |
Person | First name | Generates random first names from the English language. |
Person | Last name | Generates random last names from the English language. |
Bank IBAN | No available format | Generates random, life-like IBAN bank account numbers. They conform to the official format, are of the expected length, and have a valid checksum. |
Email | No available format | Generates random emails. |
Address | Full address | Generates random address details. |
Address | City | Generates random cities. |
Address | Country | Generates random countries. |
Number | No available format | Generates random numbers. |
Custom string | String pattern | Generates random strings based on a string pattern. |
Constant | No available format | Writes a constant value. |
Row number | No available format | Writes the row number. |
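The table states that generated IBANs carry a valid checksum. As background, the standard IBAN check (ISO 13616, mod 97) can be sketched in Python as shown below. This illustrates the public checksum rule only, not MOSTLY AI's generator. The function name is hypothetical, and country-specific length rules are deliberately left out.

```python
def iban_checksum_is_valid(iban: str) -> bool:
    """Check the mod-97 checksum of an IBAN (length rules per country not checked)."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # Replace letters with numbers: A=10, B=11, ..., Z=35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    # A valid IBAN leaves a remainder of 1 when divided by 97.
    return int(digits) % 97 == 1


# Widely published sample IBANs, not real accounts.
print(iban_checksum_is_valid("GB82 WEST 1234 5698 7654 32"))
print(iban_checksum_is_valid("GB82 WEST 1234 5698 7654 33"))
```

The first call uses a commonly cited example IBAN and passes the check; the second differs in its last digit and fails it.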
Setting the format of a Primary key ID

Primary keys are unique identifiers for each entry in a table and can come in different formats. MOSTLY AI can generate primary keys in a sequential format, as UUID version 1 (with or without hyphens), or in a proprietary hash format. Please select the format that’s present in the original data.
Examples:
UUID: 7b20cc44-da3b-11eb-810e-acde48001122
UUID-no-hyphen: 7b20cc44da3b11eb810eacde48001122
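To see how the two UUID notations relate, Python's standard `uuid` module can parse the example above and generate new version 1 UUIDs. This is a sketch for inspecting formats in your own data, not a description of how MOSTLY AI generates keys.

```python
import uuid

# Parse the documented example and inspect it.
example = uuid.UUID("7b20cc44-da3b-11eb-810e-acde48001122")
print(example.version)  # UUID version encoded in the value
print(example.hex)      # Same value without hyphens

# Generate a fresh version 1 UUID in both notations.
new_key = uuid.uuid1()
with_hyphens = str(new_key)
without_hyphens = new_key.hex
print(with_hyphens, without_hyphens)
```

Both notations encode the same 128-bit value: the hyphenated form has 36 characters, and the form without hyphens has 32.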