Using the Table details tab, you can optionally configure how MOSTLY AI will process your database’s columns during synthetization. Here you can include or exclude columns, tailor their processing to your use case, and replace the contents of certain types of columns with mock data.

Each column in your database has a generation method assigned. This method refers to how a column will be rendered to the synthetic dataset and indicates the available configuration options.

MOSTLY AI preconfigured these generation methods on the basis of the subject table classification that you specified in Step 5. The table below sets out which generation methods there are, the table roles they appear in, and the configuration options that are available. To learn more about a configuration option, click on its name to jump to the corresponding section on this page.

All generation methods are fixed, except for AI-powered generation, which you can change to Mock data.
Generation method    Behavior    Roles    Configuration options

AI-powered generation

Uses the column for AI-powered synthetic data generation.

  • Subject

  • Linked

Context foreign key

Links the entries in this table to their corresponding entries in the indicated parent table.

  • Linked

  • No available options

Copy

Copies the column from the original database to the synthetic database.

  • Reference

  • No available options

Mock data

Generates random data within the constraints of the configured data type and format.

  • Subject

  • Linked

Primary key ID

Generates new primary key IDs for the synthetic version of the table.

  • Subject

  • Linked

  • Sequential

  • UUID

  • UUID no hyphen

  • Hash

Reference foreign key

Links the entries in this table to their corresponding entries in the indicated reference table.

  • Subject

  • Linked

  • No available options

Smart Select foreign key

Links the entries in this table to entries in the indicated parent table.

  • Subject

  • Linked

  • Smart Select columns

Columns that refer to reference tables—indicated by the Reference foreign key generation method—may contain rarely used foreign key values. To prevent the reidentification of your original data subjects from the synthetic data, these values will be replaced with values that are not rare.

To explain this further, let’s assume we have a Customers table with a Job-ID column in which one of its values refers to the job Prime minister. This value will be replaced with one that refers to a non-rare job, for instance, Accountant.

If none of the values in a Reference foreign key column appears frequently enough to be considered non-rare, none of them will appear in the synthetic version of this column, which breaks the relationship’s referential integrity.

Browsing your database’s columns

Navigating to a specific column in your database is very straightforward. The Table details tab is divided into two panes. The left pane lists the tables you selected in Step 5, and the right pane lists the columns of these tables.

Each row in this list shows the column name and generation method. Here, you can also include or exclude the column from appearing in the synthetic database, and by clicking on the gear icon, you can open the Column parameters.

Configuring the encoding types of AI-powered generation columns


To configure or change an encoding type, click on the gear icon of an AI-powered generation column to open its column parameters. Having these encoding types configured correctly is essential for accurately training the synthetic data generation model.

If you want to generate random data instead of synthetic data, you can switch the generation method to Mock data. To learn more about the available mock data types and configuration options, see the section Generating mock data instead of synthetic data on this page.

Below, you’ll find an overview of available encoding types and how to configure them.


Auto-detect

Use the Auto-detect setting to let MOSTLY AI decide the appropriate encoding type when the synthetization job starts.

You can look up which encoding type MOSTLY AI has chosen when the job is running. These details will appear in the Job summary and can be accessed by clicking on the table’s kebab icon and selecting View column details.

Auto-detect 1

You’ll then see a list similar to the one below.

Auto-detect 2


Categorical

A categorical variable has a fixed set of possible values that are already present in the input data. An example of such a variable is T-shirt size, which consists of the following categories: 'XS, S, M, L, XL'. Categorical variables prevent random values (for instance, 'XM, A, B') from appearing in your synthetic dataset.

The Rare category protection settings appear below the Encoding type section when selecting this type. These settings help with protecting rare categories. Such categories may cause re-identification of outliers among your data subjects if they’re present in the resulting synthetic data.

There are two rare category protection methods available with which you can mask these categories:

Constant

Replaces rare categories with the value _RARE_.

Sample

Replaces the rare categories with the categories that will appear in the synthetic version of this column.

Select the method that’s best suited for your use case.
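
The difference between the two methods can be illustrated with a small sketch. It is not MOSTLY AI's implementation: the `min_count` threshold, the function name, and the frequency-weighted sampling are assumptions made only for this illustration, since the rule that decides when a category counts as rare is not described on this page.

```python
import random
from collections import Counter

RARE_TOKEN = "_RARE_"  # placeholder value used by the Constant method


def protect_rare(values, min_count, method="constant", seed=0):
    """Mask categories that occur fewer than `min_count` times.

    `min_count` is a hypothetical threshold chosen for this sketch.
    """
    if method not in ("constant", "sample"):
        raise ValueError("method must be 'constant' or 'sample'")
    counts = Counter(values)
    # Non-rare values, repeated as often as they occur, so that sampling
    # from this list follows their observed frequencies.
    frequent = [v for v in values if counts[v] >= min_count]
    rng = random.Random(seed)
    protected = []
    for value in values:
        if counts[value] >= min_count:
            protected.append(value)
        elif method == "constant" or not frequent:
            # Assumption: fall back to the token if nothing is non-rare.
            protected.append(RARE_TOKEN)
        else:
            protected.append(rng.choice(frequent))
    return protected


sizes = ["M"] * 4 + ["L"] * 3 + ["XXXL"]
print(protect_rare(sizes, min_count=2, method="constant"))
print(protect_rare(sizes, min_count=2, method="sample"))
```

With Constant, the single `XXXL` entry becomes `_RARE_`; with Sample, it becomes one of the categories that remain in the column.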

For categorical columns in linked tables, there’s also the Consistency correction setting. Activating consistency correction lets you bias this column towards having less diversity in your data subjects' sequences.


It helps the model remember all the previously generated values of a categorical column, so it can learn whether it needs to boost the probability of a particular category given that it has already been generated for a given subject.

To clarify how this works, let’s consider synthesizing a dataset containing customers' grocery purchases in a supermarket over time. Supermarkets carry about 40,000 items on average, whereas customers tend to consistently buy more or less the same products each time they visit.

When synthesizing the dataset without consistency correction, the number of unique products per customer may become too large. The model will assign a small probability to each of the 40,000 products, summing up to an exaggerated probability of buying some new items.

By activating consistency correction, you’re shifting the model’s predictions in favor of items that previously appeared in that customer’s sequence—as a result, reducing the probability of new items being purchased every visit.

Consistency correction helps with high cardinality columns. In this dataset, this would be the column with purchased items. The total number of unique items in this column would probably come close to the variety carried by the supermarket. However, consistency correction can already be helpful even if a column’s cardinality is only 50 to 100, as long as the column should have a high consistency over time.

Enabling consistency correction increases memory and computational requirements. We recommend turning it on only when there’s a real need for it.
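
The supermarket arithmetic can be made concrete with a toy calculation. This is only an illustration of the effect, not the way MOSTLY AI implements consistency correction: the probabilities, the fixed `boost` factor, and the function name are all assumptions. In the product, the model learns how strongly to favor previously generated values.

```python
def new_item_probability(probs, seen, boost=1.0):
    """Probability of drawing an item the customer has not bought before.

    Weights of previously seen items are multiplied by `boost` (a
    hypothetical, fixed factor) and everything is renormalized.
    """
    weights = {
        item: p * (boost if item in seen else 1.0)
        for item, p in probs.items()
    }
    total = sum(weights.values())
    unseen = sum(w for item, w in weights.items() if item not in seen)
    return unseen / total


CATALOG_SIZE = 40_000
seen = set(range(20))  # the 20 products this customer usually buys

# Assumed model output: 3% for each familiar product, and the remaining
# 40% spread thinly over the 39,980 products never bought before.
probs = {
    item: 0.03 if item in seen else 0.4 / (CATALOG_SIZE - len(seen))
    for item in range(CATALOG_SIZE)
}

baseline = new_item_probability(probs, seen)              # about 0.40
corrected = new_item_probability(probs, seen, boost=4.0)  # about 0.14
print(round(baseline, 2), round(corrected, 2))
```

Each unseen product has a tiny probability on its own, yet together they account for roughly 40% of the draws. Favoring items already in the sequence brings that share down.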


Datetime

Datetime refers to values that contain a date part and a time part. This encoding type enables MOSTLY AI to synthesize them and generate valid and statistically representative dates and times.

The following formats are supported:

Format                      Example

Date
yyyy-MM-dd                  2020-02-08

Datetime with hours
yyyy-MM-dd HH               2020-02-08 09
yyyy-MM-ddTHH               2020-02-08T09
yyyy-MM-ddTHHZ              2020-02-08T09Z

Datetime with minutes
yyyy-MM-dd HH:mm            2020-02-08 09:30
yyyy-MM-ddTHH:mm            2020-02-08T09:30
yyyy-MM-ddTHH:mmZ           2020-02-08T09:30Z

Datetime with seconds
yyyy-MM-dd HH:mm:ss         2020-02-08 09:30:26
yyyy-MM-ddTHH:mm:ss         2020-02-08T09:30:26
yyyy-MM-ddTHH:mm:ssZ        2020-02-08T09:30:26Z

Datetime with milliseconds
yyyy-MM-dd HH:mm:ss.SSS     2020-02-08 09:30:26.123
yyyy-MM-ddTHH:mm:ss.SSS     2020-02-08T09:30:26.123
yyyy-MM-ddTHH:mm:ss.SSSZ    2020-02-08T09:30:26.123Z

You will receive an error message during the encoding stage if your format doesn’t meet the criteria specified above. MOSTLY AI does not support the following formats:

  • Any format with a week number.
    Example: 2020-W06-5 (Week 6, Day 5 of 2020)

  • Any format with ordinal dates.
    Example: 2020-039 (Day 39 of 2020)

  • Formats with a time zone offset that don’t contain a Z.
    Example: 2020-02-08 09+07:00

  • Short formats that do not contain any special characters, such as -, T, Z, etc.
    Example: 20200208T0930

  • Formats that separate seconds and milliseconds with a comma.
    Example: 2020-02-08T09:30:26,123

  • Formats that separate seconds and milliseconds with a colon.
    Example: 2020-02-08 09:30:26:123

  • Date only formats that have a time zone component.
    Example: 2020-02-08Z
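
To check a column before starting a job, the supported and unsupported formats listed above can be approximated with a regular expression. The sketch below is an assumption-laden pre-check, not MOSTLY AI's parser: the function name is hypothetical, and the pattern mirrors only the formats in the table (for instance, it allows a trailing Z only after the T separator, because that is the only combination listed).

```python
import re
from datetime import date

# HH, optionally followed by :mm, :ss and .SSS (exactly three digits).
_TIME = r"(?:[01]\d|2[0-3])(?::[0-5]\d(?::[0-5]\d(?:\.\d{3})?)?)?"

# A date, optionally followed by " <time>" or "T<time>" with optional Z.
_SUPPORTED = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})"
    r"(?: " + _TIME + r"|T" + _TIME + r"Z?)?"
)


def is_supported_datetime(value: str) -> bool:
    """Return True if `value` matches one of the formats in the table."""
    match = _SUPPORTED.fullmatch(value)
    if match is None:
        return False
    try:
        # Reject impossible calendar dates such as 2020-02-30.
        date.fromisoformat(match.group("date"))
    except ValueError:
        return False
    return True


for sample in ["2020-02-08T09:30:26.123Z", "2020-02-08 09+07:00", "2020-W06-5"]:
    print(sample, is_supported_datetime(sample))
```

The first sample is accepted and the other two are rejected, which matches the lists above.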


Numeric

Use the Numeric encoding type to synthesize numerical values that may vary, such as weight and height.

MOSTLY AI can synthesize floating-point values with a precision of up to 8 digits after the decimal point.


Latitude, Longitude

Use the Latitude, Longitude encoding type to synthesize geolocation coordinates.

MOSTLY AI requires a geolocation coordinate to be encoded in a single field with the latitude and longitude as comma-separated values. The latitude must be on the comma’s left side and the longitude on the right.

The values must be in decimal degrees format and range from -90 to 90 for latitude and -180 to 180 for longitude. Their precision cannot exceed five digits after the decimal point. Any additional digits will be ignored.

The table below shows a use case with three geolocation columns.

Start location        End location          Some other location
70.31311, 150.1       -90.0, 180.0          37.311, 173.8998
-39.0, 120.33114      78.31112, -100.031    -10.10, -80.901
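
The requirements for geolocation fields can be checked with a short parser such as the one sketched below. The function name is hypothetical, and the sketch assumes that "ignored" means digits beyond the fifth decimal place are cut off rather than rounded.

```python
from decimal import Decimal, InvalidOperation, ROUND_DOWN
from typing import Tuple

_FIVE_PLACES = Decimal("0.00001")


def parse_lat_lon(field: str) -> Tuple[float, float]:
    """Parse a 'latitude, longitude' field given in decimal degrees."""
    parts = field.split(",")
    if len(parts) != 2:
        raise ValueError("expected exactly one comma between latitude and longitude")
    try:
        lat, lon = (Decimal(part.strip()) for part in parts)
    except InvalidOperation:
        raise ValueError(f"not a decimal degrees value: {field!r}") from None
    if not (lat.is_finite() and lon.is_finite()):
        raise ValueError("coordinates must be finite numbers")
    if not -90 <= lat <= 90:
        raise ValueError("latitude must be between -90 and 90")
    if not -180 <= lon <= 180:
        raise ValueError("longitude must be between -180 and 180")
    # Assumption: extra digits are dropped (truncated), not rounded.
    lat = lat.quantize(_FIVE_PLACES, rounding=ROUND_DOWN)
    lon = lon.quantize(_FIVE_PLACES, rounding=ROUND_DOWN)
    return float(lat), float(lon)


print(parse_lat_lon("70.31311, 150.1"))
print(parse_lat_lon("-39.0, 120.331149"))  # sixth decimal digit is dropped
```

Decimal arithmetic is used here so that cutting off digits does not suffer from binary floating-point rounding.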


Text

Use the Text encoding type to synthesize unstructured natural language texts up to 1000 characters long.

You can use this encoding type to generate realistic, representative, and anonymous financial transaction texts, short user feedback, medical assessments, PII fields, etc. As the resulting synthetic texts are representative of the terms, tokens, and their co-occurrence in the original data, they can be confidently used in analytics and machine learning use cases, such as sentiment analysis and named-entity recognition. Even though they might look noisy and not very human-readable, they will work perfectly for these use cases.

Our privacy and accuracy tests cannot detect potential leakages of protected rare categories or measure how representative the resulting synthetic texts are.

Our text synthetization model is language-agnostic and doesn’t contain the biases of some pre-trained models—any content is solely learned from the original training data. This means that it can process any language, vernacular, and slang present in the original data.

The amount of data required to produce usable results depends on the diversity of the original texts' vocabulary, categories, names, etc. As a rule of thumb, the more structure there is, the fewer samples are needed.

The synthetic texts are generated in a context-aware manner: the messages from a teenager are different from those of an 85-year-old grandmother, for instance. By considering the other attributes of a synthetic subject’s profile, MOSTLY AI is capable of synthesizing appropriate natural language texts for each of them.

Below, you can find two examples. The first example demonstrates MOSTLY AI’s ability to synthesize entirely new names from a multilingual dataset. And the second example shows the result of synthesizing Tripadvisor reviews. Here you can see that the resulting texts accurately retain the context of the establishment they discuss (Restaurant or Hotel) and the synthesized rating.

Multilingual names dataset

Original
    Nationality     Name
 1: Czech           Svoboda
 2: Greek           Chrysanthopoulos
 3: Spanish         Ventura
 4: Russian         Gagarin
 5: Japanese        Yokoyama
 6: English         Parsons
 7: Spanish         Ruiz
 8: Russian         Chekhov
 9: English         Blake
10: English         Wigley

Synthetic
    Nationality     Name
 1: English         Olsewood
 2: German          Kort
 3: Japanese        Misaghi
 4: English         Roger
 5: Russian         Lusov
 6: Russian         Zhuszenko
 7: Japanese        Noraghi
 8: English         Dalman
 9: Russian         Michov
10: Polish          Poskan
11: Arabic          Shaif

Tripadvisor reviews

Original
   Establishment    Rating  Review
1: Restaurant       6       Not bad, great interior but let down by
                            unimaginative food. Perfectly good for a
                            quick lunch or drink though, good ales!
2: Hotel            2       Awful!!! stunk of smoke! guttering outside
                            window ledge filled with cigarette ends and
                            bottles. NOISY air con unit in room, husband
                            had no sleep.
3: Restaurant       8       Helpful staff, pleasant enough with quick
                            service. Sat at bar by revolving food server.
                            Everything seemed nice and fresh. Good value.
4: Hotel            8       We stayed in a standard room at the hotel.
                            The room was adequate, though a bit short on
                            cupboard/ drawer space.
5: Hotel            4       Expected much more from here and they just
                            didn't deliver, for the price of the room it
                            was no different than any of the other cheaper
                            aparthotels.
Synthetic
   Establishment    Rating  Review
1: Restaurant       8       My only complaint are the portion sizes. Lovely
                            restaurant with good food, though.
2: Restaurant       8       I'm a year-round regular. Service is really
                            friendly. The starters are OK and the seafood
                            buffet is amazing and tasty. Overall a nice menu
                            throughout and our children love it.
3: Hotel            8       This is a fantastic hotel. Great food but few
                            options, a brilliant room and spent an excellent
                            time. Very clean environment and a high level of
                            service.
4: Hotel            2       I booked an offer for a spa day. The food was
                            below-average, the room was dated, smelled of
                            fried fish, and the staff has an attitude.
5: Hotel            2       We've been here before. But for £70, the rooms
                            are still poor and glamourless. We spent a few
                            days with 5 people and 4 were not impressed.


Character sequence

Use the Character sequence encoding type to synthesize short strings with a consistent pattern, such as phone numbers, license plate numbers, company IDs, transaction IDs, and social security IDs.


ITT

ITT, or Inter-Transaction Time, is an encoding type that models the time interval between two subsequent events in the synthetic dataset. This encoding type causes the time between events to become very accurate, but the dates become less accurate.

You can select this encoding type for only one column with date and time information in your linked tables.
You will receive an error message during the encoding stage if you select `ITT` for a subject table column, for multiple columns, or for a column that doesn’t contain date or time information in the supported datetime formats.
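
To make the idea of inter-transaction time concrete, the sketch below derives the intervals between a subject's events and rebuilds the timestamps from a starting point plus those intervals. It only illustrates the concept and is not MOSTLY AI's encoding; the function names are hypothetical. One plausible reading of the accuracy trade-off is visible here: when dates are derived by accumulating intervals, any deviation in an interval shifts every later date.

```python
from datetime import datetime, timedelta
from typing import List


def inter_transaction_times(timestamps: List[datetime]) -> List[timedelta]:
    """Time elapsed between each pair of subsequent events of one subject."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def rebuild_timestamps(first: datetime, gaps: List[timedelta]) -> List[datetime]:
    """Recover event times by accumulating the intervals from the first event."""
    result = [first]
    for gap in gaps:
        result.append(result[-1] + gap)
    return result


events = [
    datetime(2020, 2, 8, 9, 30),
    datetime(2020, 2, 8, 9, 45),
    datetime(2020, 2, 8, 11, 0),
]
gaps = inter_transaction_times(events)
print(gaps)                                  # intervals of 15 and 75 minutes
print(rebuild_timestamps(events[0], gaps))   # the original three timestamps
```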


Generating mock data instead of synthetic data

Instead of synthetic data, you can also choose to generate mock data — random data that is generated within the constraints of a configured data type and format. To do so, click on the gear icon of an AI-powered generation column, and change the generation method to Mock data.

MOSTLY AI can produce random numbers, names, addresses, and other personal details. If you want more precise control over the output, then you can select the Custom string option.

By creating a string pattern and providing a character range, you can generate random, but true-to-life phone numbers, transaction IDs, license plates, or any other type of information that is structured as a series of digits and letters.

Below, you’ll find an overview of available mock data types and formats.

Data type Format Description

Person

Full name

Generates random full names from the English language.

Examples:

'Norma Fisher'
'Jorge Sullivan'
'Elizabeth Woods'
'Susan Wagner'
'Peter Montgomery'

First name

Generates random first names from the English language.

Examples:

'Megan'
'Katherine'
'Robert'
'Jonathan'
'William'

Last name

Generates random last names from the English language.

Examples:

'Richard'
'Chang'
'Fisher'
'Green'
'Dixon'

Bank IBAN

No available format

Generates random, life-like IBAN bank account numbers. They conform to the official format, are of the expected length, and have a valid checksum.

Generated IBANs that turn out to be valid in real life are purely coincidental.

Examples:

'GB84MYNB48764759382421'
'GB13TZIR92411578156593'
'GB51RPOQ40801609753513'
'GB03SHHZ28711587148418'
'GB56KRGZ98947196593423'

Email

No available format

Generates random emails.

Each email address contains the domain name of a free email service provider.

Examples:

'tamaramorrison@hotmail.com'
'martha10@hotmail.com'
'deborah64@gmail.com'
'franklinjames@yahoo.com'
'samanthasims@yahoo.com'

Address

Full address

Generates random address details.

Each full address is formatted as a string containing a street name and house number, place of residence, and postal code.

Examples:

'48764 Howard Forge Apt. 421\nVanessaside, PA 19763'
'578 Michael Island\nNew Thomas, NC 34644'
'60975 Jessica Squares\nEast Sallybury, FL 71671'
'8714 Mann Plaza\nLisaside, PA 72227'
'96593 White View Apt. 094\nJonesberg, FL 05565'

City

Generates random cities.

Examples:

'Changchester'
'West Tammyfort'
'Hullport'
'Howardborough'
'West Donald'

Country

Generates random countries.

Examples:

'Tanzania'
'Hungary'
'Senegal'
'Tuvalu'
'Italy'

Number

No available format

Generates random numbers.

Set the range by specifying the minimum and maximum values. To specify the precision, enter the number of digits after the decimal point in the precision field.

Custom string

String pattern

Generates random strings based on a string pattern.

Format your string using the # and ? symbols as placeholders for random values:

  • Number signs (#) are replaced with a random digit (0 to 9).

  • Question marks (?) are randomly drawn from the characters you entered in the Character range field.


Usage Examples:

'Telephone: +1 (###) ###-####'
'Company ID: Company ????????'
'Transaction ID: ???-####-?######'

Constant

No available format

Writes a constant value.
Enter the value to be written in the input field.

Row number

No available format

Writes the row number.
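
The placeholder rules of the Custom string data type described above can be sketched in a few lines. This is a stand-alone illustration, not the product's generator; the function name and the way the character range is passed in are assumptions.

```python
import random
import string


def fill_pattern(pattern: str, char_range: str, rng: random.Random = None) -> str:
    """Replace '#' with a random digit and '?' with a character from `char_range`.

    All other characters in the pattern are copied unchanged.
    """
    if "?" in pattern and not char_range:
        raise ValueError("the pattern uses '?' but no character range was given")
    rng = rng or random.Random()
    result = []
    for symbol in pattern:
        if symbol == "#":
            result.append(rng.choice(string.digits))
        elif symbol == "?":
            result.append(rng.choice(char_range))
        else:
            result.append(symbol)
    return "".join(result)


print(fill_pattern("+1 (###) ###-####", char_range=""))
print(fill_pattern("Company ????????", char_range=string.ascii_uppercase))
print(fill_pattern("???-####-?######", char_range="ABCDEF"))
```

Passing a seeded `random.Random` instance makes the output reproducible, which can be useful when comparing runs.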

Setting the format of a Primary key ID


Primary keys are unique identifiers for each entry in a table and can come in different formats. MOSTLY AI can generate primary keys in a sequential format, as UUID version 1 (with or without hyphens), or in a proprietary hash format. Please select the format that’s present in the original data.

Examples:

UUID: 7b20cc44-da3b-11eb-810e-acde48001122
UUID-no-hyphen: 7b20cc44da3b11eb810eacde48001122
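
For reference, keys in the sequential and UUID formats can be reproduced with Python's standard library, as in the sketch below. The helper names and the starting value of 1 for sequential keys are assumptions for illustration, and the proprietary hash format is not covered.

```python
import uuid
from itertools import count


def sequential_ids(start: int = 1):
    """Yield consecutive integer keys; starting at 1 is an assumption."""
    return count(start)


def uuid_with_hyphens() -> str:
    """UUID version 1 in its canonical 36-character form."""
    return str(uuid.uuid1())


def uuid_no_hyphens() -> str:
    """The same kind of UUID as 32 hexadecimal characters."""
    return uuid.uuid1().hex


ids = sequential_ids()
print(next(ids), next(ids), next(ids))
print(uuid_with_hyphens())
print(uuid_no_hyphens())
```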