Encoding types

For each column in your original data, you can select an encoding type which defines how the generative AI model in MOSTLY AI trains on your data and how it generates your synthetic data. The platform will automatically detect the most common data types typically present in datasets: Numeric, Categorical, and Datetime. All other encoding types need to be selected manually.

Numeric

For numeric values in a column, you can select one of the available Numeric encoding types: Auto (default), Digit, Discrete, and Binned.

Auto

With Numeric:Auto, MOSTLY AI uses heuristics to decide the most appropriate Numeric encoding type based on the data in a column. For most cases, select Numeric:Auto or leave it selected by default.

Digit

Digit is the same encoding type that was named Numeric in previous versions of MOSTLY AI. It treats the data in the column as numeric values. For columns with the Digit encoding type, you can keep Extreme value protection enabled or disable it.

MOSTLY AI can synthesize floating-point values with a precision of up to 8 digits after the decimal point.

Discrete

Discrete treats the numeric data in the column as categorical values. You can use this option for columns that have categorical numeric codes, such as:

  • ZIP codes, postal codes, country phone codes
  • binary True or False that are represented as the numeric values 0 and 1
  • any categorical data which are represented with numeric values

Extreme value protection and Rare category protection are ignored for columns with Discrete encoding type.

Binned

Use Binned for columns that contain large integers or long decimals, which would otherwise lead to long training times with the Digit encoding type (previously named Numeric). MOSTLY AI bins the numerical values into 100 bins and considers each bin a category during training. During generation, MOSTLY AI samples values from the corresponding bin to generate the synthetic values in the column.
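The bin-then-sample idea can be illustrated with a minimal sketch. It is not MOSTLY AI's implementation: the equal-width bin edges, the uniform sampling inside a bin, and the function names `fit_bins`, `to_bin`, and `sample_from_bin` are all assumptions made for illustration. Only the bin count of 100 comes from the text above.

```python
import random

N_BINS = 100  # bin count stated in the documentation


def fit_bins(values, n_bins=N_BINS):
    """Return (lo, width) for equal-width bins. Equal width is an assumption."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for constant columns
    return lo, width


def to_bin(value, lo, width, n_bins=N_BINS):
    """Map a value to its bin index in [0, n_bins - 1] (the training category)."""
    idx = int((value - lo) / width)
    return min(max(idx, 0), n_bins - 1)


def sample_from_bin(idx, lo, width, rng=random):
    """Draw a value from inside a bin. Uniform sampling is an assumption."""
    start = lo + idx * width
    return rng.uniform(start, start + width)


# Usage: the model only sees the bin index; a concrete value is drawn afterwards.
lo, width = fit_bins([0.0, 1000.0])
bin_index = to_bin(255.0, lo, width)
synthetic_value = sample_from_bin(bin_index, lo, width)
```

Because the model only has to learn 100 categories rather than every digit of a long number, training is cheaper, at the cost of losing detail inside each bin.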

Categorical

A categorical variable has a fixed set of possible values that are already present in the input data. An example of such a variable is T-shirt size, which could consist of the following categories: 'XS, S, M, L, XL'. The synthetic data will only contain categories that were present in the original data. Categorical variables thus prevent random values (for instance, 'XM, A, B') from appearing in your synthetic dataset.

If the automatic encoding type detection does not recognize a Numeric or Datetime column as such, it is encoded as Categorical.

The Rare category protection settings appear below the Encoding type section when selecting this type. These settings help with protecting rare categories. Such categories may cause re-identification of outliers among your data subjects if they’re present in the resulting synthetic data. If you want to create privacy-safe synthetic data, you should keep rare category protection enabled.

There are two rare category protection methods available with which you can mask these categories:

Constant
Replaces rare categories with the value _RARE_.

The use of Constant will likely introduce a new category in your synthetic data and thus might have an impact on your downstream tasks. To avoid that, you can use the Sample method.

Sample
Replaces the rare categories with categories that will appear in the synthetic version of this column. The categories are sampled from the original data based on their frequency. The more frequent a category is, the more likely it will be selected.
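The difference between the two methods can be sketched as follows. This is an illustration only: the documentation does not say how MOSTLY AI decides that a category is rare, so the `threshold` parameter (a minimum count) and the function name `protect_rare` are hypothetical.

```python
import random
from collections import Counter

RARE_TOKEN = "_RARE_"  # replacement value used by the Constant method


def protect_rare(values, threshold, method="constant", rng=random):
    """Mask categories that occur fewer than `threshold` times.

    `threshold` is a hypothetical stand-in for the platform's own rarity rule.
    """
    if method not in ("constant", "sample"):
        raise ValueError(f"unknown method: {method!r}")
    counts = Counter(values)
    frequent = {cat: n for cat, n in counts.items() if n >= threshold}
    if method == "sample" and not frequent:
        raise ValueError("no frequent categories to sample from")
    categories = list(frequent)
    weights = [frequent[cat] for cat in categories]

    protected = []
    for value in values:
        if value in frequent:
            protected.append(value)
        elif method == "constant":
            protected.append(RARE_TOKEN)
        else:
            # Sample: more frequent categories are more likely to be chosen.
            protected.append(rng.choices(categories, weights=weights, k=1)[0])
    return protected


sizes = ["M"] * 5 + ["L"] * 4 + ["XXS"]
with_constant = protect_rare(sizes, threshold=3, method="constant")
with_sample = protect_rare(sizes, threshold=3, method="sample")
```

With Constant, the single `XXS` entry becomes `_RARE_`, which adds a new category to the column. With Sample, it becomes `M` or `L`, so the set of categories stays within the ones that are already frequent.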

Datetime

Datetime refers to values that contain a date part and a time part. This encoding type enables MOSTLY AI to synthesize them and generate valid and statistically representative dates and times.

The following formats are supported:

                              Format                      Example
Date                          yyyy-MM-dd                  2020-02-08
Datetime with hours           yyyy-MM-dd HH               2020-02-08 09
                              yyyy-MM-ddTHH               2020-02-08T09
                              yyyy-MM-ddTHHZ              2020-02-08T09Z
Datetime with minutes         yyyy-MM-dd HH:mm            2020-02-08 09:30
                              yyyy-MM-ddTHH:mm            2020-02-08T09:30
                              yyyy-MM-ddTHH:mmZ           2020-02-08T09:30Z
Datetime with seconds         yyyy-MM-dd HH:mm:ss         2020-02-08 09:30:26
                              yyyy-MM-ddTHH:mm:ss         2020-02-08T09:30:26
                              yyyy-MM-ddTHH:mm:ssZ        2020-02-08T09:30:26Z
Datetime with milliseconds    yyyy-MM-dd HH:mm:ss.SSS     2020-02-08 09:30:26.123
                              yyyy-MM-ddTHH:mm:ss.SSS     2020-02-08T09:30:26.123
                              yyyy-MM-ddTHH:mm:ss.SSSZ    2020-02-08T09:30:26.123Z
💡

You will receive an error message during the encoding stage if your format doesn’t meet the criteria specified above. MOSTLY AI does not support the following formats:

  • Any format with a week number
    Example: 2020-W06-5 (Week 6, Day 5 of 2020)
  • Any format with ordinal dates
    Example: 2020-039 (Day 39 of 2020)
  • Formats with a time zone offset that do not contain a Z
    Example: 2020-02-08 09+07:00
  • Short formats that do not contain any special characters, such as -, T, Z, etc.
    Example: 20200208T0930
  • Formats that separate seconds and milliseconds with a comma
    Example: 2020-02-08T09:30:26,123
  • Formats that separate seconds and milliseconds with a colon
    Example: 2020-02-08 09:30:26:123
  • Date only formats that have a time zone component
    Example: 2020-02-08Z
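To catch unsupported values before uploading data, a pre-check along the following lines may help. It is a sketch, not the platform's own validation: `is_supported` is a hypothetical helper, and it assumes that only the formats listed in the table are accepted, so a trailing `Z` is allowed only together with the `T` separator.

```python
import re
from datetime import datetime

# HH, optionally followed by :mm, :ss and .SSS
_TIME = r"\d{2}(?::\d{2}(?::\d{2}(?:\.\d{3})?)?)?"
# A date, optionally followed by " time" or "Ttime" with an optional Z.
# Assumption: Z is only valid in the T-separated variants listed above.
_SUPPORTED = re.compile(rf"\d{{4}}-\d{{2}}-\d{{2}}(?: {_TIME}|T{_TIME}Z?)?")


def is_supported(value):
    """Return True if `value` has one of the listed shapes and is a real date."""
    if _SUPPORTED.fullmatch(value) is None:
        return False
    # The shape is fine; now reject impossible values such as month 13.
    parts = [int(p) for p in re.findall(r"\d+", value)]
    if len(parts) == 7:
        parts[6] *= 1000  # milliseconds to microseconds
    try:
        datetime(*parts)
    except ValueError:
        return False
    return True


candidates = ["2020-02-08T09:30Z", "2020-W06-5", "2020-02-08 09:30:26,123"]
unsupported = [c for c in candidates if not is_supported(c)]
```

Running a check like this over a column gives a list of offending values to fix before the encoding stage reports an error.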

ITT

ITT, or Inter-Transaction Time, is an encoding type that models the time interval between two subsequent events in the synthetic dataset. With this encoding type, the intervals between events are reproduced very accurately, while the absolute dates become less accurate.

The ITT encoding type is only available for linked tables.
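The quantity that ITT refers to can be shown with a short sketch. It only illustrates what an inter-transaction time is and how dates can be rebuilt from intervals; it does not describe how MOSTLY AI models them internally. The helper names `to_intervals` and `from_intervals` are hypothetical.

```python
from datetime import datetime, timedelta


def to_intervals(timestamps):
    """Return the gaps between consecutive events of one subject."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def from_intervals(start, intervals):
    """Rebuild event timestamps from a start time and a list of gaps."""
    events = [start]
    for gap in intervals:
        events.append(events[-1] + gap)
    return events


events = [
    datetime(2020, 2, 8, 9, 30),
    datetime(2020, 2, 8, 9, 45),
    datetime(2020, 2, 9, 9, 45),
]
gaps = to_intervals(events)  # 15 minutes, then 1 day
rebuilt = from_intervals(events[0], gaps)
```

Since every rebuilt date depends on the start time and on all earlier gaps, small deviations add up along the sequence. This is one way to see why accurate intervals can come with less accurate absolute dates.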

Latitude, Longitude

Use the Latitude, Longitude encoding type to synthesize geolocation coordinates.

MOSTLY AI requires a geolocation coordinate to be encoded in a single field with the latitude and longitude as comma-separated values. The latitude must be on the comma’s left side and the longitude on the right.

The values must be in decimal degrees format and range from -90 to 90 for latitude and from -180 to 180 for longitude. Their precision cannot exceed five digits after the decimal point, which corresponds to an accuracy of approximately 1 meter. Any additional digits are ignored.

Start location      End location          Some other location
70.31311, 150.1     -90.0, 180.0          37.311, 173.8998
-39.0, 120.33114    78.31112, -100.031    -10.10, -80.901

For CSV files, wrap each coordinate in double quotes. To learn more, see CSV files requirements.
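The rules above can be checked before upload with a small sketch. The helper names `parse_coordinate` and `to_csv_row` are hypothetical, and dropping extra digits by truncation is an assumption based on the statement that additional digits are ignored. The CSV part relies on Python's standard `csv` module, which quotes any field that contains the delimiter.

```python
import csv
import io
from decimal import ROUND_DOWN, Decimal, InvalidOperation

FIVE_DIGITS = Decimal("0.00001")


def _to_decimal(text):
    """Parse one number, raising ValueError for anything that is not finite."""
    try:
        value = Decimal(text.strip())
    except InvalidOperation:
        raise ValueError(f"not a decimal number: {text!r}") from None
    if not value.is_finite():
        raise ValueError(f"not a finite number: {text!r}")
    return value


def parse_coordinate(field):
    """Validate a 'latitude, longitude' field and cut it to five decimals."""
    lat_text, lon_text = field.split(",")  # latitude left, longitude right
    lat, lon = _to_decimal(lat_text), _to_decimal(lon_text)
    if not -90 <= lat <= 90:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180 <= lon <= 180:
        raise ValueError(f"longitude out of range: {lon}")
    # Assumption: digits beyond the fifth decimal are dropped, not rounded.
    return (
        float(lat.quantize(FIVE_DIGITS, rounding=ROUND_DOWN)),
        float(lon.quantize(FIVE_DIGITS, rounding=ROUND_DOWN)),
    )


def to_csv_row(fields):
    """Write one CSV row; fields containing a comma get double quotes."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerow(fields)
    return buffer.getvalue()


row = to_csv_row(["70.31311, 150.1", "-90.0, 180.0"])
```

Writing the file through a CSV library rather than by joining strings by hand avoids the comma inside each coordinate being mistaken for a column separator.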

Character sequence

Use the Character sequence encoding type to synthesize short strings with a consistent pattern, such as phone numbers, license plate numbers, company IDs, transaction IDs, and social security IDs.

Text

Use the Text encoding type to synthesize unstructured natural language texts up to 1,000 characters long.

You can use this encoding type to generate realistic, representative, and anonymous financial transaction texts, short user feedback, medical assessments, PII fields, etc. As the resulting synthetic texts are representative of the terms, tokens, and their co-occurrence in the original data, they can be confidently used in analytics and machine learning use cases, such as sentiment analysis and named-entity recognition. Even though they might look noisy and not very human-readable, they will work perfectly for these use cases.

❗️

Our privacy and accuracy tests cannot detect potential leakages of protected rare categories or measure how representative the resulting synthetic texts are.

Our text synthesis model is language-agnostic and doesn’t contain the biases of some pre-trained models—any content is solely learned from the original training data. This means that it can process any language, vernacular, and slang present in the original data.

The amount of data required to produce usable results depends on the diversity of the original texts' vocabulary, categories, names, etc. As a rule of thumb, the more structure there is, the fewer samples are needed.

The synthetic texts are generated in a context-aware manner: the messages from a teenager differ from those of an 85-year-old grandmother, for instance. By considering the other attributes of a synthetic subject’s profile, MOSTLY AI is capable of synthesizing appropriate natural language texts for each of them.

Below, you can find two examples. The first demonstrates MOSTLY AI’s ability to synthesize entirely new names from a multilingual dataset. The second shows the result of synthesizing Tripadvisor reviews: the resulting texts accurately retain the context of the establishment they discuss (Restaurant or Hotel) and the synthesized rating.

Multilingual names dataset


Original
    Nationality     Name
 1: Czech           Svoboda
 2: Greek           Chrysanthopoulos
 3: Spanish         Ventura
 4: Russian         Gagarin
 5: Japanese        Yokoyama
 6: English         Parsons
 7: Spanish         Ruiz
 8: Russian         Chekhov
 9: English         Blake
10: English         Wigley
Synthetic
    Nationality     Name
 1: English         Olsewood
 2: German          Kort
 3: Japanese        Misaghi
 4: English         Roger
 5: Russian         Lusov
 6: Russian         Zhuszenko
 7: Japanese        Noraghi
 8: English         Dalman
 9: Russian         Michov
10: Polish          Poskan
11: Arabic          Shaif

Tripadvisor reviews


Original
   Establishment    Rating  Review
1: Restaurant       6       Not bad, great interior but let down by
                            unimaginative food. Perfectly good for a
                            quick lunch or drink though, good ales
2: Hotel            2       Awful!!!! stunk of smoke guttering outside
                            window ledge filled with cigarette ends and
                            bottles. NOISY air con unit in room, husband
                            had no sleep.
3: Restaurant       8       Helpful staff, pleasant enough with quick
                            service. Sat at bar by revolving food server.
                            Everything seemed nice and fresh. Good value.
4: Hotel            8       We stayed in a standard room at the hotel.
                            The room was adequate, though a bit short on
                            cupboard drawer space.
5: Hotel            4       Expected much more from here and they just
                            didn't deliver, for the price of the room it
                            was no different than any of the other cheaper
                            aparthotels.
Synthetic
   Establishment    Rating  Review
1: Restaurant       8       My only complaint are the portion sizes. Lovely
                            restaurant with good food, though.
2: Restaurant       8       I'm a year-round regular. Service is really
                            friendly. The starters are OK and the seafood
                            buffet is amazing and tasty. Overall a nice menu
                            throughout and our children love it.
3: Hotel            8       This is a fantastic hotel. Great food but few
                            options, a brilliant room and spent an excellent
                            time. Very clean environment and a high level of
                            service.
4: Hotel            2       I booked an offer for a spa day. The food was
                            below-average, the room was dated, smelled of
                            fried fish, and the staff has an attitude.
5: Hotel            2       We've been here before. But for £70, the rooms
                            are still poor and glamourless. We spent a few
                            days with 5 people and 4 were not impressed.