Auto-detect
Use this setting to let MOSTLY AI decide the appropriate encoding type when the synthesization job starts. You can look up which encoding type MOSTLY AI has chosen while the job is running.
Categorical
A categorical variable has a fixed set of possible values that are already present in the input data. There are two rare category protection methods available with which you can mask rare categories. Select the method that’s best suited for your use case.
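MOSTLY AI’s own protection methods aren’t spelled out here, but the general idea behind rare category protection can be sketched as follows. This is an illustrative example only; the `min_count` threshold and the `_RARE_` token are hypothetical, not MOSTLY AI parameters.

```python
from collections import Counter

def mask_rare_categories(values, min_count=5, rare_token="_RARE_"):
    """Replace categories that occur fewer than min_count times
    with a shared token, so no rare value appears verbatim."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else rare_token for v in values]

# "taupe" appears only twice, so it is masked; the common values survive.
colors = ["red"] * 10 + ["blue"] * 8 + ["taupe"] * 2
masked = mask_rare_categories(colors, min_count=5)
```

Masking rare values before (or during) training matters because a category that occurs only once or twice can identify an individual subject.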
Datetime
Datetime refers to values that contain a date part and a time part. This encoding type enables MOSTLY AI to synthesize them and generate valid and statistically representative dates and times. A number of common date and time formats are supported.
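Before selecting the Datetime encoding type, it can help to verify that a column parses consistently as datetimes. A minimal check using the Python standard library, with an assumed (illustrative) format string:

```python
from datetime import datetime

def parses_as_datetime(values, fmt="%Y-%m-%d %H:%M:%S"):
    """Return True if every value matches the given datetime format."""
    try:
        for v in values:
            datetime.strptime(v, fmt)
        return True
    except ValueError:
        return False

# Both rows match the format, so the column parses cleanly.
ok = parses_as_datetime(["2021-03-01 09:30:00", "2021-03-02 17:45:12"])
```

The format string here is just one example; it does not enumerate the formats MOSTLY AI accepts.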
Numeric
Use the Numeric encoding type to synthesize numerical values that may vary, such as weight and height.
ITT
ITT, or Inter-Transaction Time, is an encoding type that models the time interval between two subsequent events in the synthetic dataset. With this encoding type, the intervals between events are synthesized with high accuracy, while the absolute dates become less accurate.
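The quantity an ITT encoding models can be illustrated with a short sketch: the gaps between consecutive event timestamps, rather than the timestamps themselves. This is a conceptual example, not MOSTLY AI’s implementation.

```python
from datetime import datetime

def inter_transaction_times(timestamps):
    """Return the gaps in seconds between consecutive events,
    i.e., the quantity an inter-transaction-time encoding models."""
    ts = sorted(datetime.fromisoformat(t) for t in timestamps)
    return [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]

# Three events 5 and 15 minutes apart yield gaps of 300 s and 900 s.
gaps = inter_transaction_times([
    "2021-05-01T10:00:00",
    "2021-05-01T10:05:00",
    "2021-05-01T10:20:00",
])
```

Modeling these gaps directly is what makes the synthetic inter-event times accurate, at the cost of drift in the absolute dates.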
Latitude, Longitude
Use the Latitude, Longitude encoding type to synthesize geolocation coordinates. MOSTLY AI requires a geolocation coordinate to be encoded in a single field with the latitude and longitude as comma-separated values. The latitude must be on the comma’s left side and the longitude on the right. The values must be in decimal degrees format, ranging from -90 to 90 for latitude and from -180 to 180 for longitude. The table below shows a use case with three geolocation columns.
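Preparing a column in the required single-field format can be sketched as follows. The function name and range checks are illustrative; only the comma-separated, latitude-first, decimal-degrees format comes from the requirement above.

```python
def encode_geolocation(lat, lon):
    """Combine latitude and longitude into a single comma-separated
    field (latitude first), validating decimal-degree ranges."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError("latitude must be between -90 and 90")
    if not -180.0 <= lon <= 180.0:
        raise ValueError("longitude must be between -180 and 180")
    return f"{lat},{lon}"

# A coordinate in Vienna encoded as one field.
field = encode_geolocation(48.2082, 16.3738)
```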
Character sequence
Use the Character sequence encoding type to synthesize short strings with a consistent pattern, such as phone numbers, license plate numbers, company IDs, transaction IDs, and social security numbers.
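What “a consistent pattern” means can be made concrete with a small sketch that maps each character to a class (digit, letter, or literal) and checks that all values share the same per-position structure. This is an illustrative check, not part of MOSTLY AI’s API.

```python
def char_pattern(value):
    """Map each character to a class: digit -> 'd', letter -> 'a',
    anything else kept literally (e.g. 'AB-1234' -> 'aa-dddd')."""
    return "".join(
        "d" if c.isdigit() else "a" if c.isalpha() else c for c in value
    )

def consistent_pattern(values):
    """True if every value shares the same character-class pattern."""
    return len({char_pattern(v) for v in values}) == 1

# Two license-plate-style strings with identical structure.
ok = consistent_pattern(["AB-1234", "XY-9876"])
```

Columns that pass a check like this are good candidates for the Character sequence encoding type.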
Text
Use the Text encoding type to synthesize unstructured natural language texts up to 1000 characters long. You can use this encoding type to generate realistic, representative, and anonymous financial transaction texts, short user feedback, medical assessments, PII fields, etc. As the resulting synthetic texts are representative of the terms, tokens, and their co-occurrence in the original data, they can be confidently used in analytics and machine learning use cases, such as sentiment analysis and named-entity recognition. Even though they might look noisy and not very human-readable, they work well for these use cases.
Our text synthesization model is language-agnostic and doesn’t contain the biases of some pre-trained models: any content is solely learned from the original training data. This means that it can process any language, vernacular, and slang present in the original data. The amount of data required to produce usable results depends on the diversity of the original texts' vocabulary, categories, names, etc. As a rule of thumb, the more structure there is, the fewer samples are needed. The synthetic texts are generated in a context-aware manner: the messages from a teenager differ from those of an 85-year-old grandmother, for instance. By considering the other attributes of a synthetic subject’s profile, MOSTLY AI is capable of synthesizing appropriate natural language texts for each of them. Below, you can find two examples. The first example demonstrates MOSTLY AI’s ability to synthesize entirely new names from a multilingual dataset. The second example shows the result of synthesizing Tripadvisor reviews, where the resulting texts accurately retain the context of the establishment they discuss.
Multilingual names dataset
(Table comparing original and synthetic names not shown.)
Tripadvisor reviews
(Tables of original and synthetic reviews not shown.)