MOSTLY AI supports two data structures: a single subject table, and a dataset that pairs a subject table with an event table. The guidelines below help you convert your dataset into one of these formats so that synthetic data generation runs smoothly.

CSV files uploaded through MOSTLY AI’s user interface can have a maximum file size of 5 GB.

Formatting your dataset

Single subject table

  • We recommend that subject tables have more than 5000 subjects.
    The minimum size is 100 subjects. However, the more subjects there are available, the better the training algorithm can generalize their features, which results in a decreased privacy risk.

  • The maximum number of subjects and columns is determined by your license.

  • Each subject must refer to a distinct real world entity.

  • Each row describes one subject.

  • Each row can be treated independently.
    The rows' order carries no information, and the contents of one row do not affect other rows.

  • Please ensure that the column names do not contain any privacy-sensitive information.
    Avoid column names such as vendor_a_purchases, vendor_b_purchases, etc. Not only would the vendor names already appear in the metadata, but they could also slip through rare category protection (e.g., there’s a vendor_a column, but this vendor only appears five times in the whole dataset). You can solve this problem by having a single vendor column that contains the vendor names.

A subject is an entity or individual whose privacy you are going to protect. A subject table, therefore, contains records that describe these subjects.

Each row in a subject table describes the profile of a unique subject. Its fields describe that subject’s attributes, such as name, gender, height, place of residence, or income.

In practice, two or more real-world entities may have identical features when they’re described as subjects in the subject table. Conversely, a customer can make several online purchases using different accounts or without logging in to their account. This results in a subject table that contains multiple records with different identifiers for the same person.

MOSTLY AI delivers the most accurate results if the subject table reflects the real world as closely as possible. If real-world entities share identical properties, then this should be left as such. But if multiple records contain the same contact details, it’s plausible that it’s the same person and could be considered for merging.
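One way to surface records that might belong to the same person is to group them by a normalized contact field. The sketch below is only an illustration: the column names (id, email) and the helper find_merge_candidates are hypothetical, and whether two flagged records really describe the same person remains a judgment call.

```python
from collections import defaultdict

def find_merge_candidates(rows, key="email"):
    """Group subject IDs by a normalized contact field.

    Returns only the groups that contain more than one ID.
    `key` is a hypothetical column name; adapt it to your data.
    """
    groups = defaultdict(list)
    for row in rows:
        contact = (row.get(key) or "").strip().lower()
        if contact:  # records without contact details are never grouped
            groups[contact].append(row["id"])
    return {contact: ids for contact, ids in groups.items() if len(ids) > 1}

subjects = [
    {"id": "1", "email": "alice@example.com"},
    {"id": "2", "email": "bob@example.com"},
    {"id": "3", "email": " Alice@Example.com "},
]
candidates = find_merge_candidates(subjects)
```

Records that share identical non-contact properties (for example, the same gender and height) are deliberately not flagged, since distinct real-world entities may legitimately look alike.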

Below you’ll find an example of a subject table. You can use it as a guideline to create your own.

example subject table


Subject table - event table dataset

  • This structure is ideal for processing sequential and time-series data.

  • It consists of two tables: a subject table that satisfies the requirements of the Single subject table section above, and an event table.

  • Each record in the subject table must have a unique ID number (primary key).

  • Each record in the event table must contain the ID of the subject that it’s linked to.

  • Avoid unnecessarily large numbers of records per subject.

MOSTLY AI can process sequential and time-series data if structured as a subject table - event table dataset. Examples of sequential data are insurance claims and patient health records, whereas time-series data describe your subjects' activities over time, such as online shopping journeys, purchase histories, or financial transactions.

Events are processed as properties of subjects: they cannot exist without subjects, but subjects can have zero events. This relationship guarantees the subjects' privacy during synthesization, which is why these types of data need to be formatted into a subject table and a separate event table.

The image below shows the columns that these tables must have to establish this relationship. Each record in your event table must have a field that specifies which subject it belongs to.

example subject table

MOSTLY AI automatically links two tables if the subject table contains a column called id and the second table contains _id in the name of a column (for instance, players_id).
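The naming rule above can be checked before upload with a few lines of Python. This is a minimal sketch that mirrors the rule as it is worded here, not MOSTLY AI's actual implementation; the helper name find_link_columns is hypothetical.

```python
def find_link_columns(subject_header, event_header):
    """Return the event-table columns that would satisfy the naming rule.

    The rule as stated: the subject table has a column called `id`,
    and the second table has `_id` in the name of a column.
    """
    if "id" not in subject_header:
        return []
    return [name for name in event_header if "_id" in name]

links = find_link_columns(
    ["id", "firstName", "lastName"],
    ["users_id", "event_time", "event_type"],
)
```

An empty result suggests that a column needs to be renamed before the two tables can be linked automatically.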

Below you’ll find an example of a basic customer journey dataset with two subjects. Alice Doe made a purchase after visiting the store twice, and Bob Joe was flagged as a churned customer after he no longer showed up for five days.

Subject table

id        firstName     lastName
1         Alice         Doe
2         Bob           Joe

Event table

users_id  event_time  event_type
1         2020-04-01  visit
1         2020-04-03  visit
1         2020-04-05  purchase
2         2020-03-13  visit
2         2020-03-18  churn

If you have a single table with event data, please split it into a subject table and an event table accordingly.
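A split of this kind could look like the following sketch, which uses only the Python standard library. It assumes a combined table whose columns match the example above; the lists SUBJECT_COLUMNS and EVENT_COLUMNS and the helper split_table are hypothetical and need to be adapted to your own data.

```python
import csv
import io

# Assumption: these columns hold one value per subject / one value per event.
SUBJECT_COLUMNS = ["firstName", "lastName"]
EVENT_COLUMNS = ["event_time", "event_type"]

def split_table(rows, id_column="id", link_column="users_id"):
    """Split combined rows into unique subject rows and linked event rows."""
    subjects = {}
    events = []
    for row in rows:
        subject_id = row[id_column]
        # Keep the first occurrence of each subject only.
        if subject_id not in subjects:
            subjects[subject_id] = {"id": subject_id, **{c: row[c] for c in SUBJECT_COLUMNS}}
        # Every input row becomes one event that points back to its subject.
        events.append({link_column: subject_id, **{c: row[c] for c in EVENT_COLUMNS}})
    return list(subjects.values()), events

combined = io.StringIO(
    "id,firstName,lastName,event_time,event_type\n"
    "1,Alice,Doe,2020-04-01,visit\n"
    "1,Alice,Doe,2020-04-03,visit\n"
    "2,Bob,Joe,2020-03-13,visit\n"
)
subject_rows, event_rows = split_table(csv.DictReader(combined))
```

The sketch keeps the subject attributes from the first row it sees for each ID, so it is worth checking beforehand that these attributes do not vary between rows of the same subject.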


General recommendations

We recommend splitting the contents of your fields by their features whenever possible.

For instance, the column street address may contain addresses that all have the same form, such as 123 example street. In this case, you can split them into the street name and the street number. MOSTLY AI can then treat street names as a text column and the numbers as numerical variables, which results in improved accuracy of the generated synthetic data.
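As an illustration of such a split, the sketch below separates a leading house number from the street name with a regular expression. It assumes that every address starts with the number, as in 123 example street; the pattern and the helper split_address are hypothetical and will not fit every address format.

```python
import re

# Assumption: the number comes first, followed by the street name.
ADDRESS_PATTERN = re.compile(r"^\s*(\d+)\s+(.+?)\s*$")

def split_address(address):
    """Return (street_name, street_number); the number is '' if none is found."""
    match = ADDRESS_PATTERN.match(address)
    if match is None:
        # Leave the number empty so it is encoded as a missing value.
        return address.strip(), ""
    number, street = match.groups()
    return street, number

street, number = split_address("123 example street")
```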

Content rules

To successfully synthesize your dataset, the content must be encoded in UTF-8 and adhere to the following rules:

Row

  • each row in the file must contain the same number of cells.

Header row

  • the first row must contain the column names

  • each column name in a table must be unique

  • these names must not contain commas (,) or dollar signs ($)
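The row and header rules above lend themselves to an automated check before upload. The following sketch uses Python's csv module; the helper check_csv and its messages are hypothetical, and it covers only the rules listed so far.

```python
import csv
import io

def check_csv(text):
    """Return a list of violations of the row and header rules in CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    header, problems = rows[0], []
    if len(set(header)) != len(header):
        problems.append("duplicate column names")
    for name in header:
        if "," in name or "$" in name:
            problems.append(f"forbidden character in column name: {name!r}")
    # Record numbers start at 2 because record 1 is the header row.
    for record_number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"row {record_number} has {len(row)} cells, expected {len(header)}"
            )
    return problems

problems = check_csv("id,id\n1\n")
```

An empty list means that none of the checked rules were violated. To check a file on disk, read it with encoding="utf-8" and pass its contents to the function.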

Numerical values

  • must have a . as decimal separator

  • must not have a thousands separator

  • must have missing values encoded as empty strings
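Numbers exported with a different locale can be normalized to these rules before upload. The sketch below assumes that you know which characters the source uses as decimal and thousands separators; the helper normalize_number and its defaults are hypothetical.

```python
def normalize_number(raw, decimal=",", thousands="."):
    """Rewrite a number so that it uses '.' as decimal separator and no thousands separator.

    Blank input is returned as an empty string, which encodes a missing value.
    """
    text = raw.strip()
    if text == "":
        return ""
    # Drop the thousands separator first, then convert the decimal separator.
    text = text.replace(thousands, "").replace(decimal, ".")
    float(text)  # raises ValueError if the result is still not a number
    return text

european = normalize_number("1.234,50")
american = normalize_number("1,234.50", decimal=".", thousands=",")
```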

Date and time values

  • must be encoded in one of the below formats

  • must have missing values encoded as empty strings

Type                        Format                    Example
Date                        yyyy-MM-dd                2020-02-08
Datetime with hours         yyyy-MM-dd HH             2020-02-08 09
                            yyyy-MM-ddTHH             2020-02-08T09
                            yyyy-MM-ddTHHZ            2020-02-08T09Z
Datetime with minutes       yyyy-MM-dd HH:mm          2020-02-08 09:30
                            yyyy-MM-ddTHH:mm          2020-02-08T09:30
                            yyyy-MM-ddTHH:mmZ         2020-02-08T09:30Z
Datetime with seconds       yyyy-MM-dd HH:mm:ss       2020-02-08 09:30:26
                            yyyy-MM-ddTHH:mm:ss       2020-02-08T09:30:26
                            yyyy-MM-ddTHH:mm:ssZ      2020-02-08T09:30:26Z
Datetime with milliseconds  yyyy-MM-dd HH:mm:ss.SSS   2020-02-08 09:30:26.123
                            yyyy-MM-ddTHH:mm:ss.SSS   2020-02-08T09:30:26.123
                            yyyy-MM-ddTHH:mm:ss.SSSZ  2020-02-08T09:30:26.123Z

The following formats are not supported:

  • Any format with a week number.
    Example: 2020-W06-5 (Week 6, Day 5 of 2020)

  • Any format with ordinal dates.
    Example: 2020-039 (Day 39 of 2020)

  • Formats with a time zone offset that don’t contain a Z.
    Example: 2020-02-08 09+07:00

  • Short formats that omit separators such as - and :.
    Example: 20200208T0930

  • Formats that separate seconds and milliseconds with a comma.
    Example: 2020-02-08T09:30:26,123

  • Formats that separate seconds and milliseconds with a colon.
    Example: 2020-02-08 09:30:26:123

  • Date only formats that have a time zone component.
    Example: 2020-02-08Z
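Values in several of the unsupported formats can be converted with Python's datetime module before upload. The sketch below is an illustration under two assumptions: the trailing Z in the supported formats denotes UTC, and the source format of each column is known in advance. The helper convert and the constants DATE and DATETIME are hypothetical.

```python
from datetime import datetime, timezone

DATE = "%Y-%m-%d"               # yyyy-MM-dd
DATETIME = "%Y-%m-%dT%H:%M:%S"  # yyyy-MM-ddTHH:mm:ss

def convert(value, source_format, target_format=DATETIME):
    """Parse a value in a known source format and emit a supported format."""
    parsed = datetime.strptime(value, source_format)
    if parsed.tzinfo is not None:
        # Assumption: Z means UTC, so shift the value to UTC and append Z.
        parsed = parsed.astimezone(timezone.utc)
        target_format += "Z"
    return parsed.strftime(target_format)

week = convert("2020-W06-5", "%G-W%V-%u", DATE)           # ISO week date
ordinal = convert("2020-039", "%Y-%j", DATE)              # ordinal date
offset = convert("2020-02-08 09+07:00", "%Y-%m-%d %H%z")  # time zone offset
compact = convert("20200208T0930", "%Y%m%dT%H%M")         # no separators
# A comma before the milliseconds only needs a character swap.
millis = "2020-02-08T09:30:26,123".replace(",", ".")
```

Converting an offset to UTC changes the wall-clock time, and possibly the date, of the value. If the local time matters for your analysis, dropping the offset instead may be the better choice.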

Alphanumeric entries (text, categories, strings)

  • entries containing commas, line breaks, or spaces at the beginning or end must be quoted with double quotes

    "this is, one column"
    "this is \n two lines"
    " space at the beginning and end "

  • double quotes in entries must be escaped by doubling them

    "this does contain ""quoted text"""
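The quoting rules can be applied field by field when you write a CSV file yourself. The sketch below implements them directly rather than relying on a CSV library's defaults, because minimal quoting modes typically do not quote leading or trailing spaces. The helper quote_field is hypothetical.

```python
def quote_field(value):
    """Quote a text entry only when the rules above require it."""
    needs_quotes = (
        any(ch in value for ch in ',"\n\r')  # comma, quote, or line break
        or value != value.strip()            # leading or trailing whitespace
    )
    if not needs_quotes:
        return value
    # Escape embedded double quotes by doubling them, then wrap the entry.
    return '"' + value.replace('"', '""') + '"'

line = ",".join(quote_field(v) for v in ["1", "this is, one column", " padded "])
```

Empty strings are returned unquoted, which keeps missing values encoded as empty cells.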