CSV requirements
In general, we recommend Parquet files, because they are compressed and carry explicit column data types. Data can also be provided as CSV files, either uncompressed or compressed as `.gz`, provided they adhere to the following rules. They must be encoded in UTF-8, use commas (`,`), semicolons (`;`), or tabs (`\t`) as column separators, and start with a single header line containing the column names.
1. Header row
- The first row must contain the column names.
- Each column name in a table must be unique.
2. Rows
- Each row in the file must contain the same number of cells.
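The header and row rules above can be checked before upload. The following is a minimal sketch using only the Python standard library; the function name `check_csv_structure` and its `delimiter` parameter are hypothetical helpers for illustration, not part of any product API.

```python
import csv
import gzip


def check_csv_structure(path, delimiter=","):
    """Return a list of problems found in the header and row layout."""
    # Assumption: compressed files are recognised by their .gz suffix.
    opener = gzip.open if str(path).endswith(".gz") else open
    problems = []
    # newline="" lets the csv module handle line breaks inside quoted entries.
    with opener(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, None)
        if header is None:
            return ["file is empty, no header row found"]
        if len(set(header)) != len(header):
            problems.append("column names are not unique")
        # Rows are counted as records, with the header being row 1.
        for row_number, row in enumerate(reader, start=2):
            if len(row) != len(header):
                problems.append(
                    f"row {row_number} has {len(row)} cells, expected {len(header)}"
                )
    return problems
```

An empty list means the file passed both checks. A file that is not valid UTF-8 raises `UnicodeDecodeError` while reading, which also signals a violation of the encoding rule.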
3. Alphanumeric entries (text, categories, strings)
- Entries containing the column separator, line breaks, or spaces at the beginning or end must be enclosed in double quotes.
"this is, one column"
"this is \n two lines"
" space at the beginning and end "
- Double quotes inside an entry must be escaped by doubling them.
"this does contain ""quoted text"""
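One way to produce text entries that follow these quoting rules is to let a CSV library do the escaping. The sketch below uses Python's standard `csv` module and assumes a comma separator; the column name `text` is made up for the example. `csv.QUOTE_ALL` is chosen because the default quoting mode would leave entries with leading or trailing spaces unquoted.

```python
import csv
import io

entries = [
    "this is, one column",
    "this is \n two lines",
    " space at the beginning and end ",
    'this does contain "quoted text"',
]

buffer = io.StringIO()
# QUOTE_ALL wraps every entry in double quotes; embedded double quotes
# are doubled automatically (doublequote=True is the default).
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerow(["text"])
for entry in entries:
    writer.writerow([entry])
csv_text = buffer.getvalue()

# Reading the text back returns the original entries unchanged.
rows = list(csv.reader(io.StringIO(csv_text)))
```

Quoting every entry is slightly more verbose than strictly necessary, but it avoids having to decide per entry whether quotes are required.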
4. Datetime values
- must be encoded in one of the formats below
- missing values must appear as empty strings
Type | Format | Example |
---|---|---|
Date | `yyyy-MM-dd` | `2020-02-08` |
Datetime with hours | `yyyy-MM-dd HH`, `yyyy-MM-ddTHH`, `yyyy-MM-ddTHHZ` | `2020-02-08 09`, `2020-02-08T09`, `2020-02-08T09Z` |
Datetime with minutes | `yyyy-MM-dd HH:mm`, `yyyy-MM-ddTHH:mm`, `yyyy-MM-ddTHH:mmZ` | `2020-02-08 09:30`, `2020-02-08T09:30`, `2020-02-08T09:30Z` |
Datetime with seconds | `yyyy-MM-dd HH:mm:ss`, `yyyy-MM-ddTHH:mm:ss`, `yyyy-MM-ddTHH:mm:ssZ` | `2020-02-08 09:30:26`, `2020-02-08T09:30:26`, `2020-02-08T09:30:26Z` |
Datetime with milliseconds | `yyyy-MM-dd HH:mm:ss.SSS`, `yyyy-MM-ddTHH:mm:ss.SSS`, `yyyy-MM-ddTHH:mm:ss.SSSZ` | `2020-02-08 09:30:26.123`, `2020-02-08T09:30:26.123`, `2020-02-08T09:30:26.123Z` |
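To check datetime values against the table, the formats can be translated into `strptime` patterns. This is a rough sketch: `parse_datetime` is a hypothetical helper, the trailing `Z` is matched as a literal character as in the examples, and the result is a naive `datetime` without time zone information. Note that `strptime` is somewhat more lenient than the table (for example, `%f` accepts one to six fractional digits).

```python
from datetime import datetime

# Time-of-day variants: hours, minutes, seconds, fractional seconds.
TIME_PARTS = ["%H", "%H:%M", "%H:%M:%S", "%H:%M:%S.%f"]

# Date only, plus each time variant with a space, a "T", or "T...Z".
PATTERNS = ["%Y-%m-%d"]
for time_part in TIME_PARTS:
    PATTERNS += [
        f"%Y-%m-%d {time_part}",
        f"%Y-%m-%dT{time_part}",
        f"%Y-%m-%dT{time_part}Z",
    ]


def parse_datetime(value):
    """Return a datetime, None for a missing value, or raise ValueError."""
    if value == "":
        return None  # missing values appear as empty strings
    for pattern in PATTERNS:
        try:
            return datetime.strptime(value, pattern)
        except ValueError:
            continue
    raise ValueError(f"unsupported datetime format: {value!r}")
```

Trying the patterns in order keeps the helper short; for large files, detecting the pattern once per column and reusing it would likely be faster.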
5. Numerical values
- must use `.` as the decimal separator
- must not have a thousands separator
- must have missing values encoded as empty strings
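The numeric rules can be sketched as a small validator. The helper `parse_number` and the regular expression are illustrative assumptions: the pattern accepts an optional sign, digits, and an optional `.` followed by digits, which is stricter than Python's own `float()` (that function also accepts forms such as `1_000`). Whether other notations, such as exponents, are accepted is not stated here, so the sketch rejects them.

```python
import re

# Optional sign, digits, optional decimal part separated by ".".
NUMBER_RE = re.compile(r"[+-]?\d+(\.\d+)?")


def parse_number(value):
    """Return a float, None for a missing value, or raise ValueError."""
    if value == "":
        return None  # missing values are encoded as empty strings
    if not NUMBER_RE.fullmatch(value):
        raise ValueError(f"not a valid numerical value: {value!r}")
    return float(value)
```

With this check, `1234.5` is accepted, while `1,234.5` (thousands separator) and `1234,5` (comma as decimal separator) are rejected.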