Formatting Data Files for Greenplum LOAD/UNLOAD process
Formatting Data Files
When you use the various Greenplum tools for loading and unloading data, you need to specify how your data is formatted. COPY, CREATE EXTERNAL TABLE and gpload have clauses that allow you to specify how your data is formatted. Data can be in either delimited text (TEXT) or comma separated values (CSV) format. External data must be formatted correctly in order to be read by Greenplum Database. This section explains the format of data files expected by Greenplum Database.
Formatting Rows
Formatting Columns
Representing NULL Values
Escaping
Character Encoding
1. Formatting Rows
Greenplum Database expects rows of data to be separated by the LF character (Line feed, 0x0A), CR (Carriage return, 0x0D), or CR followed by LF (CR+LF, 0x0D 0x0A). LF is the standard newline representation on UNIX or UNIX-like operating systems. Other operating systems (such as Windows or Mac OS 9) may use CR individually, or CR+LF. All of these representations of a newline are supported by Greenplum Database as a row delimiter.
2. Formatting Columns
The default column or field delimiter is the horizontal TAB character (0x09) for text files and the comma character (0x2C) for CSV files. However, it is possible to declare another single character delimiter using the DELIMITER clause of COPY, CREATE EXTERNAL TABLE or gpload when defining your data format. The delimiter character must only appear between any two data value fields. Do not place a delimiter at the beginning or end of a row. For example, if using the pipe character ( | ) as your delimiter:
data value 1|data value 2|data value 3
3. Representing NULL Values
NULL is the value used to represent an unknown piece of data in a column or field. Within your data files you can designate a string to represent null values. The default string is \N (backslash-N) in TEXT mode, or an empty value with no quotations in CSV mode. You can also declare a different string using the NULL clause of COPY, CREATE EXTERNAL TABLE or gpload when defining your data format. For example, you might prefer an empty string for cases where you do not want to distinguish nulls from empty strings. When using the Greenplum Database loading tools, any data item that matches the designated null string will be considered a null value.
4. Escaping
The data file has two reserved characters that have special meaning to Greenplum Database:
•The designated delimiter character, which is used to separate columns or fields in the data file.
•The newline character used to designate a new row in the data file.
If your data contains either of these characters, you must escape the character so Greenplum treats it as data and not as a field separator or new row. By default, the escape character is a \ (backslash) for text-formatted files and a double quote (") for csv-formatted files.
Escaping in Text Formatted Files
By default, the escape character is a \ (backslash) for text-formatted files. If you want to use a different escape character, use the ESCAPE clause of COPY, CREATE EXTERNAL TABLE or gpload to declare a different escape character. In cases where your selected escape character is present in your data, you can use it to escape itself.
For example, suppose you have a table with three columns and you want to load the following three fields:
•backslash = \
•vertical bar = |
•exclamation point = !
Your designated delimiter character is | (pipe character), and your designated escape character is \ (backslash). The formatted row in your data file would look like this:
backslash = \\ | vertical bar = \| | exclamation point = !
Notice how the backslash character that is part of the data is escaped with another backslash character, and the pipe character that is part of the data is escaped with a backslash character.
The escape character can also be used to escape octal and hexidecimal sequences. When used in this way, the escaped value will get converted to the equivalent character when loaded into Greenplum Database. For example, to load the ampersand character (&), you could use the escape character to escape its equivalent hexidecimal (\0x26) or octal (\046) representation.
If there is no need to escape the data in text-formatted files, you can disable escaping using the ESCAPE clause of COPY, CREATE EXTERNAL TABLE or gpload as follows:
ESCAPE 'OFF'
This is useful for input data that contains a lot of backslash characters within the data itself (such as web log data).
Escaping in CSV Formatted Files
By default, the escape character is a " (double quote) for CSV-formatted files. If you want to use a different escape character, use the ESCAPE clause of COPY, CREATE EXTERNAL TABLE or gpload to declare a different escape character. In cases where your selected escape character is present in your data, you can use it to escape itself.
For example, suppose you have a table with three columns and you want to load the following three fields:
•Free trip to A,B
•5.89
•Special rate "1.79"
Your designated delimiter character is , (comma), and your designated escape character is " (double quote). The formatted row in your data file would look like this:
"Free trip to A,B","5.89","Special rate ""1.79"""
Notice how that for the comma character that is part of the data, the entire data value is enclosed in double quotes. Also notice how the double quotes that are part of the data are also escaped with a double quote even though the field value is enclosed in double quotes.
Embedding the entire field inside a set of double quotes also guarantees preservation of leading and trailing whitespace characters:
"Free trip to A,B ","5.89 ","Special rate ""1.79"" "
Note: In CSV mode, all characters are significant. A quoted value surrounded by white space, or any characters other than DELIMITER, will include those characters. This can cause errors if you import data from a system that pads CSV lines with white space out to some fixed width. If such a situation arises you might need to preprocess the CSV file to remove the trailing white space, before importing the data into Greenplum Database.
Character Encoding
A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of numbers or octets, in order to facilitate the transmission and storage of data. The character set support in Greenplum Database allows you to store text in a variety of character sets, including single-byte character sets such as the ISO 8859 series and multiple-byte character sets such as EUC (Extended UNIX Code), UTF-8, and Mule internal code. All supported character sets can be used transparently by clients, but a few are not supported for use within the server (that is, as a server-side encoding).
Data files must be in a character encoding recognized by Greenplum Database. Data files that contain invalid or unsupported encoding sequences will encounter errors when loading into Greenplum Database.
Note: On data files generated on a Microsoft Windows operating system, try running the dos2unix system command to remove any Windows-only characters prior to loading into Greenplum Database.