GPLOAD utility in Greenplum

GPLOAD

gpload is a data loading utility that acts as an interface to Greenplum Database's external table parallel loading feature. The Greenplum EXTERNAL TABLE feature allows us to define network data sources as tables that we can query to speed up the data loading process. Using a load specification defined in a YAML-formatted control file, gpload executes a load by invoking gpfdist (Greenplum's parallel file distribution program), creating an external table definition based on the source data defined, and executing an INSERT, UPDATE, or MERGE operation to load the source data into the target table in the database.

The gpload program processes the control file document in order and uses indentation (spaces) to determine the document hierarchy and the relationships of the sections to one another. The use of white space is significant. White space should not be used simply for formatting purposes, and tabs should not be used at all.

The basic structure of a load control file:

---
VERSION: 1.0.0.1
DATABASE: db_name
USER: db_username
HOST: master_hostname
PORT: master_port
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - hostname_or_ip
         PORT: http_port
       | PORT_RANGE: [start_port_range, end_port_range]
         FILE:
           - /path/to/input_file
         SSL: true | false
         CERTIFICATES_PATH: /path/to/certificates
    - COLUMNS:
       - field_name: data_type
    - TRANSFORM: 'transformation'
    - TRANSFORM_CONFIG: 'configuration-file-path'
    - MAX_LINE_LENGTH: integer
    - FORMAT: text | csv
    - DELIMITER: 'delimiter_character'
    - ESCAPE: 'escape_character' | 'OFF'
    - NULL_AS: 'null_string'
    - FORCE_NOT_NULL: true | false
    - QUOTE: 'csv_quote_character'
    - HEADER: true | false
    - ENCODING: database_encoding
    - ERROR_LIMIT: integer
    - ERROR_TABLE: schema.table_name
   OUTPUT:
    - TABLE: schema.table_name
    - MODE: insert | update | merge
    - MATCH_COLUMNS:
       - target_column_name
    - UPDATE_COLUMNS:
       - target_column_name
    - UPDATE_CONDITION: 'boolean_condition'
    - MAPPING:
        target_column_name: source_column_name | 'expression'
   PRELOAD:
    - TRUNCATE: true | false
    - REUSE_TABLES: true | false
   SQL:
    - BEFORE: "sql_command"
    - AFTER: "sql_command"

VERSION

Optional. The version of the gpload control file schema. The current version is 1.0.0.1. 

DATABASE

Optional. Specifies which database in Greenplum to connect to. If not specified, defaults to $PGDATABASE if set, or to the current system user name. You can also specify the database on the command line using the -d option.

USER

Optional. Specifies which database role to use to connect. If not specified, defaults to the current user or $PGUSER if set. You can also specify the database role on the command line using the -U option. If the user running gpload is not a Greenplum superuser, then the server configuration parameter gp_external_grant_privileges must be set to on in order for the load to be processed.

HOST

Optional. Specifies Greenplum master host name. If not specified, defaults to localhost or $PGHOST if set. You can also specify the master host name on the command line using the -h option.

PORT

Optional. Specifies Greenplum master port. If not specified, defaults to 5432 or $PGPORT if set. You can also specify the master port on the command line using the -p option.
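Putting the four connection parameters together, the top of a control file might look like the following sketch (the database, user, and host values are placeholders, not from the original example):

```yaml
---
VERSION: 1.0.0.1
DATABASE: sales_db    # falls back to $PGDATABASE if omitted
USER: gpadmin         # falls back to $PGUSER or the current user
HOST: mdw             # master host; falls back to $PGHOST or localhost
PORT: 5432            # master port; falls back to $PGPORT
```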

GPLOAD

Required. Begins the load specification section. A GPLOAD specification must have an INPUT and an OUTPUT section defined.

INPUT

Required. Defines the location and the format of the input data to be loaded. gpload will start one or more instances of the gpfdist file distribution program on the current host and create the required external table definition(s) in Greenplum Database that point to the source data. Note that the host from which you run gpload must be accessible over the network by all Greenplum hosts (master and segments).

SOURCE

Required. The SOURCE block of an INPUT specification defines the location of a source file. An INPUT section can have more than one SOURCE block defined. Each SOURCE block defined corresponds to one instance of the gpfdist file distribution program that will be started on the local machine. Each SOURCE block defined must have a FILE specification.

For more information about using the gpfdist parallel file server and single and multiple gpfdist instances, see the Greenplum Database Administrator Guide.

LOCAL_HOSTNAME

Optional. Specifies the host name or IP address of the local machine on which gpload is running. If this machine is configured with multiple network interface cards (NICs), you can specify the host name or IP of each individual NIC to allow network traffic to use all NICs simultaneously. The default is to use the local machine’s primary host name or IP only. 

PORT

Optional. Specifies the specific port number that the gpfdist file distribution program should use. You can also supply a PORT_RANGE to select an available port from the specified range. If both PORT and PORT_RANGE are defined, PORT takes precedence. If neither PORT nor PORT_RANGE is defined, the default is to select an available port between 8000 and 9000.

If multiple host names are declared in LOCAL_HOSTNAME, this port number is used for all hosts. This configuration is desired if you want to use all NICs to load the same file or set of files in a given directory location.
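As a sketch, a SOURCE block that spreads load traffic across two NICs of the ETL host on a single fixed port might look like this (the host names and file path are hypothetical):

```yaml
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - etl1-nic1
           - etl1-nic2
         PORT: 8081            # one gpfdist port shared by both interfaces
         FILE:
           - /data/staging/orders.dat
```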

PORT_RANGE

Optional. Can be used instead of PORT to supply a range of port numbers from which gpload can choose an available port for this instance of the gpfdist file distribution program.

FILE

Required. Specifies the location of a file, named pipe, or directory location on the local file system that contains data to be loaded. You can declare more than one file so long as the data is of the same format in all files specified. If the files are compressed using gzip or bzip2 (have a .gz or .bz2 file extension), the files will be uncompressed automatically (provided that gunzip or bunzip2 is in your path).

When specifying which source files to load, you can use the wildcard character (*) or other C-style pattern matching to denote multiple files. The files specified are assumed to be relative to the current directory from which gpload is executed (or you can declare an absolute path).
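For instance, a FILE list that picks up all gzipped daily extracts plus one file addressed relative to the working directory could look like this (the paths are illustrative):

```yaml
         FILE:
           - /data/daily/orders_*.txt.gz    # wildcard expands to all matching files
           - backfill/orders_extra.txt      # relative to the directory gpload runs from
```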

SSL

Optional. Specifies usage of SSL encryption. If SSL is set to true, gpload starts the gpfdist server with the --ssl option and uses the gpfdists protocol.

CERTIFICATES_PATH

Required when SSL is true; cannot be specified when SSL is false or unspecified. The location specified in CERTIFICATES_PATH must contain the following files:

• The server certificate file, server.crt

• The server private key file, server.key

• The trusted certificate authorities, root.crt

The root directory (/) cannot be specified as CERTIFICATES_PATH.
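A hedged sketch of an SSL-enabled SOURCE block, assuming the three required certificate files have been installed under a hypothetical /etc/gpfdist/certs directory:

```yaml
    - SOURCE:
         FILE:
           - /data/secure/customers.csv
         SSL: true
         CERTIFICATES_PATH: /etc/gpfdist/certs   # must contain server.crt, server.key, root.crt
```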

COLUMNS

Optional. Specifies the schema of the source data file(s) in the format of field_name: data_type. The DELIMITER character in the source file is what separates two data value fields (columns). A row is determined by a line feed character (0x0a). If the input COLUMNS are not specified, then the schema of the output TABLE is implied, meaning that the source data must have the same column order, number of columns, and data format as the target table.

The default source-to-target mapping is based on a match of column names as defined in this section and the column names in the target TABLE. This default mapping can be overridden using the MAPPING section.

TRANSFORM

Optional. Specifies the name of the input XML transformation passed to gpload. For more information about XML transformations, see the Greenplum Database Administrator Guide.

TRANSFORM_CONFIG

Optional. Specifies the location of the XML transformation configuration file that is specified in the TRANSFORM parameter, above.

MAX_LINE_LENGTH

Optional. An integer that specifies the maximum length of a line in the XML transformation data passed to gpload.

FORMAT

Optional. Specifies the format of the source data file(s): either plain text (TEXT) or comma separated values (CSV) format. Defaults to TEXT if not specified. For more information about the format of the source data, see the Greenplum Database Administrator Guide.

DELIMITER

Optional. Specifies a single ASCII character that separates columns within each row (line) of data. The default is a tab character in TEXT mode and a comma in CSV mode. You can also specify a non-printable ASCII character via an escape sequence using the Unicode representation of the ASCII character. For example, "\u001B" represents the escape character. The Unicode representation must be enclosed in double quotes ( " ) instead of single quotes ( ' ).
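For example, a text file whose columns are separated by the non-printable ASCII unit separator (0x1F) could be described as follows; note the double quotes around the Unicode escape:

```yaml
    - FORMAT: text
    - DELIMITER: "\u001F"   # unit separator; Unicode escapes require double quotes
```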

ESCAPE

Specifies the single character that is used for C escape sequences (such as \n, \t, \100, and so on) and for escaping data characters that might otherwise be taken as row or column delimiters. Make sure to choose an escape character that is not used anywhere in your actual column data. The default escape character is a \ (backslash) for text-formatted files and a " (double quote) for csv-formatted files; however, it is possible to specify another character to represent an escape. It is also possible to disable escaping in text-formatted files by specifying the value 'OFF' as the escape value. This is very useful for data such as text-formatted web log data that has many embedded backslashes that are not intended to be escapes.

NULL_AS

Optional. Specifies the string that represents a null value. The default is \N (backslash-N) in TEXT mode, and an empty value with no quotations in CSV mode. You might prefer an empty string even in TEXT mode for cases where you do not want to distinguish nulls from empty strings. Any source data item that matches this string will be considered a null value.

FORCE_NOT_NULL

Optional. In CSV mode, processes each specified column as though it were quoted and hence not a NULL value. For the default null string in CSV mode (nothing between two delimiters), this causes missing values to be evaluated as zero-length strings.

QUOTE

Required when FORMAT is CSV. Specifies the quotation character for CSV mode. The default is double-quote (").

HEADER

Optional. Specifies that the first line in the data file(s) is a header row (contains the names of the columns) and should not be included as data to be loaded. If using multiple data source files, all files must have a header row. The default is to assume that the input files do not have a header row.

ENCODING

Optional. Character set encoding of the source data. Specify a string constant (such as 'SQL_ASCII'), an integer encoding number, or 'DEFAULT' to use the default client encoding. If not specified, the default client encoding is used. For information about supported character sets, see the Greenplum Database Reference Guide.

ERROR_LIMIT

Optional. Enables single row error isolation mode for this load operation. When enabled, input rows that have format errors will be discarded provided that the error limit count is not reached on any Greenplum segment instance during input processing. If the error limit is not reached, all good rows will be loaded and any error rows will either be discarded or logged to the table specified in ERROR_TABLE. The default is to abort the load operation on the first error encountered. Note that single row error isolation only applies to data rows with format errors; for example, extra or missing attributes, attributes of a wrong data type, or invalid client encoding sequences.

Constraint errors, such as primary key violations, will still cause the load operation to abort if encountered. For information about handling load errors, see the Greenplum Database Administrator Guide.

ERROR_TABLE

Optional when ERROR_LIMIT is declared. Specifies an error table where rows with formatting errors will be logged when running in single row error isolation mode. You can then examine this error table to see error rows that were not loaded (if any). If the table specified already exists, it will be used. If it does not exist, it will be automatically generated. For information about handling load errors, see the Greenplum Database Administrator Guide.
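A sketch of the two error-handling settings used together (the table name is illustrative):

```yaml
    - ERROR_LIMIT: 25                   # abort only if a segment sees more than 25 bad rows
    - ERROR_TABLE: public.load_errors   # created automatically if it does not exist
```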

OUTPUT

Required. Defines the target table and final data column values that are to be loaded into the database.

TABLE

Required. The name of the target table to load into.

MODE

Optional. Defaults to INSERT if not specified. There are three available load modes:

INSERT - Loads data into the target table using the following method: 

INSERT INTO target_table SELECT * FROM input_data;

UPDATE - Updates the UPDATE_COLUMNS of the target table where the rows have MATCH_COLUMNS attribute values equal to those of the input data, and the optional UPDATE_CONDITION is true. 

MERGE - Inserts new rows and updates the UPDATE_COLUMNS of existing rows where MATCH_COLUMNS attribute values are equal to those of the input data, and the optional UPDATE_CONDITION is true. New rows are identified when the MATCH_COLUMNS value in the source data does not have a corresponding value in the existing data of the target table. If there are multiple new MATCH_COLUMNS values that are the same, only one new row for that value will be inserted. Use UPDATE_CONDITION to filter out rows to discard.

MATCH_COLUMNS

Required if MODE is UPDATE or MERGE. Specifies the column(s) to use as the join condition for the update. The attribute value in the specified target column(s) must be equal to that of the corresponding source data column(s) in order for the row to be updated in the target table. 

UPDATE_COLUMNS

Required if MODE is UPDATE or MERGE. Specifies the column(s) to update for the rows that meet the MATCH_COLUMNS criteria and the optional UPDATE_CONDITION.

UPDATE_CONDITION

Optional. Specifies a Boolean condition (similar to what you would declare in a WHERE clause) that must be met in order for a row in the target table to be updated (or inserted in the case of a MERGE).
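Combining these parameters, a hedged OUTPUT sketch for a MERGE load keyed on an id column might look like this (the table and column names are hypothetical):

```yaml
   OUTPUT:
    - TABLE: public.customers
    - MODE: merge
    - MATCH_COLUMNS:
       - customer_id              # join key between source and target
    - UPDATE_COLUMNS:
       - email
       - status
    - UPDATE_CONDITION: 'status <> ''closed'''   # only touch non-closed rows
```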

MAPPING

Optional. If a mapping is specified, it overrides the default source-to-target column mapping. The default source-to-target mapping is based on a match of column names as defined in the source COLUMNS section and the column names of the target TABLE. A mapping is specified as either:

target_column_name: source_column_name

or

target_column_name: 'expression'

Where expression is any expression that you would specify in the SELECT list of a query, such as a constant value, a column reference, an operator invocation, a function call, and so on.
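As an illustration, a MAPPING block that renames one column and computes another from an expression (the names are hypothetical):

```yaml
    - MAPPING:
        cust_name: source_name       # plain column-to-column rename
        load_date: 'current_date'    # any expression valid in a SELECT list
```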

PRELOAD

Optional. Specifies operations to run prior to the load operation. The available preload operations are TRUNCATE and REUSE_TABLES.

TRUNCATE

Optional. If set to true, gpload will remove all rows in the target table prior to loading it.

REUSE_TABLES

Optional. If set to true, gpload will not drop the external table objects and staging table objects it creates. These objects will be reused for future load operations that use the same load specifications. This improves performance of trickle loads (ongoing small loads to the same target table).
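A sketch of a PRELOAD section suited to repeated trickle loads into the same target:

```yaml
   PRELOAD:
    - TRUNCATE: true       # empty the target table before loading
    - REUSE_TABLES: true   # keep external/staging tables for the next run
```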

SQL

Optional. Defines SQL commands to run before and/or after the load operation. You can specify multiple BEFORE and/or AFTER commands. List commands in the order of desired execution.

BEFORE

Optional. An SQL command to run before the load operation starts. Enclose commands in quotes.

AFTER

Optional. An SQL command to run after the load operation completes. Enclose commands in quotes.
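For example, a hedged SQL section that records load start and end times in a hypothetical audit table (the table and its columns are assumptions, not part of gpload):

```yaml
   SQL:
    - BEFORE: "INSERT INTO audit_log VALUES ('load start', current_timestamp)"
    - AFTER: "INSERT INTO audit_log VALUES ('load end', current_timestamp)"
```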

Notes

If your database object names were created using a double-quoted identifier (delimited identifier), you must specify the delimited name within single quotes in the gpload control file. For example, if you create a table as follows:

CREATE TABLE "MyTable" ("MyColumn" text);

Your YAML-formatted gpload control file would refer to the above table and column names as follows:

- COLUMNS:
   - '"MyColumn"': text
OUTPUT:
- TABLE: public.'"MyTable"'

Log File Format

Log files output by gpload have the following format:

timestamp|level|message

Where timestamp takes the form: YYYY-MM-DD HH:MM:SS, level is one of DEBUG, LOG, INFO, ERROR, and message is a normal text message.

Some INFO messages that may be of interest in the log files are (where # corresponds to the actual number of seconds, units of data, or failed rows):

INFO|running time: #.## seconds

INFO|transferred #.# kB of #.# kB.

INFO|gpload succeeded

INFO|gpload succeeded with warnings

INFO|gpload failed

INFO|1 bad row

INFO|# bad rows

The MAX_LINE_LENGTH value that you set in the YAML file is passed through to gpfdist. The gpfdist default is 32768 bytes (32 KB); the maximum allowed value is 268435456 bytes (256 MB), or 1 MB on Windows.

The template above shows the syntax for GPLOAD using a YAML file. For example, users can run a load job as defined in my_load.yml using gpload:

gpload -f my_load.yml 

Example:

[sachi@sachi mdp]$ cat test.yml 

%YAML 1.1

---

VERSION: 1.0.0.1

 

DATABASE: sachi

USER: sachi

PASSWORD: sachi

HOST: sachi

PORT: 5432

GPLOAD:

   INPUT:

     - SOURCE:

        LOCAL_HOSTNAME:

         - sachi

        PORT: 8082

        FILE:

         - /home/sachi/dia1data1/REGIONS.csv

     - COLUMNS:

       - region_id: numeric

       - region_name: text

     - FORMAT: csv

     - DELIMITER: ","

     - ESCAPE: '/'

     - NULL_AS: '/N'

     - QUOTE: '"'

     - ENCODING: 'utf8'

     - ERROR_LIMIT: 5

     - ERROR_TABLE: sachi.ext_regions_error

   OUTPUT:

     - TABLE: sachi.regions

     - MODE: INSERT

   PRELOAD:

[sachi@sachi mdp]$ gpload -f test.yml

2013-05-03 14:52:19|INFO|gpload session started 2013-05-03 14:52:19

2013-05-03 14:52:19|INFO|started gpfdist -p 8082 -P 8083 -f "/home/sachi/dia1data1/REGIONS.csv" -t 30

2013-05-03 14:52:19|INFO|running time: 0.25 seconds

2013-05-03 14:52:19|INFO|rows Inserted          = 4

2013-05-03 14:52:19|INFO|rows Updated           = 0

2013-05-03 14:52:19|INFO|data formatting errors = 0

2013-05-03 14:52:19|INFO|gpload succeeded

[sachi@sachi mdp]$