gpfdist

gpfdist

Serves data files to or writes data files out from Greenplum Database segments.

Synopsis

gpfdist [-d directory] [-p http_port] [-l log_file] [-t timeout] 

[-S] [-v | -V] [-m max_length] [--ssl certificate_path]

gpfdist -? 

gpfdist --version

Default row limit for external tables using gpfdist is 32KB as documented, if certain rows are longer than 32KB , the query will error out like above. Use -m option to increase the max row length for gpfdist, this value can be increased up to 256MB. For example, increase the vales to 64KB solved the issue in our test case.gpfdist -t 600 -d /tmp -l /tmp/1.log -p 8090 -V -m 655350 &

In a large scale, multi-terabyte data warehouse, large amounts of data must be loaded within a relatively small maintenance window. Greenplum supports fast, parallel data 

loading with its external tables feature. Administrators can also load external tables in single row error isolation mode to filter bad rows into a separate error table while 

continuing to load properly formatted rows. Administrators can specify an error threshold for a load operation to control how many improperly formatted rows cause 

Greenplum to abort the load operation.

By using external tables in conjunction with Greenplum Database’s parallel file server (gpfdist), administrators can achieve maximum parallelism and load bandwidth 

from their Greenplum Database system. 

External Tables Using Greenplum Parallel File Server (gpfdist)

Note: If you have upgraded your greenplum database to a new version and If your system uses external tables with gpfdist, stop all gpfdist processes on 

your ETL servers and reinstall gpfdist using the compatible Greenplum Database 4.2.x.x Load Tools package. Application Packages are available at the 

EMC Download Center.

Check if gpfdist version is same as Greenplum database version.

$ gpfdist --version

gpfdist version "3.2.1.0 build 7"

$ psql -c 'select version();'

PostgreSQL 8.2.10 (Greenplum Database 3.2.1.0 build 7)

Resolved issues

Version 4.2.5.1

18941 Loaders 4.2.5.0 The maximum timeout value for the gpfdist utility -t option was 30 seconds. The maximum value is increased to 600 seconds. 

18703 Loaders The bytenum field (byte offset in the load file where the error occurred) in the error  log when using gpfdist with data in text format errors is not populated, making it difficult to find the location of an error in the source file.

18414 Loaders Unable to define external tables with fixed width format and empty line delimiter when file size is larger than gpfdist chunk (by default, 32K).

17189 Loaders: gpfdist gpfdist shows the error “Address already in use” after successfully binding to socket IPv6.Greenplum supports IPv4 and IPv6. However, gpfdist fails to bind to socket IPv4, and shows the message “Address already in use”, but binds successfully to socket IPv6. 

Version 4.2.5.2

18941 Loaders 4.2.5.0 The maximum timeout value for the gpfdist utility -t option was 30 seconds. The maximum value is increased to 600 seconds.

18703 Loaders The bytenum field (byte offset in the load file where the error occurred) in the error log when using gpfdist with data in text format errors is not populated, making it difficult to find the location of an error in the source file.

18414 Loaders Unable to define external tables with fixed width format and empty line delimiter when file size is larger than gpfdist chunk (by default, 32K).

17189 Loaders: gpfdist gpfdist shows the error “Address already in use” after successfully binding to socket IPv6. Greenplum supports IPv4 and IPv6. However, gpfdist fails to bind to socket IPv4, and shows the message “Address already in use”, but binds successfully to socket IPv6. 

Version 4.2.6.3

19967 Loaders: gpfdist 4.2.6.1 In some cases on large clusters, using the Greenplum gpfdist utility to write to Greenplum Database external writable tables failed with a timeout or connection error

20294 Loaders: gpfdist 4.2.6.1 On Windows client machines, when the Greenplum gpfdist utility wrote to files that are named pipes, gpfdist did not close the files after writing to the files was completed. 

18703 Loaders The bytenum field (byte offset in the load file where the error occurred) in the error log when using gpfdist with data in text format errors is not populated, making it difficult to find the location of an error in the source file. 

18414 Loaders Unable to define external tables with fixed width format and empty line delimiter when file size is larger than gpfdist chunk (by default, 32K).

17189 Loaders: gpfdist gpfdist shows the error “Address already in use” after successfully binding to socket IPv6. Greenplum supports IPv4 and IPv6. However, gpfdist fails to bind to socket IPv4, and shows the message “Address already in use”, but binds successfully to socket IPv6. 

Description

gpfdist is Greenplum’s parallel file distribution program. It is used by readable external tables and gpload to serve external table files to all Greenplum Database segments in parallel. It is used by writable external tables to accept output streams from Greenplum Database segments in parallel and write them out to a file. In order for gpfdist to be used by an external table, the LOCATION clause of the external table definition must specify the correct file location using the gpfdist:// protocol (see the Greenplum Database command CREATE EXTERNAL TABLE). The benefit of using gpfdist is that you are guaranteed maximum parallelism while reading from or writing to external tables, thereby offering the best performance as well as easier administration of external tables.

For readable external tables, gpfdist parses and serves data files evenly to all the segment instances in the Greenplum Database system when users SELECT from the external table. For writable external tables, gpfdist accepts parallel output streams from the segments when users INSERT into the external table, and writes to an output file.

For readable external tables, if load files are compressed using gzip or bzip2 (have a .gz or .bz2 file extension), gpfdist uncompresses the files automatically before 

loading provided that gunzip or bunzip2 is in your path. 

Note: Currently, readable external tables do not support compression on Windows platforms, and writable external tables do not support compression on any platforms. Most likely, you will want to run gpfdist on your ETL machines rather than the hosts where Greenplum Database is installed. To install gpfdist on another host, simply  copy the utility over to that host and add gpfdist to your $PATH. Note: When using IPv6, always enclose the numeric IP address in brackets.

Default row limit for external tables using gpfdist is 32KB as documented, if certain rows are longer than 32KB , the query will error out like above. Use -m option to increase the max row length for gpfdist, this value can be increased up to 256MB. For example, increase the vales to 64KB solved the issue in our test case.gpfdist -t 600 -d /tmp -l /tmp/1.log -p 8090 -V -m 655350 &

Options

-d directory

The directory from which gpfdist will serve files for readable external tables or create output files for writable external tables. If not specified, defaults to the current directory.

-l log_file

The fully qualified path and log file name where standard output messages are to be logged.

-p http_port

The HTTP port on which gpfdist will serve files. Defaults to 8080.

-t timeout

Sets the time allowed for Greenplum Database to establish a connection to a gpfdist process. Default is 5 seconds. Allowed values are 2 to 600 seconds. May need to be increased on systems with a lot of network traffic.

-S (use O_SYNC)

Opens the file for synchronous I/O with the O_SYNC flag. Any writes to the resulting file descriptor block gpfdist until the data is physically written to the underlying hardware.

-m max_length

Sets the maximum allowed data row length in bytes. Default is 32768. Should be used when user data includes very wide rows (or when line too long error message occurs). Should not be used otherwise as it increases resource allocation. Valid range is 32K to 256MB.

-v (verbose)

Verbose mode shows progress and status messages.

-V (very verbose)

Verbose mode shows all output messages generated by this utility.

--ssl certificate_path

Adds SSL encryption to data transferred with gpfdist. After executing gpfdist with the --ssl certificate_path option, the only way to load data from this file server is with the gpfdists protocol. For information on the gpfdists protocol, see the Greenplum Database Database Administrator Guide. The location specified in certificate_path must contain the following files:

• The server certificate file, server.crt

• The server private key file, server.key

• The trusted certificate authorities, root.crt

The root directory (/) cannot be specified as certificate_path.

-? (help)

Displays the online help.

--version

Displays the version of this utility.

Examples

Serve files from a specified directory using port 8081 (and start gpfdist in the background):

gpfdist -d /var/load_files -p 8081 &

Start gpfdist in the background and redirect output and errors to a log file:

gpfdist -d /var/load_files -p 8081 -l /home/gpadmin/log &

To stop gpfdist when it is running in the background:

--First find its process id:

ps ax | grep gpfdist

OR on Solaris

ps -ef | grep gpfdist

--Then kill the process, for example:

kill 3456

How to troubleshoot Greenplum file distribution program (gpfdist) issues?

Issue Issues with file distribution program (gpfdist) 

ERROR: Out of memory

"Invalid syntax" error message while reading from External Table. 

Environment EMC SW: Greenplum Database

Resolution

Greenplum file distribution program (gpfdist) runs on the host where the external data files reside. This program points to a given directory on the file host and serves external data files to all Greenplum Database segments in parallel.

If you encounter issues with gpfdist, refer to the items listed below before contacting Greenplum customer support.

Is gpfdist the same version as the database?

$ gpfdist --version

gpfdist version "3.2.1.0 build 7"

$ psql -c 'select version();'

PostgreSQL 8.2.10 (Greenplum Database 3.2.1.0 build 7)

Can you get the file from the ETL host to the Greenplum array using WGET or CURL?

$ wget http://MACHINE:8080/filename

Notes:

Solaris - You may not have wget set in your path so you need to find it (export PATH=/usr/sfw/bin:/usr/opt/bin:/usr/bin:/opt/sfw/bin:/usr/local/bin:/usr/sbin:/usr/ccs/bin:$PATH)

If you do not have wget, use curl$:

curl -s -S ™http://MACHINE:8080/filename™

If you are not able to get the file using wget (curl), address issues with the network before attempting another gpfdist operation.

Error messages like 'invalid syntax' or inconsistent results when reading from External Table and using LIMIT clause:

Make sure you are not using 'limit x' in your query. When gpfdist hands out data to the backend it does so on a "first come first serve" basis. Unlike COPY or external table 'file' protocol, where the data processing is more sequential, with gpfdist you can never know which segdb will process which chunk of the data, nor when. It is essentially a race. If there is a bad data row in your data, it may or may not show up in your LIMIT query, because it depends on whether the segdb that received the chunk of data with that bad row managed to process it in time, before the other segdbs processed their chunks of data, and before the executor shuts down because LIMIT is satisfied.

Getting "ERROR: Out of memory" while reading from External Table

If you are using error tables for rejected rows in your external tables DDL, and during your ETL process many bad rows are rejected, you might run into a known memory leak issue while using Single Row Error Handling (SREH).

Affected versions: 3.1.x

This issue is fixed in 3.2.0.0 and later versions.

Note: It is required to have connectivity between the ETL host and the segment nodes.

The gpfdist processes on the prod ETL servers are dying, They must be restarted on a frequent basis. This is impacting workflow.