Unloading data from Greenplum database


A writable external table allows you to select rows from other database tables and output the rows to files, named pipes, or to other executable programs. Writable external tables can also be used as output targets for Greenplum parallel MapReduce calculations. There are two ways to define writable external tables, file-based and web-based tables.

Unloading Data Using a Writable External Table
Because writable external tables only allow INSERT operations, users who are not the table owner or a superuser must be granted INSERT permissions on the table in order to use it. For example:

GRANT INSERT ON writable_ext_table TO admin;

To unload data using a writable external table, select the data from the source table(s) and insert it into the writable external table. The rows produced by the SELECT statement will be output to the writable external table. For example:

INSERT INTO writable_ext_table SELECT * FROM regular_table;

To source large volumes of data from Greenplum performance over ODBC may not be sufficient. If not then consider using the Greenplum “File-Based Writable External Table” feature. See the Greenplum Admin Guide for more information on this feature.

The following steps are required to integrate Greenplum file-based writable external tables and Informatica:

1. Using a shell script (perhaps executed from a command task)
a. Create an empty directory on the Informatica server, e.g. mkdir /path/to/new/dir
b. Start gpfdist on an available port against that directory, e.g. gpfdist -d /path/to/new/dir -p 8081 -l /path/to/log/dir/gpfdist.log &
c. Create a named pipe in the directory, e.g. mkfifo /path/to/new/dir/data.fifo

2. Execute sql commands against the Greenplum database (perhaps in simple mappings using a SQLT):
a. First create a writable external table, e.g. create writable external table gpinfa.unload_us_person (like gpinfa.us_person)
location ('gpfdist://<InformaticaServer>:8081/us_person1.out')
format 'TEXT' (DELIMITER ',')
distributed by (person_id);
b. Insert records from the table to be sourced into the writable external table, e.g. insert into gpinfa.unload_us_person select * from gpinfa.us_person;

3. At this point Greenplum will start to send data to the gpfdist process on the Informatica server, which in turn will write the data to the named pipe. This process will be blocked and will not complete until another process reads from the named pipe. At this point the Informatica session can begin reading the data from the named pipe and processing it as desired.

4. Once the session has completed the insert statement above will complete. Additional SQL commands can be executed to drop the file-based writable external table. Again this could perhaps be done using a mapping that executes after the main load and uses a SQLT to execute the SQL, e.g. drop external table gpinfa.unload_us_person;

5. Final clean-up on the file system can be done using command tasks.
a. Delete the named pipe, e.g. rm /path/to/new/dir/data.fifo
b. Stop gpfdist. There are many ways to do this, but an easy way is to pass part or all of the command executed earlier to the pkill command, e.g. pkill -f -u <user> " gpfdist -d /path/to/new/dir -p 8081 -l /path/to/log/dir/gpfdist.log "
c. Remove the directory that was created, e.g. rmdir /path/to/new/dir

Unloading Data Using COPY
COPY TO copies data from a table to a file (or standard input) on the Greenplum master host. COPY is non-parallel, meaning that data is unloaded in a single process via the Greenplum master instance. You can use COPY to output a table’s entire contents, or filter the output using a SELECT statement. For example:
COPY (SELECT * FROM country WHERE country_name LIKE 'A%') TO '/home/gpadmin/a_list_countries.out';


Let's discuss in details about creating writable external table.



Creating a Writable External Table

1. File-Based Writable External Table
2. Command-Based Writable External Web Table

1. File-Based Writable External Table
Writable external tables that output data to files use the Greenplum parallel file server program (gpfdist) or the Hadoop Distributed File System interface (gphdfs), just as file-based readable external tables do.

Use the CREATE WRITABLE EXTERNAL TABLE command to define the external table and specify the location and format of the output files.

1. With a writable external table that uses the gpfdist protocol, the Greenplum segments send their output rows to the gpfdist process, which then writes out the data to the named file. You must have the Greenplum file distribution program (gpfdist) running on a host that is accessible over the network by the Greenplum segments. This program points to a given file location on the output host and writes output data received from the Greenplum segments to this file. You can list multiple gpfdist URIs in your writable external table definition if you want your output data to be divided across multiple files.

2. A writable external web table outputs data to an executable program. For example, you could unload data from Greenplum Database and send it to an executable that connects to another database or ETL tool and loads the data elsewhere. Web tables use the EXECUTE clause to specify a shell command, script, or executable program to run on the segment hosts. For writable external web tables, this command must be prepared to accept an input stream of data.

Unlike readable external tables, writable external tables also have the option to declare a distribution policy. By default, writable external tables use a random distribution policy. If the source table you are exporting data from has a hash distribution policy, defining the same distribution key column(s) for the writable external table will improve unload performance by eliminating the need to move rows over the interconnect. If you are unloading data from a particular table, you can use the LIKE clause to copy the column definitions and distribution policy from the source table.

Example 1—Greenplum file server (gpfdist)

CREATE WRITABLE EXTERNAL TABLE unload_expenses
( LIKE expenses )
LOCATION ('gpfdist://etlhost-1:8081/expenses1.out',
'gpfdist://etlhost-2:8081/expenses2.out')
FORMAT 'TEXT' (DELIMITER ',')
DISTRIBUTED BY (exp_id);

Example 2—Hadoop file server (gphdfs)

CREATE WRITABLE EXTERNAL TABLE unload_expenses
( LIKE expenses )
LOCATION ('gphdfs://hdfslhost-1:8081/path')
FORMAT 'TEXT' (DELIMITER ',')
DISTRIBUTED BY (exp_id);

There are two additional restrictions for writing file-based external tables with the gphdfs protocol, as follows:

1. You can specify only a directory for a writable external table with the gphdfs protocol. 
 (You can specify only one file for a readable external table with the gphdfs protocol.)
2. TEXT is the only format allowed.

Note: The default port number is 9000.


2. Command-Based Writable External Web Table
Similar to readable web external tables that execute a command or program, writable external web tables can also be defined to send output rows to an executable program or script. The executable program used in the writable external table definition must be prepared to accept an input stream, reside in the same location on all of the Greenplum segment hosts, and be executable by the gpadmin user. The command specified in your writable external table definition is executed by all segments in the Greenplum system, whether or not a segment has output rows to process or not.

Use the CREATE WRITABLE EXTERNAL WEB TABLE command to define the external table and specify the executable command or program to run on the segment hosts. If you use environment variables in the executable command (such as $PATH), keep in mind that the command is executed from within the database and not from a login shell. Therefore the .bashrc or .profile of the current user will not be sourced. However, you can set desired environment variables from within the EXECUTE clause of your writable external table definition, for example:

CREATE WRITABLE EXTERNAL WEB TABLE output (output text)
EXECUTE 'export PATH=$PATH:/home/gpadmin/programs;
myprogram.sh'
FORMAT 'TEXT'
DISTRIBUTED RANDOMLY;

Disabling EXECUTE for Web or Writable External Tables
External tables that execute OS commands or scripts have a certain security risk associated with them. Some database administrators may decide that they do not want their Greenplum Database systems exposed to this functionality. If this is the case, you can disable the use of EXECUTE in web and writable external table definitions by setting the following server configuration parameters in your master postgresql.conf file:
gp_external_enable_exec = off

Using Environment Variables in EXECUTE commands
If you use environment variables in external table EXECUTE commands (such as $PATH), keep in mind that the command is executed from within the database and not from a login shell. Therefore the .bashrc or .profile of the current user is not sourced. However, you can set desired environment variables from within the EXECUTE clause of your external table definition, for example:
CREATE EXTERNAL WEB TABLE test (column1 text) EXECUTE 'MAKETEXT=text-text-text; export MAKETEXT; echo $MAKETEXT' FORMAT 'TEXT';
SELECT * FROM test;
column1
----------------
text-text-text
text-text-text
(2 rows)


STEP BY STEP INSTRUCTIONS TO UNLOAD DATA FROM GPDB USING EXTERNAL TABLE

Prerequisites

External tables can write data only to servers connected to Greenplum’s 10 GB network. So the extracting process needs to have a login and assigned storage (disk quotas) on commodity/ ETL server.


Steps

1. Start gpfdist on ETL server

Start one or multiple gpfdist processes on the gpfdist server like below:

gpfdist -p 8080 -d /data1/projectdir -m 4000000 -l /data1/projectdir/gpfdist.log &
gpfdist -p 8081 -d /data1/projectdir -m 4000000 -l /data1/projectdir/gpfdist.log &

You can use more than one gpfdist instances to increase the speed of data extraction. The number of gpfdist instances may be limited by disk write speed. The extracted data files will go to /data1/projectdir (as indicated by the -d option), so there should be enough available space in the disk to extract the desired volume of data. Alternatively, the extraction can also be in a FIFO pipe. 

Each process should be checked for success (i.e. if there is another process running on the same port, gpfdist will not start). Also the process ids should be saved so the gpfdist processes can be terminated after the data extraction process.

 

2. Create external table and extract data

This will depend on a few things.

1. The structure of the data (columns) being downloaded. It could be the all columns of a specific table or the columns of a custom SQL.
2. The host name of the gpfdist server
3. The port numbers of the gpfdist instances stared in the previous step.
4. The file names that would be used.
5. The format of the data file. The options supported are delimited text files and csv. However, more complex formats can be supported by the gpfdist formatter extension.

Case 1 (Extract all columns of a table):

Let us assume the following:

a. We will be extracting all columns of table demotable.
b. The ETL server is etl1,
c. gpfdist is started on ports 8080 and 8081 and the corresponding file names we will extract to are data1.txt and data2.txt
d. We will use the default text format (tab delimited data)

Based on these assumptions we should create our external table and extract data like the following:

drop external table if exists ext_demotable;

create external table ext_demotable (like demotable)

location (
'gpfdist://etl1:8080/data1.txt',
'gpfdist://etl1:8081/data2.txt'
)
format 'text';

insert into ext_demotable select * from demotable;

drop external table ext_demotable;
  
Case 2 (Extract data using a custom SQL):

Instead of extracting all columns from demotable, we will extract data using a custom SQL like below:

The external table can be created and used like below: 

select col1, col2
from demotable
where create_date >= current_date - interval '1 month';

Comment: Using GP temp table is allowed for external table process.

create temp table t1 as

select col1, col2
from demotable
where create_date >= current_date - interval '1 month'
limit 0
distributed randomly;

drop external table if exists ext_demotable;

create external table ext_custom_messages (like t1)
location (
'gpfdist://etl1:8080/data1.txt',
'gpfdist://etl1:8081/data2.txt'
)
format 'text';

insert into ext_demotable
select col1, col2
from demotable
where create_date >= current_date - interval '1 month';

drop external table ext_demotable;

3. Kill gpfdist processes
Kill the gpfdist processes on the ETL server with the process ids saved from step 1.

4. Continue with your data extraction 
The needed data has been extracted in files /data1/projectdir/data1.txt and /data1/projectdir/data2.txt in tab delimited format. These files can now be processed as needed by the application.


Issues that need to be resolved:

1. Starting and stopping gpfdist processes with port numbers need to be scripted with failure checks. The relevant port numbers need to be used in creation of the external tables.

2. The database login extracting data needs to be able to create writable external table in some schema (public is default) in the database. We can possibly create utility stored functions to bypass this issue.

3. There could be name conflict while projects create and drop external tables (two simultaneous extracts using the same name). We have to create naming standards or use utility stored functions to handle this.

4. The project must be able to process delimited and/or csv files. Note: gpfdist formatter can be used to create files in custom formats.

5. If FIFO pipes are used instead of files as the extraction target, the process that reads (consumes) the extracted data needs to be fast as well, otherwise the overall speed of the process will degrade to the speed of the consumer application. So, if gpfdist can download data at 25GB per minute, but the application can only process this data at 1GB per minute, the overall process (SQL statement) will work at the speed of 1GB per minute.

Comment: FIFO is not encouraged for large dataset extraction; it may be bottle-necked by the downstream 1 Gb network; just like ODBC.

6. If actual files are used as extraction target, the extractor process must be granted enough disk quotas to extract the entire data.

Comments