Loading and Unloading data into Greenplum database

Greenplum provides following 3 tools for loading and unloading data.

1. Using External Tables

2. Using gpload

3. Using COPY

Using EXTERNAL TABLES

External tables allow you to access external files as though they were regular database tables.Used in conjunction with Greenplum’s parallel file distribution program (gpfdist), external tables provide full parallelism by utilizing the resources of all Greenplum segments when loading or unloading data. Greenplum Database also leverages the parallel architecture of the Hadoop Distributed File System to access files on that system.

Greenplum Database provides two types of external tables

1. Readable external tables for data loading

2. Writable external tables for data unloading

External tables are either file based or web based.  Either type of external table can be readable or writable. 

Readable external tables provide an easy way to perform basic extraction, transformation, and external table loading (ETL) tasks that are common in data warehousing.

External table data is read in parallel by all Greenplum Database segment instances, so large load operations are processed as quickly as possible. Once the external table is defined, you can query the data directly and in parallel using SQL commands.

You can, for example, select, join, or sort external table data. You can also create views for external tables. However, readable external tables cannot be modified. DML operations (UPDATE, INSERT, DELETE, or TRUNCATE) are not allowed.

There are two types of readable external tables – regular and web.

The main difference between regular external tables and web external tables are their data sources. Regular external tables access static flat files, while web external tables access dynamic data sources – either on a web server with the http:// protocol or by executing OS commands or scripts.

When a query is planned using a regular external table, the external table is considered rescannable since the data is static for the course of the query. For web external tables, the data is not rescannable because there is the possibility that the data could change during the course of the execution of the query.

Writable external tables are used to select rows from other database tables and output the rows to files, named pipes, or to other executable programs. For example, you could unload data from Greenplum Database and send it to an executable that connects to another database or ETL tool to load the data elsewhere. Writable external tables can also be used as output targets for Greenplum parallel MapReduce calculations.

Once a writable external table is defined, data can be selected from database tables and inserted into the writable external table. Writable external tables only allow INSERT operations – SELECT, UPDATE, DELETE, or TRUNCATE are not allowed.Writable web external tables output data to an executable program that can accept an input stream of data.

Using gpload

gpload is a data loading utility that acts as an interface to Greenplum’s external table parallel loading feature. Using a load specification defined in a YAML formatted control file, gpload executes a load by invoking the Greenplum parallel file server program (gpfdist), creating an external table definition based on the source data defined, and executing an INSERT, UPDATE or MERGE operation to load the source data into the target table in the database.

Using  COPY

Greenplum Database also offers the standard PostgreSQL COPY command for loading and unloading data. COPY is a non-parallel load/unload mechanism, meaning that data is loaded/unloaded in a single process via the Greenplum master instance. In cases where a small amount of data is involved, COPY offers a simple way to move data into or out of the database in a single transaction, without the administrative overhead of setting up an external table.