ETL Hosts for Data Loading

Post date: Sep 21, 2012 2:12:14 PM

Q. How can we achieve maximum parallelism and load bandwidth from Greenplum Database system?

Ans: Greenplum supports fast, parallel data loading with its external tables feature. By using external tables in conjunction with Greenplum Database’s parallel file server (gpfdist), administrators can achieve maximum parallelism and load bandwidth from their Greenplum Database system. Many production systems deploy designated ETL servers for data loading purposes. These machines run the Greenplum parallel file server (gpfdist), but not Greenplum Database instances.

One advantage of using the gpfdist file server program is that it ensures that all of the segments in your Greenplum Database system are fully utilized when reading from external table data files.

The gpfdist program can serve data to the segment instances at an average rate of about 350 MB/s for delimited text formatted files and 200 MB/s for CSV formatted files. Therefore, you should consider the following options when running gpfdist in order to maximize the network bandwidth of your ETL systems:

1. If your ETL machine is configured with multiple network interface cards (NICs), run one instance of gpfdist on your ETL host and then define your external table definition so that the host name of each NIC is declared in the LOCATION clause. This allows network traffic between your Greenplum segment hosts and your ETL host to use all NICs simultaneously.

2. Run multiple gpfdist instances on your ETL host and divide your external data files equally between each instance. For example, if you have an ETL system with two network interface cards (NICs), then you could run two gpfdist instances on that machine to maximize your load performance. You would then divide the external table data files evenly between the two gpfdist programs.