Maximizing load/unload performance by optimizing gpfdist performance

Post date: Aug 25, 2014 7:14:21 PM

Gpfdist is a file server program using a HTTP protocol that serves files in parallel and provides the best performance when loading or unloading data in Greenplum database. Primary segments access external files in parallel when using gpfdist up to the value of gp_external_max_segments GUC. 

In general, when optimizing gpfdist performance, maximize the parallelism as the number of segments increases. Spread the data evenly across as many nodes as possible. Split very large data files into equal parts and spread the data across as many file systems as possible. Run two gpfdist‘s per file system. Gpfdist tends to be CPU bound on the segment nodes when loading, but for example, if there are 8 racks of segment nodes, there is lot of available CPU on the segment side so you 

can drive more gpfdist processes. Run gpfdist on as many interfaces as possible (and be aware of bonded NICs and be sure to start enough gpfdist’s to work them). It is important to keep the work even across all these resources. In an MPP shared nothing environment, the load is as fast as the slowest node. Skew in the load file layout will cause the overall load to bottleneck on that resource.

gp_external_max_segs controls the number segments each gpfdist serves. The default is 64. Always keep gp_external_max_segs and the number of gpfdist processes and even factor (gp_external_max_segs divided by the # of gpfdist processes should have a 0 remainder). The way this works, for example if there are 12 segments and 4 gpfdist’s then the planner round robins the assignment as follows:

Seg 1 - gpfdist 1

Seg 2 - gpfdist 2

Seg 3 - gpfdist 3

Seg 4 - gpfdist 4

Seg 5 - gpfdist 1

Seg 6 - gpfdist 2

Seg 7 - gpfdist 3

Seg 8 - gpfdist 4

Seg 9 - gpfdist 1

Seg 10 - gpfdist 2

Seg 11 - gpfdist 3

Seg 12 - gpfdist 4

Remember to drop indexes before loading into existing tables and re-create the index after loading. Creating an index on pre-existing data is faster than updating it incrementally as each row is loaded. Always run ANALYZE after loading and if the load significantly alters the table data run VACUUM ANALYZE. Disable automatic statistics collection during loading by setting the gp_autostats_mode GUC to NONE. 

Run VACUUM after load errors to recover space.