We have migrated to new Google sites!

Performance testing for PWX for GP

Post date: Sep 09, 2014 7:13:8 PM

Problem: If I load one file to a table (1 single Informatica session, 1 GP connection) the load takes about 7 minutes. Now if change and do say 15 loads all to that same table via 15 individual Informatica sessions each with their own external table, the same volume of data for each session and a GP connection for each session, the load time jumps to about 15-20 minutes. Prior to the loads, I truncate the target table manually and then start the loads. Now once that first set of loads completes, if I repeat the 15 loads without truncating the loads complete in 7 minutes just as a single load would. What would cause the original 15 to take so much longer? There is no vacuum or analyze after the loads, just a data dump and commit.

SELECT relation::regclass, * FROM pg_locks WHERE NOT granted;

------------------+----------+----------+----------+------+-------+---------------+---------+-------+----------+-------------+------+--------------------------+---------+--------------+-------------+---------------

salesdata | relation | 21635 | 710899 | | | | | | | 5038006 | 4076 | ShareUpdateExclusiveLock | f | 178315 | t | -1

salesdata | relation | 21635 | 710899 | | | | | | | 5038012 | 4117 | ShareUpdateExclusiveLock | f | 178316 | t | -1

salesdata | relation | 21635 | 710899 | | | | | | | 5038015 | 4140 | ShareUpdateExclusiveLock | f | 178317 | t | -1

salesdata | relation | 21635 | 710899 | | | | | | | 5038017 | 4169 | ShareUpdateExclusiveLock | f | 178319 | t | -1

salesdata | relation | 21635 | 710899 | | | | | | | 5038016 | 4152 | ShareUpdateExclusiveLock | f | 178318 | t | -1

Solution:Greenplum has an auto-analyze feature that kicks of when data is inserted into a table the first time (after create or truncate). This feature is controlled by the setting gp_autostats_mode which is typically set to (ON_NO_STATS).

Analyzing a table is a fairly costly operation (multiple SQLs are run per column of the table) and the analyze acquires a ShareUpdateExclusiveLock on the table which makes sure that multiple analyze statements on a single table cannot run in parallel. Also, since analyze only samples the table, the speed of the analyze is not impacted so much by the number of rows as it is impacted by the number of columns and partitions.

Now, think of fifteen parallel initial inserts. Greenplum will assign all fifteen of them the "analyze after insert" task because the table was empty before each of the inserts. After the insert itself which should finish quickly, all fifteen threads will analyze the table. Because of the ShareUpdateExclusiveLock, the analyze operations will run serially, not in parallel. This will cause the operation to slow down a lot.

In case of non-parallel inserts, one insert and one analyze operation will happen, which is faster.

In case of subsequent loads, the table is non-empty, so the analyze operation is not triggered at all.

Hope the above explains the situation you faced. Given this, I would recommend the following:

1. Try to not parallelize load (insert, update, delete) operations to Greenplum. The MPP system already runs many parallel threads for you for a single statement, so it is often not effective to implement external parallelization. I believe Informatica has an option to parallelize its internal operations while still submitting a single load operation to Greenplum.

2. If for some reason, such parallel operations are not avoidable (they should always be, I hope), set the GP_AUTOSTATS_MODE to NONE and analyze the table after all load threads are complete. GP_AUTOSTATS_MODE can be changed at session level, but it would be more challenging to achieve it through PWX (gpload). Setting GP_AUTOSTATS_MODE at global level might impact other users who may be expecting the auto-analyze behavior.

Report abuse