Query Profiling in Greenplum

Greenplum Database devises a query plan for each query. Choosing the right query plan to match the query and data structure is necessary for good performance. A query plan defines how Greenplum Database will run the query in the parallel execution environment. Examine the query plans of poorly performing queries to identify possible performance tuning opportunities.

The query planner uses data statistics maintained by the database to choose a query plan with the lowest possible cost. Cost is measured in disk I/O, shown as units of disk page fetches. The goal is to minimize the total execution cost for the plan.

View the plan for a given query with the EXPLAIN command. EXPLAIN shows the query planner’s estimated cost for the query plan. For example:

EXPLAIN SELECT * FROM names WHERE id=22;

EXPLAIN ANALYZE runs the statement in addition to displaying its plan. This is useful for determining how close the planner’s estimates are to reality. 

For example:

EXPLAIN ANALYZE SELECT * FROM names WHERE id=22;

Reading EXPLAIN Output

A query plan is a tree of nodes. Each node in the plan represents a single operation, such as a table scan, join, aggregation, or sort.

Read plans from the bottom to the top: each node feeds rows into the node directly above it. The bottom nodes of a plan are usually table scan operations: sequential, index, or bitmap index scans. If the query requires joins, aggregations, sorts, or other operations on the rows, there are additional nodes above the scan nodes to perform

these operations. The topmost plan nodes are usually Greenplum Database motion nodes: redistribute, explicit redistribute, broadcast, or gather motions. These operations move rows between segment instances during query processing.

The output of EXPLAIN has one line for each node in the plan tree and shows the basic node type and the following execution cost estimates for that plan node:

1. cost: Measured in units of disk page fetches. 1.0 equals one sequential disk page read. The first estimate is the start-up cost of getting the first row and the second is the total cost of cost of getting all rows. The total cost assumes all rows will be retrieved, which is not always true; for example, if the query uses LIMIT, not all rows are retrieved.

2. rows: The total number of rows output by this plan node. This number is usually less than the number of rows processed or scanned by the plan node, reflecting the estimated selectivity of any WHERE clause conditions. Ideally, the estimate for the topmost node approximates the number of rows that the query actually returns, updates, or deletes.

3. width: The total bytes of all the rows that this plan node outputs.

Note the following:

1. The cost of a node includes the cost of its child nodes. The topmost plan node has the estimated total execution cost for the plan. This is the number the planner intends to minimize.

2. The cost reflects only the aspects of plan execution that the query planner takes into consideration. For example, the cost does not reflect time spent transmitting result rows to the client.