Reading EXPLAIN ANALYZE Output in Greenplum

Post date: Oct 10, 2013 10:40:24 PM

EXPLAIN ANALYZE causes the statement to be actually executed, not only planned. The EXPLAIN ANALYZE plan shows the actual results along with the planner’s estimates. This is useful for seeing whether the planner’s estimates are close to reality. 

In addition to the information shown in the EXPLAIN plan, EXPLAIN ANALYZE will show the following additional information:

1. The total elapsed time (in milliseconds) that it took to run the query.

2. The number of workers (segments) involved in a plan node operation. Only segments that return rows are counted.

3. The maximum number of rows returned by the segment that produced the most rows for an operation. If multiple segments produce an equal number of rows, the one with the longest time to end is the one chosen.

4. The segment id number of the segment that produced the most rows for an operation.

5. The time (in milliseconds) it took to retrieve the first row from the segment that produced the most rows, and the total time taken to retrieve all rows from that segment. The <time> to first row may be omitted if it is the same as the <time> to end.

EXPLAIN ANALYZE Example

To illustrate how to read an EXPLAIN ANALYZE query plan, we will use the same simple query we used in the "EXPLAIN Example" in my previous blog. Notice that there is some additional information in this plan that is not in a regular EXPLAIN plan. The Rows out lines and the statistics at the bottom show the actual timing and rows returned for each plan node:

sachi=> EXPLAIN ANALYZE select * from employees where employee_id=198;

                                                QUERY PLAN                                                 

-----------------------------------------------------------------------------------------------------------

 Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..3.34 rows=1 width=85)

   Rows out:  1 rows at destination with 1.492 ms to first row, 1.493 ms to end, start offset by 9.610 ms.

   ->  Seq Scan on employees  (cost=0.00..3.34 rows=1 width=85)

         Filter: employee_id = 198::numeric

         Rows out:  1 rows with 0.227 ms to first row, 0.242 ms to end, start offset by 11 ms.

 Slice statistics:

   (slice0)    Executor memory: 183K bytes.

   (slice1)    Executor memory: 201K bytes (seg1).

 Statement statistics:

   Memory used: 128000K bytes

 Total runtime: 11.308 ms

(11 rows)

sachi=> 

Reading the plan from the bottom up, you will see some additional information for each plan node operation. The total elapsed time it took to run this query was 11.308 milliseconds.

The sequential scan operation had only one segment (seg1) that returned rows, and it returned just 1 row. It took 0.227 milliseconds to find the first row and 0.242 milliseconds to scan all rows.

Notice that this is pretty close to the planner's estimate: the query planner estimated that it would return one row for this query, which it did. The gather motion operation (segments sending data up to the master) then received 1 row. The total elapsed time for this operation was 1.493 milliseconds.

Interpretation of Explain Plan (basic) – Explained

NOTE: This is just a basic document on understanding EXPLAIN plans; it will not cover any advanced features.

By definition, EXPLAIN is the command that "displays the execution plan that the database planner generates for the supplied statement. The execution plan shows how the table(s) referenced by the statement will be scanned (by plain sequential scan, index scan, etc.) and, if multiple tables are referenced, what join algorithms will be used to bring together the required rows from each input table."

So in short, it is the best path the planner has determined for running the query, based on the latest statistics and the planner parameter settings (like enable_seqscan, enable_hashjoin, etc.).
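As a quick illustration of how these parameters steer the planner, you can toggle one of them and re-run EXPLAIN (a minimal sketch; whether the plan actually changes depends on what indexes exist on your system):

sachi=# set enable_seqscan = off;  -- heavily penalize sequential scans for this session

sachi=# explain select * from pg_class where relname = 'pg_type';

-- If a usable index on relname exists, the planner will now prefer an
-- index scan; otherwise it still falls back to a (much more expensive) seq scan.

sachi=# set enable_seqscan = on;   -- restore the default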

So let's start with a simple example and break down what the explain plan helps you understand.

Explain 

Example 1

Lets have a look at the simple example.

sachi=# explain select * from pg_class;

QUERY PLAN

---------------------------------------------------------------

Seq Scan on pg_class (cost=0.00..40.37 rows=2359 width=204)

(1 row)

In the above example:

The query will return an estimated 2359 rows using a sequential scan (full table scan) on pg_class. This is an approximate value and might not be the exact row count of the table, since ANALYZE gathers statistics from a sample of rows. Each row returned will be about 204 bytes wide on average.

The startup cost, i.e. the cost spent before the first row can be returned, is 0.00 (not exactly zero, but close to it).

Retrieving all of the rows (i.e. 2359) will cost 40.37.

The cost basically tells you how much work the query requires, measured in units of a single sequential database page read from disk. Since that is not easily measurable in wall-clock terms, you can simply say it will take 40.37 such units to complete the work.

So what did the planner/optimizer do here to arrive at the cost of 40.37?

Let's break it down again.

When you ran the query, the planner invoked the cost-estimation function for the chosen scan method ("seq scan" in the above example). This function looked up the available statistics for the table (pg_class in the above example) in the catalog, namely its page and tuple counts.

Based on those statistics, it calculated the amount of work it would have to do to complete the job.

To see how pages and tuples come into play here, let's look at the current values:

sachi=# select relname,relpages,reltuples from pg_class where relname='pg_class';

 relname  | relpages | reltuples

----------+----------+-----------

 pg_class |       17 |      2339

(1 row)

Time: 1.272 ms

So the estimated cost here is:

(disk pages read * seq_page_cost) + (tuples scanned * cpu_tuple_cost)

where, by default:

seq_page_cost = 1.0

cpu_tuple_cost = 0.01

so the estimated cost = (17 * 1.0) + (2339 * 0.01) = 17 + 23.39 = 40.39

This helps you understand why statistics are really important for the planner to make correct decisions.
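You can reproduce this arithmetic straight from the catalog (a minimal sketch, assuming the default seq_page_cost and cpu_tuple_cost values shown above):

sachi=# select relname, relpages * 1.0 + reltuples * 0.01 as estimated_cost from pg_class where relname = 'pg_class';

-- with the statistics above this returns 40.39, matching the hand calculation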

Example 2 :

Now lets take it a little further , by introducing a where condition. 

sachi=# explain select * from pg_class where relname='pg_type'; 

QUERY PLAN

------------------------------------------------------------

 Seq Scan on pg_class  (cost=0.00..46.21 rows=24 width=238)

   Filter: relname = 'pg_type'::name

(2 rows)

The where condition here was able to reduce the number of rows returned, but the cost has increased from 40 to 46. When you introduce a where condition, the sequential scan still has to scan all 2339 rows and then apply the filter on top of each one, so there is now an additional per-tuple cost in the form of cpu_operator_cost (default value 0.0025).

With this extra cost added, the equation becomes:

(disk pages read * seq_page_cost) + (tuples scanned * cpu_tuple_cost) + (tuples scanned * cpu_operator_cost) = 40.39 + (2339 * 0.0025) = 40.39 + 5.85 = 46.24, which matches the plan's 46.21 apart from rounding in the statistics.

Similarly, the planner has to take other costs into consideration, such as random_page_cost, cpu_index_tuple_cost, and effective_cache_size, depending on its execution path.
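If you want to check what these knobs are currently set to on your system, you can query pg_settings (a quick sketch; these are the standard PostgreSQL/Greenplum parameter names):

sachi=# select name, setting from pg_settings where name in ('seq_page_cost', 'cpu_tuple_cost', 'cpu_operator_cost', 'random_page_cost');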

Example 3

Now let's add a sort key into the picture.

sachi=# explain select * from pg_class where relname='pg_class' order by relname;

QUERY PLAN

------------------------------------------------------------------

 Sort  (cost=46.79..46.85 rows=24 width=238)

   Sort Key: relname

   ->  Seq Scan on pg_class  (cost=0.00..46.24 rows=24 width=238)

         Filter: relname = 'pg_class'::name

(4 rows)

Time: 3.445 ms

When the explain plan has branches, it is always read from the innermost (most indented) line in the branch up to the top; in the above example, the seq scan was the first step, and its data was then passed on to the sort. The total cost of each step includes the cost of the branches below it. Let me explain in detail below.

So here you can see:

The sort gets its data from the seq scan.

The sort is done on the sort key column "relname".

The startup cost of the sort is high, and the total cost is almost the same as the startup cost.

Reason: A sort only returns values when its work is actually done, i.e. it has to read all the rows received from the seq scan and sort them, and only then can it return the first value. So the cost attributable to the sort for getting the first row is 46.79 (sort startup) - 46.24 (seq scan total) = 0.55.

The total cost attributable to the sort operation is 46.85 - 46.24 = 0.61.

The reason I subtracted the total work done by the seq scan from the sort's cost is that the sort had to wait for the seq scan to do its work and pass the rows up before the sort could start. So in general, based on cost, it was not the sort that did most of the work here, it was the seq scan.
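One practical consequence: putting a LIMIT on top of a sort barely helps, because the sort must still consume and order its entire input before the first row can come out. You can see this in the startup cost (a sketch; the exact numbers on your system will differ):

sachi=# explain select * from pg_class order by relname limit 1;

-- Expect a Limit node on top of the Sort; the Limit's startup cost stays
-- close to the Sort's total cost, since the whole input must be sorted
-- before even one row can be returned.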

Example 4 

Let's introduce some joins:

sachi=# explain select * from test_emc join test_emc2 using (a); 

QUERY PLAN

-------------------------------------------------------------------------------------

 Gather Motion 24:1  (slice1; segments: 24)  (cost=690.04..2310.07 rows=834 width=4)

   ->  Hash Join  (cost=690.04..2310.07 rows=834 width=4)

         Hash Cond: test_emc.a = test_emc2.a

         ->  Seq Scan on test_emc  (cost=0.00..1120.00 rows=4167 width=4)

         ->  Hash  (cost=440.02..440.02 rows=834 width=4)

               ->  Seq Scan on test_emc2  (cost=0.00..440.02 rows=834 width=4)

Here the Hash Join is the main branch; it waits on two inputs, the seq scan on test_emc and the hash operation.

The hash operation in turn waits on a seq scan of table test_emc2. So the innermost step is the seq scan on test_emc2, whose output is fed into the hash operation.

The hash operation's startup and total cost are the same (both equal to the total cost of the seq scan on test_emc2), because the hash operation can only send information upward once it has received the full input from the branches below it.

Once the hash operation is done, its result is passed on to the hash join.

The seq scan on test_emc also passes its rows to the hash join.

So the work done by the hash join itself is 690.04 - 0 - 440.02 = 250.02 for the first row, since the hash join starts sending rows upward as soon as it gets rows from its branches, and 2310.07 - 1120.00 - 440.02 = 750.05 for the total work.

EXPLAIN ANALYZE

The only difference between EXPLAIN and EXPLAIN ANALYZE is that the former does not execute the query and reports costs based on estimates, while the latter actually runs the query and reports the actual results.
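Because EXPLAIN ANALYZE really executes the statement, be careful with DML. A common pattern is to wrap it in a transaction and roll back (a sketch, reusing the test_emc table from the examples below):

sachi=# begin;

sachi=# explain analyze delete from test_emc where a = 1;

-- the delete actually runs, so the timings are real ...

sachi=# rollback;  -- ... but the data change is undone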

So let's analyze a few examples of EXPLAIN ANALYZE.

Example 1

sachi=# explain analyze select * from test_emc;

QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------

 Gather Motion 24:1  (slice1; segments: 24)  (cost=0.00..1120.00 rows=4167 width=4)

   Rows out:  100000 rows at destination with 0.891 ms to first row, 195 ms to end, start offset by 0.361 ms.

   ->  Seq Scan on test_emc  (cost=0.00..1120.00 rows=4167 width=4)

         Rows out:  Avg 4166.7 rows x 24 workers.  Max 4187 rows (seg7) with 0.220 ms to first row, 1.738 ms to end, start offset by 1.470 ms.

 Slice statistics:

   (slice0)    Executor memory: 266K bytes.

   (slice1)    Executor memory: 156K bytes avg x 24 workers, 156K bytes max (seg0).

 Statement statistics:

   Memory used: 128000K bytes

 Total runtime: 208.977 ms

(10 rows)

Here:

EXPLAIN ANALYZE tells you that running the query against the present data took 208.977 ms to complete and used up to approximately 128 MB of RAM.

Two slices were used: slice 1 (the senders on the segments) and slice 0 (the receiver on the master).

A slice is a piece of the original plan: the planner splits the plan into smaller sub-plans, which are dispatched to the segments (workers).

Slice 1 spawned 24 worker processes on 24 segments, each worker using about 156K of memory on average; the maximum, 156K, was used on seg0.

Slice 0 used about 266K of memory.

The query ran a seq scan in which the 24 workers returned 4166.7 rows on average; the maximum, 4187 rows, was returned by seg7.

The seq scan took 0.220 ms to return its first row and 1.738 ms to complete the task, and it started 1.470 ms after the beginning of query execution (the start offset).

The seq scan returned 100000 rows in total (4166.7 * 24) to the destination (the Gather Motion on the master): the first row arrived after 0.891 ms, all rows from all the segments had been received by 195 ms, and the motion started 0.361 ms after the beginning of query execution.
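If you are curious which segments those worker IDs map to, you can look at the cluster layout in the gp_segment_configuration catalog (a quick sketch; column names as in standard Greenplum):

sachi=# select dbid, content, role, hostname from gp_segment_configuration order by content;

-- content values 0..23 correspond to seg0..seg23 in the plan output;
-- content -1 is the master.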

Gather Motion is always the topmost step in the explain plan.

A motion refers to data (tuples) moving across the interconnect, i.e. the master has to interact with the segments to get the data. Catalog data that resides on the master does not need a motion.

There are basically three type of motions.

Gather Motion (N:1) - Every segment worker sends its target data to a single node (usually the master), which then passes it on to the end user.

Redistribute Motion (N:N) - Every segment worker rehashes the target data (by the join column) and redistributes each row to the appropriate segment.

Broadcast Motion (N:N) - Every segment sends its target data to all other segments.

So here Gather Motion (24:1) means there are 24 worker processes which send data and one process to receive them.
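To see the other motion types in action, join on a column that is not the distribution key of one of the tables (a sketch; it assumes test_emc2 has an extra column b that it is not distributed by, so the motion actually chosen on your system may differ):

sachi=# explain select * from test_emc join test_emc2 on test_emc.a = test_emc2.b;

-- Because test_emc2 is not distributed by b, its rows cannot be joined
-- locally: expect a Redistribute Motion (or, for a small table, a
-- Broadcast Motion) feeding the join instead of a plain seq scan.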

Example 2

Let's take an example with a join:

sachi=# explain analyze select * from test_emc join test_emc2 using (a);

QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------

 Gather Motion 24:1  (slice1; segments: 24)  (cost=690.04..2310.07 rows=834 width=4)

   Rows out:  20002 rows at destination with 45 ms to first row, 558 ms to end, start offset by 0.444 ms.

   ->  Hash Join  (cost=690.04..2310.07 rows=834 width=4)

         Hash Cond: test_emc.a = test_emc2.a

         Rows out:  Avg 833.4 rows x 24 workers.  Max 862 rows (seg7) with 430 ms to first row, 431 ms to end, start offset by 1.580 ms.

         Executor memory:  20K bytes avg, 21K bytes max (seg7).

         Work_mem used:  20K bytes avg, 21K bytes max (seg7).

         (seg7)  Hash chain length 2.0 avg, 2 max, using 431 of 524341 buckets.

         ->  Seq Scan on test_emc  (cost=0.00..1120.00 rows=4167 width=4)

               Rows out:  Avg 4166.7 rows x 24 workers.  Max 4187 rows (seg7) with 0.090 ms to first row, 1.287 ms to end, start offset by 1.583 ms.

         ->  Hash  (cost=440.02..440.02 rows=834 width=4)

               Rows in:  Avg 833.4 rows x 24 workers.  Max 862 rows (seg7) with 4.252 ms to end, start offset by 425 ms.

               ->  Seq Scan on test_emc2  (cost=0.00..440.02 rows=834 width=4)

                     Rows out:  Avg 833.4 rows x 24 workers.  Max 862 rows (seg7) with 2.314 ms to first row, 3.895 ms to end, start offset by 425 ms.

 Slice statistics:

   (slice0)    Executor memory: 311K bytes.

   (slice1)    Executor memory: 8457K bytes avg x 24 workers, 8457K bytes max (seg0).  Work_mem: 21K bytes max.

 Statement statistics:

   Memory used: 128000K bytes

 Total runtime: 561.475 ms

(20 rows)

Time: 562.537 ms

Here most of the output should be self-explanatory if you have read the example above. The only new line worth discussing is: (seg7) Hash chain length 2.0 avg, 2 max, using 431 of 524341 buckets.

I will not describe this in much depth, but if you know how a hash table works, it basically means the hash table distributed the data into buckets; here it used 431 of the 524341 available buckets, with each used bucket holding a chain of 2 entries on average.