Greenplum Database Performance Tuning

Performance Tuning Approach

DISTRIBUTED BY must be balanced. That is, follow its Data Types as well as columns which either came from a Clustering Index or have been selected as UNIQUE NOT NULL values matching all other tables to make the LOCAL JOIN available

Only build Partitions when these are truly necessary. If the distributions give you the results you need in terms of response time then prevent the creation of a Partition

Indexing is a bad word in Greenplum. Although supported, this is the last resort to a inefficiently written queries which usually have to do to an unbalanced distribution selection and/or some other related predicate that should have been coded (like the distributions are not being used in the JOIN)

Based on Greenplum’s Architecture:

The query planner uses the database statistics it has to choose a query plan with the lowest possible cost. Cost is measured in disk I/O and CPU effort (shown as units of disk page fetches). The goal is to minimize the total execution cost for the plan. You can view the plan for a given query using the EXPLAIN command. This will show the query planner’s estimated plan for the query. For example:

sachi=> select * from employees where employee_id=198;

 employee_id | first_name | last_name |  email   | phone_number |      hire_date      |  job_id  | salary  | commission_pct | manager_id | department_id 

-------------+------------+-----------+----------+--------------+---------------------+----------+---------+----------------+------------+---------------

         198 | Donald     | OConnell  | DOCONNEL | 650.507.9833 | 2007-06-21 00:00:00 | SH_CLERK | 2600.00 |                |        124 |            50

(1 row)

sachi=> 

EXPLAIN ANALYZE causes the statement to be actually executed, not only planned. This is useful for seeing whether the planner’s estimates are close to reality. For example:

sachi=> EXPLAIN ANALYZE select * from employees where employee_id=198;

                                                QUERY PLAN                                                 

-----------------------------------------------------------------------------------------------------------

 Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..3.34 rows=1 width=85)

   Rows out:  1 rows at destination with 1.252 ms to first row, 1.253 ms to end, start offset by 8.761 ms.

   ->  Seq Scan on employees  (cost=0.00..3.34 rows=1 width=85)

         Filter: employee_id = 198::numeric

         Rows out:  1 rows with 0.147 ms to first row, 0.162 ms to end, start offset by 9.741 ms.

 Slice statistics:

   (slice0)    Executor memory: 183K bytes.

   (slice1)    Executor memory: 201K bytes (seg1).

 Statement statistics:

   Memory used: 128000K bytes

 Total runtime: 10.188 ms

(11 rows)

sachi=> 

SELECT in Greenplum - Nov 14, 2014 1:48:1 PM

Get faster vacuums - Nov 06, 2014 1:30:27 AM

GUCs for Index Selection - Oct 31, 2014 1:11:18 PM

Data Types to use in a Distribution Key - Oct 31, 2014 11:43:4 AM

Performance testing for PWX for GP - Sep 09, 2014 7:13:8 PM

Analyzing Query plans - Aug 25, 2014 8:47:33 PM

DISK SPILL - Aug 25, 2014 7:39:54 PM

EXPLAIN AND EXPLAIN ANALYZE - Aug 25, 2014 7:27:2 PM

ANALYZE - Aug 25, 2014 7:20:9 PM

VACUUM Regularly - Aug 25, 2014 7:17:17 PM

Using Indexes to improve performance - Aug 25, 2014 7:7:44 PM

Computational Skew vs Data Skew - Aug 25, 2014 6:55:51 PM

Greenplum Database Storage Model - Aug 25, 2014 6:11:11 PM

Optimizing greenplum database design - Mar 12, 2014 11:41:44 PM

What is Free Space Map? How to size it? - Jan 24, 2014 12:42:17 AM

Checking Query Disk Spill Space Usage - Jan 21, 2014 1:52:17 AM

How to reclaim all expired row space - Oct 17, 2013 3:54:53 PM

Maintaining database statistics - Oct 10, 2013 11:26:13 PM

What to Look for in a Query Plan - Oct 10, 2013 10:50:1 PM

Reading the query plan in Greenplum - Oct 10, 2013 6:48:26 PM