Greenplum Database Performance Tuning
Performance Tuning Approach
Data Types / Byte Alignment
Distribution Analysis
Partitioning Analysis
Indexing Strategies (if any)
Explain Plans
DISTRIBUTED BY must be balanced. That is, follow its Data Types as well as columns which either came from a Clustering Index or have been selected as UNIQUE NOT NULL values matching all other tables to make the LOCAL JOIN available
Only build Partitions when these are truly necessary. If the distributions give you the results you need in terms of response time then prevent the creation of a Partition
Indexing is a bad word in Greenplum. Although supported, this is the last resort to a inefficiently written queries which usually have to do to an unbalanced distribution selection and/or some other related predicate that should have been coded (like the distributions are not being used in the JOIN)
Be constant with the execution of ANALYZE for tables that only get INSERT executed against them
Help your DBA find the Distribution Mismatch to help you make your processing work faster
Greenplum Database devises a query plan for each query it is given.
Choosing the right query plan to match the query and data structure is absolutely critical for good performance.
A query plan defines how the query will be executed in Greenplum Database’s parallel execution environment.
By examining the query plans of poorly performing queries, you can identify possible performance tuning opportunities.
Based on Greenplum’s Architecture:
Distribute by JOIN (For LOCAL JOIN Practice)
Partition by Predicate when truly necessary
Index only when truly necessary (not encourage)
As a DBA you must analyze and make sure that all of the tables associated with one another do posses the same DISTRIBUTION KEY set of components (or columns)
A LOCAL JOIN is that join between two or more tables that share the same Distribution column values, just as a Clustering Index does in a conventional RDBMS.
A LOCAL JOIN executes faster than a conventional Clustering INDEX.
The query planner uses the database statistics it has to choose a query plan with the lowest possible cost. Cost is measured in disk I/O and CPU effort (shown as units of disk page fetches). The goal is to minimize the total execution cost for the plan. You can view the plan for a given query using the EXPLAIN command. This will show the query planner’s estimated plan for the query. For example:
sachi=> select * from employees where employee_id=198;
employee_id | first_name | last_name | email | phone_number | hire_date | job_id | salary | commission_pct | manager_id | department_id
-------------+------------+-----------+----------+--------------+---------------------+----------+---------+----------------+------------+---------------
198 | Donald | OConnell | DOCONNEL | 650.507.9833 | 2007-06-21 00:00:00 | SH_CLERK | 2600.00 | | 124 | 50
(1 row)
sachi=>
EXPLAIN ANALYZE causes the statement to be actually executed, not only planned. This is useful for seeing whether the planner’s estimates are close to reality. For example:
sachi=> EXPLAIN ANALYZE select * from employees where employee_id=198;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------
Gather Motion 1:1 (slice1; segments: 1) (cost=0.00..3.34 rows=1 width=85)
Rows out: 1 rows at destination with 1.252 ms to first row, 1.253 ms to end, start offset by 8.761 ms.
-> Seq Scan on employees (cost=0.00..3.34 rows=1 width=85)
Filter: employee_id = 198::numeric
Rows out: 1 rows with 0.147 ms to first row, 0.162 ms to end, start offset by 9.741 ms.
Slice statistics:
(slice0) Executor memory: 183K bytes.
(slice1) Executor memory: 201K bytes (seg1).
Statement statistics:
Memory used: 128000K bytes
Total runtime: 10.188 ms
(11 rows)
sachi=>
SELECT in Greenplum - Nov 14, 2014 1:48:1 PM
Limiting the CPU and Priority of a job - Nov 06, 2014 2:24:3 AM
Get faster vacuums - Nov 06, 2014 1:30:27 AM
GUCs for Improving Short Running Queries - Oct 31, 2014 1:26:36 PM
GUCs for Index Selection - Oct 31, 2014 1:11:18 PM
Data Types to use in a Distribution Key - Oct 31, 2014 11:43:4 AM
Data Types to Avoid in a DISTRIBUTION KEY - Oct 31, 2014 11:40:27 AM
The SQL to review distribution across all segments - Oct 31, 2014 2:8:24 AM
Performance testing for PWX for GP - Sep 09, 2014 7:13:8 PM
Analyzing Query plans - Aug 25, 2014 8:47:33 PM
DISK SPILL - Aug 25, 2014 7:39:54 PM
EXPLAIN AND EXPLAIN ANALYZE - Aug 25, 2014 7:27:2 PM
ANALYZE - Aug 25, 2014 7:20:9 PM
VACUUM Regularly - Aug 25, 2014 7:17:17 PM
Maximizing load/unload performance by optimizing gpfdist performance - Aug 25, 2014 7:14:21 PM
Using Indexes to improve performance - Aug 25, 2014 7:7:44 PM
Maximizing Performance using partitioning - Aug 25, 2014 7:1:6 PM
Computational Skew vs Data Skew - Aug 25, 2014 6:55:51 PM
Data distributions, Data Skew and local joins - Aug 25, 2014 6:46:28 PM
Greenplum Database Storage Model - Aug 25, 2014 6:11:11 PM
Choose data types that use the least possible space - Aug 25, 2014 6:2:44 PM
Optimizing greenplum database design - Mar 12, 2014 11:41:44 PM
What is Free Space Map? How to size it? - Jan 24, 2014 12:42:17 AM
Understanding Greenplum Database Performance Factors - Jan 23, 2014 6:33:45 PM
How to determine acceptable performance in Greenplum - Jan 23, 2014 3:48:16 PM
Monitoring Greenplum Database System Resources Utilization - Jan 23, 2014 3:7:6 PM
Most common causes of performance issues in Greenplum and its solution - Jan 21, 2014 8:35:58 PM
Checking Query Disk Spill Space Usage - Jan 21, 2014 1:52:17 AM
How to reclaim all expired row space - Oct 17, 2013 3:54:53 PM
Tuning Statistics Collection in Greenplum Database - Oct 11, 2013 1:11:59 PM
Maintaining database statistics - Oct 10, 2013 11:26:13 PM
What to Look for in a Query Plan - Oct 10, 2013 10:50:1 PM
Reading EXPLAIN ANALYZE Output in Greenplum - Oct 10, 2013 10:40:24 PM
Reading the query plan in Greenplum - Oct 10, 2013 6:48:26 PM