Greenplum Database Performance Tuning

Performance Tuning Approach

1. Data Types / Byte Alignment
2. Distribution Analysis
3. Partitioning Analysis
4. Indexing Strategies (if any)
5. Explain Plans

The DISTRIBUTED BY key must be balanced, that is, it must spread rows evenly across the segments. Choose columns whose data types match across tables, favoring columns that either came from a clustering index or hold UNIQUE NOT NULL values, and use the same columns in all the other tables that join on them so that a LOCAL JOIN is available.
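As a rough way to check whether a distribution key is balanced, rows can be counted per segment using the gp_segment_id system column. The sketch below assumes the employees table used later in this document has a unique, NOT NULL employee_id column; the table name employees_by_id is hypothetical and only illustrates the DISTRIBUTED BY clause.

```sql
-- Hypothetical copy of employees, distributed on a unique, NOT NULL column.
CREATE TABLE employees_by_id AS
SELECT * FROM employees
DISTRIBUTED BY (employee_id);

-- Rows per segment: counts that differ widely between segments indicate skew.
SELECT gp_segment_id, count(*) AS row_count
FROM employees_by_id
GROUP BY gp_segment_id
ORDER BY gp_segment_id;
```

Roughly equal counts across segments suggest the key is balanced; one segment holding most of the rows suggests a different key should be considered.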

Only build partitions when they are truly necessary. If the distribution alone gives you the response times you need, do not create partitions.
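For the cases where partitioning is justified, a minimal sketch might look like the following. The sales table, its columns, and the date range are hypothetical; the point is only that the partition column is the one that appears in the query predicate.

```sql
-- Hypothetical fact table: distributed on the join key, partitioned on the predicate column.
CREATE TABLE sales (
    sale_id   bigint NOT NULL,
    sale_date date   NOT NULL,
    amount    numeric(10,2)
)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
(
    START (date '2023-01-01') INCLUSIVE
    END   (date '2024-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);

-- A predicate on sale_date gives the planner the chance to skip the other partitions.
SELECT sum(amount)
FROM sales
WHERE sale_date >= date '2023-03-01'
  AND sale_date <  date '2023-04-01';
```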

Indexing is a bad word in Greenplum. Although indexes are supported, they are a last resort for inefficiently written queries, whose real problem usually lies in an unbalanced distribution key and/or in a predicate that should have been coded differently (for example, the distribution columns are not being used in the JOIN).

1. Run ANALYZE regularly on tables that only have INSERTs executed against them, so that the planner's statistics stay current
2. Work with your DBA to find distribution mismatches; fixing them makes your processing run faster
3. Greenplum Database devises a query plan for each query it is given.
4. Choosing the right query plan to match the query and data structure is absolutely critical for good performance.
5. A query plan defines how the query will be executed in Greenplum Database’s parallel execution environment.
6. By examining the query plans of poorly performing queries, you can identify possible performance tuning opportunities.
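The first point in the list above can be sketched as follows, using the employees table from the examples later in this document. Querying pg_class afterwards is just one assumed way to confirm that the statistics were refreshed.

```sql
-- Refresh planner statistics after a batch of INSERTs.
ANALYZE employees;

-- Inspect the row and page estimates the planner will work from.
SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname = 'employees';
```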

Based on Greenplum’s Architecture:

1. Distribute by JOIN (For LOCAL JOIN Practice)
2. Partition by Predicate when truly necessary
3. Index only when truly necessary (not encouraged)
4. As a DBA you must analyze the tables and make sure that all tables associated with one another possess the same DISTRIBUTION KEY set of components (or columns)
5. A LOCAL JOIN is a join between two or more tables that share the same distribution column values, so matching rows already sit on the same segment. It plays a role similar to that of a clustering index in a conventional RDBMS.
6. A LOCAL JOIN typically executes faster than a join that relies on a conventional clustering index.
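The idea of distributing by the join key can be illustrated with a small sketch. The orders and order_items tables are hypothetical; what matters is that both are distributed on the same column with the same data type.

```sql
-- Hypothetical tables: both distributed on order_id, with identical data types.
CREATE TABLE orders (
    order_id   bigint NOT NULL,
    order_date date
) DISTRIBUTED BY (order_id);

CREATE TABLE order_items (
    order_id bigint NOT NULL,
    item_no  int    NOT NULL,
    amount   numeric(10,2)
) DISTRIBUTED BY (order_id);

-- Joining on the shared distribution key lets each segment join its own rows.
EXPLAIN
SELECT o.order_id, sum(i.amount) AS total
FROM orders o
JOIN order_items i ON i.order_id = o.order_id
GROUP BY o.order_id;
```

If the resulting plan shows a Redistribute Motion or Broadcast Motion feeding the join, rows are being moved between segments and the join is not local. That usually points to mismatched distribution keys or mismatched data types on the join columns.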

The query planner uses the database statistics it has to choose a query plan with the lowest possible cost. Cost is measured in disk I/O and CPU effort (shown as units of disk page fetches). The goal is to minimize the total execution cost for the plan. You can view the plan for a given query using the EXPLAIN command, which shows the query planner’s estimated plan without running the query. The examples here use the following query, which returns a single row:

sachi=> select * from employees where employee_id=198;
-------------+------------+-----------+----------+--------------+---------------------+----------+---------+----------------+------------+---------------
198 | Donald | OConnell | DOCONNEL | 650.507.9833 | 2007-06-21 00:00:00 | SH_CLERK | 2600.00 | | 124 | 50
(1 row)
sachi=>
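To see only the estimated plan for this query, it can be prefixed with EXPLAIN. The output is not reproduced here, since it depends on the cluster and on the current statistics.

```sql
-- Show the planner's estimated plan only; the query itself is not executed.
EXPLAIN SELECT * FROM employees WHERE employee_id = 198;
```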

EXPLAIN ANALYZE causes the statement to be actually executed, not only planned. This is useful for seeing whether the planner’s estimates are close to reality. For example:

sachi=> EXPLAIN ANALYZE select * from employees where employee_id=198;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------
Gather Motion 1:1 (slice1; segments: 1) (cost=0.00..3.34 rows=1 width=85)
  Rows out: 1 rows at destination with 1.252 ms to first row, 1.253 ms to end, start offset by 8.761 ms.
  -> Seq Scan on employees (cost=0.00..3.34 rows=1 width=85)
       Filter: employee_id = 198::numeric
       Rows out: 1 rows with 0.147 ms to first row, 0.162 ms to end, start offset by 9.741 ms.
Slice statistics:
  (slice0) Executor memory: 183K bytes.
  (slice1) Executor memory: 201K bytes (seg1).
Statement statistics: