Most common causes of performance issues in Greenplum and their solutions

Post date: Jan 21, 2014 8:35:58 PM

The most common causes of performance problems in Greenplum Database are:

1. Hardware and Segment Failures

2. Database Design

3. Data Distribution

4. Contention

5. Database Statistics

6. Workload

1. Identifying Hardware and Segment Failures

Greenplum Database is composed of several servers (hosts) acting together as one cohesive system (array). Greenplum Database is only as fast as the slowest host in the array.

Problems with CPU utilization, memory management, I/O processing, or network load affect performance. 

The most common hardware-related issues are:

1. Disk Failure: Although a single disk failure should not dramatically affect database performance if you are using RAID, disk resynchronization does consume resources on the host with the failed disk.

=>The gpcheckperf utility can help identify segment hosts that have disk I/O issues.
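
For example, the following runs the disk I/O and memory bandwidth tests on every host listed in the file (the host file name and data directories are illustrative):

gpcheckperf -f hostfile_gpcheckperf -r ds -D -d /data1 -d /data2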

2. Host Failure: When a host is offline, the segments on that host are nonoperational, and the other hosts in the array must perform up to twice their usual workload because they run their own primary segments plus the mirrors of the failed host's segments. If mirrors are not enabled, service is interrupted entirely; with mirrors, service is interrupted only briefly while the failed segments are recovered.

=>The gpstate utility helps identify failed segments.
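
For example:

gpstate -e    # list segments with potential problems (e.g., in change-tracking mode)
gpstate -m    # show the status of mirror segments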

3. Network Failure: Failure of a network interface card, a switch, or a DNS server can bring down segments. If host names or IP addresses cannot be resolved within your Greenplum array, these failures manifest as interconnect errors in Greenplum Database.

=>The gpcheckperf utility helps identify segment hosts that have network issues.
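
For example, the following runs the parallel-pair network test between the hosts listed in the file (the host file name is illustrative):

gpcheckperf -f hostfile_gpchecknet -r N -d /tmp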

4. Disk Capacity: 

- Disk capacity on your segment hosts should never exceed 70 percent full.

- Greenplum Database needs some free space for runtime processing. 

To reclaim disk space that deleted rows occupy, run VACUUM after loads or updates. 

The gp_toolkit administrative schema has many views for checking the size of distributed database objects. 
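
For example, to reclaim space and then check database and disk sizes (the sales table name is illustrative; the views shown are documented gp_toolkit views):

VACUUM sales;
SELECT * FROM gp_toolkit.gp_size_of_database;
SELECT * FROM gp_toolkit.gp_disk_free;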

Contact Greenplum Support about fixing hardware failure issues. 

2. Optimizing Database Design

Many performance issues can be resolved through better database design. Examine your database design and consider the following questions; a sketch illustrating several of these points follows the list:

1. Does the schema reflect the way the data is accessed?

2. Can larger tables be broken down into partitions?

3. Are you using the smallest data type possible to store column values?

4. Are the columns used to join tables of the same data type?

5. Are your indexes being used?
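
For example, a minimal sketch (table and column names are illustrative) that uses compact data types, an explicit distribution key, and monthly range partitions for a large fact table:

CREATE TABLE sales (
    sale_id   bigint,
    region    varchar(20),
    sale_date date,
    amount    numeric(10,2)
)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
( START (date '2013-01-01') INCLUSIVE
  END (date '2014-01-01') EXCLUSIVE
  EVERY (INTERVAL '1 month') );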


3. Optimizing Data Distribution

When you create a table in Greenplum Database, you must declare a distribution key that allows for even data distribution across all segments in the system. Because the segments work on a query in parallel, Greenplum Database is only as fast as the slowest segment. If the data is unbalanced, the segments with more data return their results more slowly and therefore slow down the entire system.
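
A quick way to check for skew is to count rows per segment (the sales table name is illustrative):

SELECT gp_segment_id, count(*) FROM sales GROUP BY gp_segment_id ORDER BY 1;

Roughly equal counts across segments indicate even distribution; one segment with far more rows indicates a poor distribution key.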

4. Avoiding Contention

Contention arises when multiple users or workloads try to use the system in a conflicting way; for example, contention occurs when two transactions try to update a table simultaneously. A transaction that seeks a table-level or row-level lock will wait indefinitely for conflicting locks to be released. 

Applications should not hold transactions open for long periods of time, for example, while waiting for user input.
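
To see which sessions are waiting on locks, you can query the pg_locks system view (a minimal sketch; column availability can vary by version):

SELECT locktype, relation::regclass AS relation, mode, granted, pid
FROM pg_locks
WHERE NOT granted;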

5. Maintaining Database Statistics

Greenplum Database uses a cost-based query planner that relies on database statistics. Accurate statistics allow the query planner to better estimate the number of rows a query retrieves and to choose the most efficient query plan. Without database statistics, the query planner cannot estimate how many records will be returned. The planner does not assume it has sufficient memory to perform certain operations, such as aggregations, so it takes the most conservative action and performs these operations by reading from and writing to disk, which is significantly slower than doing them in memory. ANALYZE collects the statistics about the database that the query planner needs.

Before you interpret a query plan with EXPLAIN or EXPLAIN ANALYZE, familiarize yourself with the data so you can identify possible statistics problems. Check the plan for the following indicators of inaccurate statistics:

1. Are the planner’s estimates close to reality? Run EXPLAIN ANALYZE and check whether the number of rows the planner estimated is close to the number of rows the query operation actually returned (see the example after this list).

2. Are selective predicates applied early in the plan? The most selective filters should be applied early in the plan so that fewer rows move up the plan tree.

3. Is the planner choosing the best join order? When a query joins multiple tables, make sure the planner chooses the most selective join order. Joins that eliminate the largest number of rows should be done earlier in the plan so that fewer rows move up the plan tree.
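
For example (the sales table is illustrative), compare the planner's estimated rows with the actual rows reported for each plan node in the output:

EXPLAIN ANALYZE SELECT region, sum(amount) FROM sales GROUP BY region;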

See also - Query Profiling

Tuning Statistics Collection: The following configuration parameters control the amount of data sampled for statistics collection:

1. default_statistics_target

2. gp_analyze_relative_error

These parameters control statistics sampling at the system level. Rather than increasing sampling everywhere, it is usually better to increase statistics only for the columns used most frequently in query predicates. You can adjust statistics for a particular column using the command:

ALTER TABLE...SET STATISTICS

For example:

ALTER TABLE sales ALTER COLUMN region SET STATISTICS 50;

This is equivalent to increasing default_statistics_target for a particular column. Subsequent ANALYZE operations will then gather more statistics data for that column and produce better query plans as a result.
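
To change sampling at the system level instead, the parameters above can be set cluster-wide with the gpconfig utility and reloaded without a restart (the value 100 is illustrative):

gpconfig -c default_statistics_target -v 100
gpstop -u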


6. Checking Workload 

A Greenplum Database system has limited CPU capacity, memory, and disk I/O resources. Check your hardware configuration details (full-rack, half-rack, or quarter-rack DCA) to learn about the available resources.

When multiple workloads compete for access to these resources, database performance suffers. Resource queues are used in Greenplum Database to manage the workload.

Workload management maximizes system throughput while meeting varied business requirements. 

With role-based resource queues, Greenplum Database workload management limits active queries and conserves system resources.

A resource queue limits the number and/or total cost of queries that the users or roles assigned to that queue can execute concurrently.

By assigning all your database roles to the appropriate resource queue, administrators can control concurrent user queries and prevent system overload. 
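
For example, a minimal sketch (queue and role names are illustrative):

CREATE RESOURCE QUEUE adhoc WITH (ACTIVE_STATEMENTS=3, PRIORITY='LOW');
ALTER ROLE analyst RESOURCE QUEUE adhoc;

This limits roles assigned to the adhoc queue to three concurrently active statements and runs their queries at low priority.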

See “Managing Workload and Resources” for more information about setting up resource queues.

Greenplum Database administrators should run maintenance workloads such as data loads and VACUUM ANALYZE operations after business hours. Do not compete with database users for system resources; perform administrative tasks at low-usage times.