Maintaining database statistics

Post date: Oct 10, 2013 11:26:13 PM

Greenplum Database uses a cost-based query planner that relies on database statistics. Accurate statistics allow the query planner to better estimate the number of rows retrieved by a query in order to choose the most efficient query plan. Without database statistics, the query planner can not estimate how many records might be returned, and therefore cannot assume it has sufficient memory to perform certain operations such as aggregations. In this case, the planner always takes the safe route and does aggregations by reading/writing from disk, which is significantly slower than doing them in memory. The ANALYZE command collects statistics about the database needed by the query planner.

Identifying Statistics Problems in Query Plans

When looking at the query plan for a query using EXPLAIN or EXPLAIN ANALYZE, it helps to know your data in order to identify possible statistics problems. Check the plan for the following indicators of inaccurate statistics:

1.Are the planner’s estimates close to reality? Run an EXPLAIN ANALYZE and see if the number of rows estimated by the planner is close to the number of rows actually returned by the query operation.

2.Are selective predicates applied early in the plan? The most selective filters should be applied early in the plan so that less rows move up the plan tree.

3.Is the planner choosing the best join order? When you have a query that joins multiple tables, make sure that the planner is choosing the most selective join order. Joins that eliminate the largest number of rows should be done earlier in the plan so that less rows move up the plan tree.