Computational Skew vs Data Skew
Post date: Aug 25, 2014 6:55:51 PM
Data skew is caused by an uneven distribution of data because of the wrong selection of distribution columns. It is present at the table level, can be easily identified and avoided by selecting optimal distribution columns.
Computational skew happens in flight when a query is executing and is not as easy to detect. It can happen for various operations like join, sort,
aggregation, and various OLAP operations. Computational skew occurs for operations on columns that have low cardinality and a non-uniform distribution.
For example, if a row_number function is used on a state column of customer table, it will result in more data flowing to the segment that contains rows for ‘CA’, resulting in computational skew. Computational skew occurs if instead of a two stage aggregation, a one stage aggregation is chosen by the Greenplum Database optimizer for columns with low cardinality.
Computational skew may be detected using Greenplum Database Command Center. Search and identify the query through Query Monitor and open the Query Plan tab. For each operator there will be an attribute CPU Skew that can help in detecting computational skew.
Computational skew may also be detected by analyzing the query plan. Each operator outputs max rows processed by a segment and average rows processed by all segments for that operator. If max row is greater than average rows processed, one of the segments has done more work than the others indicating there might
be skew at that operator level.
If computational skew is detected and the query performance is not acceptable there are few workarounds. The first is to rewrite the query. For example, creating temp tables to eliminate the skew as temp tables can be randomly distributed so the two stage aggregation is forced. Using optimizer GUCs to influence and change the query plan may provide a workaround however, take care when setting these GUCs and only use them at the session or query level.
For example, using a forced broadcast motion of dimension tables can help in eliminating certain join skews. A forced broadcast motion can be enabled setting the gp_segments_for_planner GUC. It is difficult if not impossible to workaround OLAP queries.