Monitoring Skew

There are 2 kinds of Skew in Greenplum

1. Data Skew

2. Computational Skew also called Query processing skew

Data skew is caused by an uneven distribution of data because of the wrong selection of distribution columns. It is present at the table level, can be easily identified and avoided by selecting optimal distribution columns.

Avoiding Data Skew

Set up even data distribution when you create a table by using a DISTRIBUTED BY clause in the CREATE TABLE statement. Setting the DISTRIBUTED BY based on the unique or primary key, or a key that is relatively unique, is an easy way to avoid skew. You can use the distributed by clause in an ALTER TABLE statement to change distribution in an existing table.

1. You can specify more than one distribution column with a clause like DISTRIBUTED BY (prod_id, name, company_id). If the combined columns are unique or nearly so, even distribution is guaranteed. Otherwise use your knowledge of the data to pick a distribution key that will not have skew.

2. With no DISTRIBUTED clause, the distribution key defaults to the PRIMARY KEY or UNIQUE constraint on the table. If neither constraint is specified the first column of the table is the

default distribution key. If that column is not suitable, data skews will occur. In that case you need to explicitly set the DISTRIBUTED BY to a column or set of columns that will distribute evenly.

3. If you cannot figure out a combination of columns that will distribute without skew, using the DISTRIBUTED RANDOMLY clause will distribute the data randomly and evenly.

4. Use the views gp_skew_coefficients and gp_skew_idle_fractions in gp_toolkit to find out how data is distributed on the segments.

1. gp_skew_coefficients

gpadmin=# \d gp_toolkit.gp_skew_coefficients

View "gp_toolkit.gp_skew_coefficients"

Column    |  Type   | Modifiers 

--------------+---------+-----------

 skcoid       | oid     | 

 skcnamespace | name    | 

 skcrelname   | name    | 

 skccoeff     | numeric | 

View definition:

 SELECT skew.skewoid AS skcoid, pgn.nspname AS skcnamespace, pgc.relname AS skcrelname, skew.skewval AS skccoeff

   FROM gp_toolkit.__gp_skew_coefficients() skew(skewoid, skewval)

   JOIN pg_class pgc ON skew.skewoid = pgc.oid

   JOIN pg_namespace pgn ON pgc.relnamespace = pgn.oid;

This view shows data distribution skew by calculating the coefficient of variation (CV) for the data stored on each segment. This view is accessible to all users, however non-superusers will only be able to see tables that they have permission to access.

skcoid=>The object id of the table.

skcnamespace=>The namespace where the table is defined.

skcrelname=>The table name.

skccoeff=>The coefficient of variation (CV) is calculated as the standard deviation divided by the average. It takes into account both the average and variability around the average of a data series. The lower the value, the better. Higher values indicate greater data skew.

2. gp_skew_idle_fractions

gpadmin=# \d gp_toolkit.gp_skew_idle_fractions

View "gp_toolkit.gp_skew_idle_fractions"

Column | Type | Modifiers 

--------------+---------+-----------

sifoid | oid | 

sifnamespace | name | 

sifrelname | name | 

siffraction | numeric | 

View definition:

SELECT skew.skewoid AS sifoid, pgn.nspname AS sifnamespace, pgc.relname AS sifrelname, skew.skewval AS siffraction

FROM gp_toolkit.__gp_skew_idle_fractions() skew(skewoid, skewval)

JOIN pg_class pgc ON skew.skewoid = pgc.oid

JOIN pg_namespace pgn ON pgc.relnamespace = pgn.oid;

This view shows data distribution skew by calculating the percentage of the system that is idle during a table scan, which is an indicator of processing data skew. This view is accessible to all users, however non-superusers will only be able to see tables that they have permission to access.

sifoid=>The object id of the table.

sifnamespace=>The namespace where the table is defined.

sifrelname=>The table name.

siffraction=>The percentage of the system that is idle during a table scan, which is an indicator of uneven data distribution or query processing skew. For example, a value of 0.1 indicates 10% skew, a value of 0.5 indicates 50% skew, and so on. Tables that have more than 10% skew should have their distribution policies evaluated.

Checking Skew

After loading data always check for data skew using either of the SQL statements. The first query will return one row for each segment. The second query will return a single row for the table.

SELECT COUNT(*), gp_segment_id FROM <table-name> GROUP BY gp_segment_id;

SELECT 'facts' as "Table Name",max(c) as "Max Seg Rows", min(c) as "Min Seg Rows", (max(c)-min(c))*100.0/max(c) as "Percentage Difference Between Max & Min" from (SELECT count(*) c, gp_segment_id from facts group by 2) as a;

Skewed Distribution Can

1. Degrade overall performance

2. Overflow a disk

3. Significantly slow down query processing

Choosing a Distribution Key

1. Optimizes space usage

2.  Improves I/O performance

3. Bases the distribution on your knowledge of the data and the queries that are run against it

Computational skew/Query Processing Skew

Computational skew happens in flight when a query is executing and is not as easy to detect. It can happen for various operations like join, sort, aggregation, and various OLAP operations. Computational skew occurs for operations on columns that have low cardinality and a non-uniform distribution. 

Computational skew may be detected using Greenplum Database Command Center. Search and identify the query through Query Monitor and open the Query Plan tab. For each operator there will be an attribute CPU Skew that can help in detecting computational skew. 

Computational skew may also be detected by analyzing the query plan. Each operator outputs max rows processed by a segment and average rows processed by all segments for that operator. If max row is greater than average rows processed, one of the segments has done more work than the others indicating there might 

be skew at that operator level.

Checking for Query Processing Skew

When a query is being processed, all segments should have equal workloads to ensure the best possible performance. If you identify a poorly-performing query, you may need to investigate further using the EXPLAIN command.

Query processing workload can be skewed if the table’s data distribution policy and the query predicates are not well matched. To check for processing skew, you can run a query such as:

=# SELECT gp_segment_id, count(*) FROM table_name WHERE column='value' GROUP BY gp_segment_id;

This will show the number of rows returned by segment for the given WHERE predicate.