Avoiding Data Skew - Best Practice

posted Dec 4, 2012, 12:54 PM by Sachi Ojha
 Set up even data distribution when you create a table by using a DISTRIBUTED BY clause in the CREATE TABLE statement. Setting the DISTRIBUTED BY based on the unique or primary key, or a key that is relatively unique, is an easy way to avoid skew.  You can use the distributed by clause in an ALTER TABLE statement to change distribution in an existing table.

1. You can specify more than one distribution column with a clause like DISTRIBUTED BY (prod_id, name, company_id). If the combined columns are unique or nearly so, even distribution is guaranteed.  Otherwise use your knowledge of the data to pick a distribution key that will not have skew.

2. With no DISTRIBUTED clause,  the distribution key defaults to the PRIMARY KEY or UNIQUE constraint on the table.  If neither constraint is specified the first column of the table is the 
default distribution key. If that column is not suitable, data skews will occur.  In that case you need to explicitly set the DISTRIBUTED BY to a column or set of columns that will distribute evenly. 

3. If you cannot figure out a combination of columns that will distribute without skew, using the DISTRIBUTED RANDOMLY clause  will distribute the data randomly and evenly. 

4. Use  the views gp_skew_coefficients  and gp_skew_idle_fractions in gp_toolkit to find out how data is distributed on the segments.

Skewed Distribution Can
1. Degrade overall performance
2. Overflow a disk
3. Significantly slow down query processing

Choosing a Distribution Key
1. Optimizes space usage
2.  Improves I/O performance
3. Bases the distribution on your knowledge of the data and the queries that are run against it