Monitoring distribution keys in Greenplum

Post date: Jul 25, 2014 12:16:35 PM

1. Run the following query to find out distribution keys of the every table in the Greenplum database . (Filter can be applied to schema level)

select table_owner,table_name, COALESCE(distribution_keys,'DISTRIBUTED RANDOMLY') distribution_keys

 from

(

SELECT pgn.nspname as table_owner,

pgc.relname as table_name,

pga.attname as distribution_keys

FROM (SELECT gdp.localoid,

CASE

WHEN ( Array_upper(gdp.attrnums, 1) > 0 ) THEN

Unnest(gdp.attrnums)

ELSE NULL

END AS attnum

FROM gp_distribution_policy gdp

ORDER BY gdp.localoid) AS distrokey

INNER JOIN pg_class AS pgc

ON distrokey.localoid = pgc.oid

INNER JOIN pg_namespace pgn

ON pgc.relnamespace = pgn.oid

LEFT OUTER JOIN pg_attribute pga

ON distrokey.attnum = pga.attnum

AND distrokey.localoid = pga.attrelid

ORDER BY pgn.nspname,

pgc.relname) as a;

2. Analyze distribution keys for each table

3. There might be some table where there is no distribution key. Recreate table with proper distribution key.

4. Run the following query to see distributions of table data at segment level.

SELECT COUNT(*), gp_segment_id FROM <table-name> GROUP BY gp_segment_id;

SELECT 'facts' as "Table Name",max(c) as "Max Seg Rows", min(c) as "Min Seg Rows", (max(c)-min(c))*100.0/max(c) as "Percentage Difference Between Max & Min" from (SELECT count(*) c, gp_segment_id from facts group by 2) as a;

5. Analyze the output. If data is not distributed evenly across the segments then analyze the actual data if different column or combination of columns can be used as new distribution keys.

6. Alter table to update the distributions keys.

Altering Table Distribution

ALTER TABLE provides options to change a table’s distribution policy . When the table distribution options change, the table data is redistributed on disk, which can be resource intensive. You can also redistribute table data using the existing distribution policy.

Changing the Distribution Policy

For partitioned tables, changes to the distribution policy apply recursively to the child partitions. This operation preserves the ownership and all other attributes of the table. For example, the following command redistributes the table sales across all segments using the customer_id column as the distribution key:

ALTER TABLE sales SET DISTRIBUTED BY (customer_id);

When you change the hash distribution of a table, table data is automatically redistributed. 

Changing the distribution policy to a random distribution does not cause the data to be redistributed. For example:

ALTER TABLE sales SET DISTRIBUTED RANDOMLY;

Redistributing Table Data

To redistribute table data for tables with a random distribution policy (or when the hash distribution policy has not changed) use REORGANIZE=TRUE. Reorganizing data may be necessary to correct a data skew problem, or when segment resources are added to the system. For example, the following command redistributes table data across all segments using the current distribution policy, including random distribution.

ALTER TABLE sales SET WITH (REORGANIZE=TRUE);

Altering the Table Storage Model

Table storage, compression, and orientation can be declared only at creation. To change the storage model, you must create a table with the correct storage options,

load the original table data into the new table, drop the original table, and rename the new table with the original table’s name. You must also re-grant any table

permissions. For example:

CREATE TABLE sales2 (LIKE sales) WITH (appendonly=true, compresstype=quicklz, compresslevel=1, orientation=column);

INSERT INTO sales2 SELECT * FROM sales;

DROP TABLE sales;

ALTER TABLE sales2 RENAME TO sales;

GRANT ALL PRIVILEGES ON sales TO admin;

GRANT SELECT ON sales TO guest;

Use ALTER TABLE command to add a compressed column to a table. 

ALTER TABLE sachi ADD COLUMN newcompresscol1 int DEFAULT 0 ENCODING (COMPRESSTYPE=zlib);