Greenplum database new features - Table storage options and distribution policy

If you recall, we use  [ WITH ( storage_parameter=value [, ... ] ) clause to specify storage options when creating a table in Greenplum.  As you see from the create table syntax below, this is optional. 

To simplify using storage clause recursively for many tables. Now you can set  this default using gp_default_storage_options. This feature was introduced in GPDB 4.3.4.0. You can set the storage options with this parameter instead of specifying the table storage options in the WITH clause of the CREATE TABLE command. The table storage options that are specified with the CREATE TABLE command override the values specified by this parameter.

Table storage options and distribution policy

(4.3.4)

Default Database Table Distribution Policy

CREATE TABLE

Defines a new table.

CREATE [[GLOBAL | LOCAL] {TEMPORARY | TEMP}] TABLE table_name (

[ { column_name data_type [ DEFAULT default_expr ] [column_constraint [ ... ]

[ ENCODING ( storage_directive [,...] ) ]

]

 | table_constraint

 | LIKE other_table [{INCLUDING | EXCLUDING}

 {DEFAULTS | CONSTRAINTS}] ...}

 [, ... ] ]

 [column_reference_storage_directive [, …] ]

 )

 [ INHERITS ( parent_table [, ... ] ) ]

 [ WITH ( storage_parameter=value [, ... ] )

 [ ON COMMIT {PRESERVE ROWS | DELETE ROWS | DROP} ]

 [ TABLESPACE tablespace ]

 [ DISTRIBUTED BY (column, [ ... ] ) | DISTRIBUTED RANDOMLY ]

 [ PARTITION BY partition_type (column)

 [ SUBPARTITION BY partition_type (column) ]

 [ SUBPARTITION TEMPLATE ( template_spec ) ]

 [...]

 ( partition_spec )

 | [ SUBPARTITION BY partition_type (column) ]

 [...]

 ( partition_spec

 [ ( subpartition_spec

 [(...)]

 ) ]

 )

where storage_parameter is:

 APPENDONLY={TRUE|FALSE}

 BLOCKSIZE={8192-2097152}

 ORIENTATION={COLUMN|ROW}

 COMPRESSTYPE={ZLIB|QUICKLZ|RLE_TYPE|NONE}

 COMPRESSLEVEL={0-9}

 FILLFACTOR={10-100}

 OIDS[=TRUE|FALSE]

where column_constraint is:

 [CONSTRAINT constraint_name]

 NOT NULL | NULL

 | UNIQUE [USING INDEX TABLESPACE tablespace]

 [WITH ( FILLFACTOR = value )]

 | PRIMARY KEY [USING INDEX TABLESPACE tablespace]

 [WITH ( FILLFACTOR = value )]

 | CHECK ( expression )

and table_constraint is:

 [CONSTRAINT constraint_name]

 UNIQUE ( column_name [, ... ] )

 [USING INDEX TABLESPACE tablespace]

 [WITH ( FILLFACTOR=value )]

 | PRIMARY KEY ( column_name [, ... ] )

 [USING INDEX TABLESPACE tablespace]

 [WITH ( FILLFACTOR=value )]

 | CHECK ( expression )

where partition_type is:

LIST

 | RANGE

where partition_specification is:

partition_element [, ...]

and partition_element is:

 DEFAULT PARTITION name

 | [PARTITION name] VALUES (list_value [,...] )

 | [PARTITION name]

 START ([datatype] 'start_value') [INCLUSIVE | EXCLUSIVE]

 [ END ([datatype] 'end_value') [INCLUSIVE | EXCLUSIVE] ]

 [ EVERY ([datatype] [number | INTERVAL] 'interval_value') ]

 | [PARTITION name]

 END ([datatype] 'end_value') [INCLUSIVE | EXCLUSIVE]

 [ EVERY ([datatype] [number | INTERVAL] 'interval_value') ]

[ WITH ( partition_storage_parameter=value [, ... ] ) ]

[column_reference_storage_directive [, …] ]

[ TABLESPACE tablespace ]

where subpartition_spec or template_spec is:

subpartition_element [, ...]

and subpartition_element is:

 DEFAULT SUBPARTITION name

 | [SUBPARTITION name] VALUES (list_value [,...] )

 | [SUBPARTITION name]

 START ([datatype] 'start_value') [INCLUSIVE | EXCLUSIVE]

 [ END ([datatype] 'end_value') [INCLUSIVE | EXCLUSIVE] ]

 [ EVERY ([datatype] [number | INTERVAL] 'interval_value') ]

 | [SUBPARTITION name]

 END ([datatype] 'end_value') [INCLUSIVE | EXCLUSIVE]

 [ EVERY ([datatype] [number | INTERVAL] 'interval_value') ]

[ WITH ( partition_storage_parameter=value [, ... ] ) ]

[column_reference_storage_directive [, …] ]

[ TABLESPACE tablespace ]

where storage_parameter is:

 APPENDONLY={TRUE|FALSE}

 BLOCKSIZE={8192-2097152}

 ORIENTATION={COLUMN|ROW}

 COMPRESSTYPE={ZLIB|QUICKLZ|RLE_TYPE|NONE}

 COMPRESSLEVEL={0-9}

 FILLFACTOR={10-100}

 OIDS[=TRUE|FALSE]

where storage_directive is:

 COMPRESSTYPE={ZLIB | QUICKLZ | RLE_TYPE | NONE}

 | COMPRESSLEVEL={0-9}

 | BLOCKSIZE={8192-2097152}

Where column_reference_storage_directive is:

COLUMN column_name ENCODING (storage_directive [, ... ] ), ...

 |

DEFAULT COLUMN ENCODING (storage_directive [, ... ] )

gp_default_storage_options : Set the default values for the following table storage options when creating a table with the CREATE TABLE command.

APPENDONLY

BLOCKSIZE

CHECKSUM

COMPRESSTYPE

COMPRESSLEVEL

ORIENTATION

Note: Specify multiple storage option values as a comma separated list. 

Not all combinations of storage option values are valid. If the specified storage options are not valid, an error is returned.

The defaults can be set for a database and user. If the server configuration parameter is set at different levels, this the order of precedence, from highest to lowest, of the table storage values when a user logs into a database and creates a table:

The parameter value is not cumulative. For example, if the parameter specifies the APPENDONLY and COMPRESSTYPE options for a database and a user logs in and sets the parameter to specify the value for the ORIENTATION option, the APPENDONLY, and COMPRESSTYPE values set at the database level are ignored.

This example ALTER DATABASE command sets the default ORIENTATION and COMPRESSTYPE table storage options for the database sachi.

ALTER DATABASE sachi SET gp_default_storage_options = 'orientation=column, compresstype=rle_type'

To create an append-optimized table in the sachi database with column-oriented table and RLE compression. The user needs to specify only APPENDONLY=TRUE in the WITH clause.

This example gpconfig utility command sets the default storage option for a Greenplum Database system. If you set the defaults for multiple table storage options, the value must be enclosed in single quotes and then in double quotes.

gpconfig -c 'gp_default_storage_options' -v "'appendonly=true, orientation=column'"

This example gpconfig utility command shows the value of the parameter. The parameter value must be consistent across the Greenplum Database master and all segments.

gpconfig -s 'gp_default_storage_options'

VALUE RANGE

APPENDONLY= TRUE | FALSE

DEFAULT

APPENDONLY=FALSE

SET CLASSIFICATIONS

master

BLOCKSIZE= integer between 8192 and 2097152

CHECKSUM= TRUE | FALSE

COMPRESSTYPE= ZLIB | QUICKLZ |RLE_TYPE | NONE

COMPRESSLEVEL= integer between 0 and 9

ORIENTATION= ROW | COLUMN

BLOCKSIZE=32768

CHECKSUM=TRUE

COMPRESSTYPE=none

COMPRESSLEVEL=0

ORIENTATION=ROW

session

reload

Note: The set classification when the parameter is set at the system level with the gpconfig utility.

Default Database Table Distribution Policy

When you create a table without distributed by clause, Greenplum chose first valid column to distribute (hash) the table data.  This is the default behavior. In GPDB 4.3.4 and onward you can change this default behavior to random distribution using parameter gp_create_table_random_default_distribution. By default this parameter is set to OFF.

sachi=# show gp_create_table_random_default_distribution

sachi-# ;

gp_create_table_random_default_distribution

---------------------------------------------

off

(1 row)

sachi=# create table testdefaultdist( id numeric(2), name varchar(12), dob timestamp, address text);

NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.

HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.

CREATE TABLE

sachi=#

Note that the value of the parameter is OFF, (the default), and the table creation command does not contain a DISTRIBUTED BY clause, Greenplum Database chooses the distribution key (  Using column named 'id' as the Greenplum Database data distribution key for this table) based on the table creation command.

If the value of the parameter is ON, the table is created with a random distribution if the command does not specify PRIMARY KEY or UNIQUE columns.

If the value of the parameter is OFF, (the default), and the table creation command does not contain a DISTRIBUTED BY clause, Greenplum Database chooses the table distribution key based on the command. If the LIKE or INHERITS clause is specified in table creation command, the created table uses the same distribution key as the source or parent table.

If the value of the parameter is set to ON, Greenplum Database follows these rules to create a table when the DISTRIBUTED BYclause is not specified:

If PRIMARY KEY or UNIQUE columns are not specified, the distribution of the table is random (DISTRIBUTED RANDOMLY). Table distribution is random even if the table creation command contains the LIKE or INHERITS clause.

If PRIMARY KEY or UNIQUE columns are specified, a DISTRIBUTED BY clause must also be specified. If a DISTRIBUTED BY clause is not specified as part of the table creation command, the command fails.

VALUE RANGE

boolean

DEFAULT

off

SET CLASSIFICATIONS

master

system

reload