What are the Table Distribution Policy in Greenplum

Post date: Nov 02, 2012 11:1:37 PM

All Greenplum Database tables are distributed. When you create or alter a table, there is an optional DISTRIBUTED BY (hash distribution) or DISTRIBUTED RANDOMLY (round-robin distribution) clause to declare how the rows of the table should be distributed.

The following considerations should be taken into account when declaring a distribution policy for a table (in order of importance):

1. Even Data Distribution — For the best possible performance, all of the segments should contain equal portions of data. If the data is unbalanced or skewed, then the segments with more data will have to work harder to perform their portion of the query processing. To ensure an even distribution of data, you want to choose a distribution key that is unique for each record, such as the primary key.

2. Local and Distributed Operations — During query processing, it is faster if the work associated with join, sort or aggregation operations can be done locally at the segment-level rather than at the system-level (distributing tuples across the segments). When tables share a common distribution key in Greenplum Database, joining or sorting on their shared distribution key columns will result in the most efficient query processing, as the majority of the work is done locally at the segment-level. Local operations are approximately 5 times faster than distributed operations. With a random distribution policy, local operations are not an option.

3. Even Query Processing — When a query is being processed, you want all of the segments to handle an equal amount of the query workload to get the best possible performance. In some cases, query processing workload can be skewed if the table’s data distribution policy and the query predicates are not well matched. For example, suppose you have a table of sales transactions. The table is distributed based on a column that contains corporate names as values. The hashing algorithm distributes the data based on the values of the distribution key, so if a predicate in a query references a single value from the distribution key, the work in the query will run on only one segment. This may be a viable distribution policy if your query predicates tend to select data on a criteria other than corporation name. However, for queries that do use corporation name in their predicates, you can potentially have just one segment instance handling all of the query workload.

Declaring Distribution Keys

When creating a table, there is an additional clause to declare the Greenplum Database distribution policy. If a DISTRIBUTED BY or DISTRIBUTED RANDOMLY clause is not supplied, then Greenplum assigns a hash distribution policy to the table using either the PRIMARY KEY (if the table has one) or the first column of the table as the distribution key. Columns of geometric or user-defined data types are not eligible as Greenplum distribution key columns. If a table does not have a column of an eligible data type, the rows are distributed based on a round-robin or random distribution.

To ensure an even distribution of data, you want to choose a distribution key that is unique for each record, or if that is not possible, then choose DISTRIBUTED RANDOMLY. For example:

=> CREATE TABLE products

(name varchar(40),

prod_id integer,

supplier_id integer)

DISTRIBUTED BY (prod_id);

=> CREATE TABLE random_stuff

(things text,

doodads text,

etc text)

DISTRIBUTED RANDOMLY;

Choosing the Table Storage Model

Greenplum Database provides an agile and flexible processing engine capable of supporting several storage models (or a hybrid of storage models). When you create a new table, you have several options as to how its data is stored on disk. This section explains the various options for table storage and how to decide on the best storage model for your workload.

1.Choosing Heap or Append-Only Storage

2.Choosing Row or Column-Oriented Storage

3.Using Compression (Append-Only Tables Only)

4.Checking the Compression and Distribution of an Append-Only Table

Choosing Heap or Append-Only Storage

By default, Greenplum Database uses the same heap storage model as PostgreSQL. Heap table storage favors OLTP-type workloads where the data is often modified after it is initially loaded. UPDATE and DELETE operations require row-level versioning information to be stored in order to ensure that database transactions are processed reliably. Heap tables are best suited for smaller tables, such as dimension tables, that are often updated after they are initially loaded.

Greenplum Database also offers an append-only table storage model. Append-only table storage favors denormalized fact tables in a data warehouse environment, which are typically the largest tables in the system. Fact tables are typically loaded in batches, and then accessed by read-only queries. Data is not updated after it is loaded. Moving large fact tables to an append-only storage model eliminates the storage overhead of the per-row update visibility information (about 20 bytes per row is saved). This allows for a leaner and easier-to-optimize page structure. Append-only tables do not allow UPDATE and DELETE operations. The storage model of append-only tables is optimized for bulk data loading. Single row INSERT statements are not recommended.

To create a heap table

Row-oriented heap tables are the default storage type, so no extra CREATE TABLE command syntax is required to create a heap table. For example:

=> CREATE TABLE foo (a int, b text) DISTRIBUTED BY (a);

To create an append-only table

The WITH clause of the CREATE TABLE command is used to declare the storage options of the table. If not declared, the table will be created as a regular row-oriented heap-storage table. For example, to create an append-only table with no compression:

=> CREATE TABLE bar (a int, b text)

WITH (appendonly=true)

DISTRIBUTED BY (a);

Choosing Row or Column-Oriented Storage

Greenplum provides a choice of storage orientation models: row or column (or a hybrid of both). This section provides some general guidelines for choosing the correct storage orientation for a table; however you are encouraged to evaluate performance using your own data and query workloads.

For most general purpose or mixed workloads, row-oriented storage offers the best combination of flexibility and performance. However, there are certain specific use cases where a column-oriented storage model provides more efficient I/O and storage. Consider the following requirements when deciding on the storage orientation model of a table:

1. Updates of table data. If table data must be updated after it is loaded, choose a row-oriented heap table. Column-oriented table storage is only available on append-only tables. 

2.Frequent INSERTs. If new rows are frequently inserted into the table, consider a row-oriented model. Column-oriented tables are not optimized for write operations, as column values for a row must be written to different places on disk.

3.Number of columns requested in queries. If you typically request all or the majority of columns in the SELECT list or WHERE clause of your queries, consider a row-oriented model. Column-oriented tables are best suited to queries that aggregate many values of a single column where the WHERE or HAVING predicate is also on the aggregate column,

SELECT SUM(salary)...

SELECT AVG(salary)... WHERE salary > 10000

or where the WHERE predicate is on a single column and is highly selective (returns a relatively small number of rows).

SELECT salary, dept ... WHERE state='CA'

4. Number of columns in the table. Row-oriented storage is more efficient when many columns are required at the same time, or when the row-size of a table is relatively small. Column-oriented tables can offer better query performance on wide tables (lots of columns) where you typically only access a small subset of columns in your queries.

5. Compression. Since column data is of the same data type, there are some storage size optimizations available in column-oriented data that are not available in row-oriented data. For example, many compression schemes make use of the similarity of adjacent data to compress. However, the greater adjacent compression achieved, the more difficult random access may become, as data might need to be uncompressed to be read.

To create a column-oriented table

The WITH clause of the CREATE TABLE command is used to declare the storage options of the table. If not declared, the table will be created as a row-oriented heap table. Tables that use column-oriented storage must also be append-only tables. For example, to create a column-oriented table:

=> CREATE TABLE bar (a int, b text)

WITH (appendonly=true, orientation=column)

DISTRIBUTED BY (a);

Using Compression (Append-Only Tables Only)

There are two types of in-database compression available in the Greenplum Database for append-only tables, table-level compression that is applied to an entire table and column-level compression that is applied to a specific column. When using column-level compression, different compression algorithms can be applied to different columns. The following table summarized the available compression algorithms.

Compression Algorithms for Append-only Tables

Table Orientation  Available Compression Types Supported Algorithms

------------------------   -------------------------------   ------------------------------------------------------------

Row                    Table                           ZLIB and QUICKLZ

Column               Column and Table         RLE_TYPE, ZLIB, and QUICKLZ

--------------------------------------------------------------------------------------------------------------------------

Using in-database compression requires that your segment systems have the available CPU power to compress and uncompress the data. Compressed append-only tables should not be used on file systems that also are using compression. If the file system where your segment data directory resides is a compressed file system, your append-only table should not use compression.

When choosing a compression type and level for append-only tables, consider these factors:

1. CPU usage

2. Compression ratio/disk size

3. Speed of compression

4. Speed of decompression/scan rate

Though minimizing disk size may be the main goal in compressing tables, the time and CPU capacity required to compress and scan data is also important to consider. Every system has an optimal range of settings where compression most efficiently reduces data size without causing excessively long compression times or slow scan rates.

QuickLZ compression generally uses less CPU capacity and compresses data faster at a lower compression ratio than zlib. Conversely, zlib provides higher compression ratios at lower speeds. At compression level 1 (compresslevel=1), QuickLZ and zlib may yield comparable compression ratios (though at different speeds). However, using zlib compression at a higher level of 6 might dramatically increase its advantage over QuickLZ in compression ratio (though consequently lowering the speed of compression).

Performance with compressed append-only tables depends on hardware, query tuning settings, and other factors. Greenplum recommends performing comparison testing to determine the actual performance in your environment.

To create a compressed table

The WITH clause of the CREATE TABLE command is used to declare the storage options of the table. Tables that use compression must also be append-only tables. For example, to create an append-only table with zlib compression at a compression level of 5:

=> CREATE TABLE foo (a int, b text)

WITH (appendonly=true, compresstype=zlib,

compresslevel=5);