How to identify distribution keys when moving from Oracle to Greenplum

Post date: Nov 06, 2012 7:17:24 PM

Distribution Policy

When you create or alter a table in Greenplum Database, there is an additional DISTRIBUTED clause to define the distribution policy of the table. The distribution policy determines how to divide the rows of a table across the Greenplum segments. Greenplum Database provides two types of distribution policy:

Hash Distribution - With hash distribution, one or more table columns are used as the distribution key for the table. The distribution key is used by a hashing algorithm to assign each row to a particular segment. Keys of the same value will always hash to the same segment. Choosing a unique distribution key, such as a primary key, will ensure the most even data distribution. Hash distribution is the default distribution policy for a table. If a DISTRIBUTED clause is not supplied, then either the PRIMARY KEY (if the table has one) or the first column of the table will be used as the table distribution key.

Random Distribution - With random distribution, rows are sent to the segments as they come in, cycling across the segments in a round-robin fashion. Rows with columns having the same values will not necessarily be located on the same segment. Although a random distribution ensures even data distribution, there are performance advantages to choosing a hash distribution policy whenever possible.
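
For illustration, here is a minimal sketch of the two policies in DDL; the table and column names are hypothetical:

-- Hash distribution: rows are assigned to segments by hashing the key column(s).
CREATE TABLE customers (
    customer_id   bigint,
    customer_name varchar(100),
    region        varchar(30)
)
DISTRIBUTED BY (customer_id);

-- Random (round-robin) distribution: no suitable key column.
CREATE TABLE click_staging (
    click_time timestamp,
    url        text
)
DISTRIBUTED RANDOMLY;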

Identifying Distribution Keys (Oracle to Greenplum)

1. If a table has a primary key in Oracle, use that primary key as the distribution key in Greenplum.

2. If a table in Oracle has no primary key but one or more unique keys exist, then choose the unique key with the fewest columns as the distribution key in Greenplum.

3. Choose random distribution when the table has no candidate column for the distribution key and the table is small. A sketch of all three cases follows this list.
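
As a rough sketch, the key information can be pulled from Oracle's data dictionary and then mapped onto the Greenplum DDL; the owner, table, and column names below are hypothetical:

-- On the Oracle side: list primary-key ('P') and unique ('U') constraint
-- columns per table for a given schema owner.
SELECT c.table_name, c.constraint_type, cc.column_name, cc.position
FROM   all_constraints  c
JOIN   all_cons_columns cc
       ON  cc.owner = c.owner
       AND cc.constraint_name = c.constraint_name
WHERE  c.owner = 'SALES_OWNER'
AND    c.constraint_type IN ('P', 'U')
ORDER  BY c.table_name, c.constraint_type, cc.position;

-- On the Greenplum side:
-- Rule 1: a primary key exists, so use it as the distribution key.
CREATE TABLE orders (order_id bigint, order_date date, amount numeric)
DISTRIBUTED BY (order_id);

-- Rule 2: no primary key, but a unique key on (account_no) exists.
CREATE TABLE accounts (account_no varchar(20), branch_id int, balance numeric)
DISTRIBUTED BY (account_no);

-- Rule 3: a small table with no candidate key gets random distribution.
CREATE TABLE country_codes (code char(2), country_name varchar(60))
DISTRIBUTED RANDOMLY;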

Use of Indexes in Greenplum - Best Practice

The general recommendation with Greenplum (and shared-nothing architectures in general) is to start with no indexes at all. Put it this way: data distribution is itself a coarse form of indexing, because it places a given subset of the data on a specific node/segment, and the same principle applies to data partitioning. Data distribution and horizontal partitioning let us physically store a big logical table as a set of smaller ones, depending on the number of nodes/segments.

The basic approach in Greenplum is to exploit data distribution for faster full scans of smaller tables, thanks to I/O and CPU scaling (even on a single server, but more obviously on a distributed cluster). However, in some cases, depending on the distribution of your data and your access patterns, you might want to add indexes, for example on foreign keys (beware of updates). I would also evaluate bitmap indexes on fact tables.
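
If profiling does point towards indexes, a minimal sketch might look like the following; the table and column names are hypothetical:

-- B-tree index on a foreign-key join column of a fact table;
-- remember that every update now also has to maintain the index.
CREATE INDEX idx_sales_customer_id ON sales (customer_id);

-- Bitmap index on a low-cardinality fact-table column.
CREATE INDEX idx_sales_status ON sales USING bitmap (order_status);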

Another approach you can try is column-oriented storage on append-only/read-only tables with many columns, which allows Greenplum to sequentially scan only the portions of the table holding the columns required by your query. You can also compress these tables for further performance gains.
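
A minimal sketch of such a table, assuming a wide, read-only fact table and one possible choice of compression settings:

-- Append-only, column-oriented, compressed storage via Greenplum's WITH options.
CREATE TABLE sales_history (
    sale_id     bigint,
    sale_date   date,
    customer_id bigint,
    amount      numeric
    -- ...many more columns in a real fact table
)
WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5)
DISTRIBUTED BY (sale_id);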

My advice is therefore to start with no indexes, then evaluate column-oriented storage. If that is not enough, or not appropriate for your environment, evaluate adding some indexes, including bitmap indexes.

Distributing Data by Partitioning Large Tables

Table partitioning addresses the problem of supporting very large tables, such as fact tables, by allowing you to divide them into smaller and more manageable pieces. Partitioned tables can improve query performance by allowing the Greenplum Database query planner to scan only the relevant data needed to satisfy a given query rather than scanning the entire contents of a large table. Partitioned tables can also be used to facilitate database maintenance tasks, such as rolling old data out of the data warehouse.

Understanding Table Partitioning in Greenplum Database

Tables are partitioned at CREATE TABLE time using the PARTITION BY (and optionally the SUBPARTITION BY) clause. When you partition a table in Greenplum Database, you are actually creating a top-level (or parent) table with one or more levels of sub-tables (or child tables). Internally, Greenplum Database creates an inheritance relationship between the top-level table and its underlying partitions (similar to the functionality of the INHERITS clause of PostgreSQL).
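
For example, here is a minimal sketch of a range-partitioned table (the table name, columns, and date range are hypothetical); Greenplum creates one child table per monthly partition underneath the parent:

CREATE TABLE sales (
    sale_id   bigint,
    sale_date date,
    amount    numeric
)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
(
    START (date '2012-01-01') INCLUSIVE
    END   (date '2013-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);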

Using the partition criteria defined during table creation, each partition is created with a distinct CHECK constraint, which limits the data that table can contain. The CHECK constraints are also used by the query planner to determine which table partitions to scan in order to satisfy a given query predicate.
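
One way to see this at work, assuming the hypothetical sales table from the previous sketch (child-table names such as sales_1_prt_2 are generated by Greenplum, so check the pg_partitions view for the actual names in your system):

-- With a predicate on the partition key, the plan should list only the
-- partitions whose CHECK constraints can contain matching rows.
EXPLAIN SELECT sum(amount)
FROM   sales
WHERE  sale_date >= date '2012-01-01'
AND    sale_date <  date '2012-02-01';

-- Inspect the generated CHECK constraint of one child partition
-- (consrc is present in the PostgreSQL 8.2-based catalog Greenplum uses).
SELECT conname, consrc
FROM   pg_constraint
WHERE  conrelid = 'sales_1_prt_2'::regclass;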

Partition hierarchy information is stored in the Greenplum system catalog so that rows inserted into the top-level parent table appropriately propagate to the child table partitions. Any changes to the partition design or table structure must be done through the parent table using the PARTITION clauses of the ALTER TABLE command.
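
For example (a sketch with hypothetical partition and staging-table names), typical maintenance statements issued against the parent table look like this:

-- Swap a pre-loaded staging table in for the partition containing June 2012.
ALTER TABLE sales EXCHANGE PARTITION FOR (date '2012-06-01')
    WITH TABLE sales_jun2012_staging;

-- Rename the partition that contains a given date.
ALTER TABLE sales RENAME PARTITION FOR (date '2012-06-01') TO jun2012;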

Greenplum Database supports both range partitioning (division of data based on a numerical range, such as date or price) and list partitioning (division of data based on a list of values, such as sales territory or product line), as well as a combination of both types.
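
A sketch combining the two types, again with hypothetical names: monthly range partitions, each sub-partitioned by region using a list template.

CREATE TABLE regional_sales (
    sale_id   bigint,
    sale_date date,
    region    text,
    amount    numeric
)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
    SUBPARTITION BY LIST (region)
    SUBPARTITION TEMPLATE
    (
        SUBPARTITION usa    VALUES ('usa'),
        SUBPARTITION europe VALUES ('europe'),
        SUBPARTITION asia   VALUES ('asia'),
        DEFAULT SUBPARTITION other_regions
    )
(
    START (date '2012-01-01') INCLUSIVE
    END   (date '2013-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);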

Partitioned tables are also distributed across Greenplum Database segments as is any non-partitioned table. Table distribution in Greenplum Database physically divides a table across the Greenplum segments to enable parallel query processing. Table partitioning is a tool to logically divide big tables to improve query performance and facilitate data warehouse maintenance tasks. Partitioning does not change the physical distribution of the table data across the segments.

Deciding on a Table Partitioning Strategy - Best Practice

Not all tables are good candidates for partitioning. If the answer is yes to all or most of the following questions, then table partitioning is a viable database design strategy for improving query performance. If the answer is no to most of the following questions, then table partitioning is not the right solution for that table:

1. Is the table large enough? Large fact tables are good candidates for table partitioning. If you have millions or billions of records in a table, you will see performance benefits from logically breaking that data up into smaller chunks. For smaller tables with only a few thousand rows or less, the administrative overhead of maintaining the partitions will outweigh any performance benefits you might see.

2. Are you experiencing unsatisfactory performance? As with any performance tuning initiative, a table should be partitioned only if queries against that table are producing slower response times than desired.

3. Do your query predicates have identifiable access patterns? Examine the WHERE clauses of your query workload and look for table columns that are consistently used to access data. For example, if most of your queries tend to look up records by date, then a monthly or weekly date-partitioning design might be beneficial. Or if you tend to access records by region, consider a list-partitioning design to divide the table by region.

4. Does your data warehouse maintain a window of historical data? Another consideration for partition design is your organization’s business requirements for maintaining historical data. For example, your data warehouse may only require you to keep the past twelve months’ worth of data. If the data is partitioned by month, you can easily drop the oldest monthly partition from the warehouse and load current data into the most recent monthly partition (see the sketch after this list).

5. Can the data be divided into somewhat equal parts based on some defining criteria? You should choose partitioning criteria that will divide your data as evenly as possible. If the partitions contain a relatively equal number of records, query performance improves with the number of partitions created. For example, by dividing a large table into 10 equal partitions, a query that can be restricted to a single partition scans only about one-tenth of the data and can therefore run up to 10 times faster than it would against the unpartitioned table (provided that the partitions are designed to support the query’s criteria).
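
A sketch of the rolling-window maintenance mentioned in question 4, assuming the hypothetical, monthly range-partitioned sales table from the earlier sketches:

-- Drop the oldest monthly partition (RANK(1) designates the first range partition).
ALTER TABLE sales DROP PARTITION FOR (RANK(1));

-- Add the next month's partition for incoming data.
ALTER TABLE sales ADD PARTITION
    START (date '2013-01-01') INCLUSIVE
    END   (date '2013-02-01') EXCLUSIVE;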