Data Distribution: For the best possible performance, all of the segments should contain equal portions of data. If the data is unbalanced or skewed, the segments with more data must work harder to perform their portion of the query processing. To ensure an even distribution of data, choose a distribution key that is unique for each record, such as the primary key; if that is not possible, choose DISTRIBUTED RANDOMLY.
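For example, a minimal sketch (table and column names are illustrative):

```sql
-- Hash-distribute on a unique column (e.g. the primary key) for an even spread.
CREATE TABLE sales (
    sale_id   bigint,
    cust_id   int,
    amount    numeric(10,2)
) DISTRIBUTED BY (sale_id);

-- If no suitably unique column exists, fall back to random distribution.
CREATE TABLE clickstream_raw (
    payload   text
) DISTRIBUTED RANDOMLY;
```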
Distribution Key: Make sure that tables you commonly join share a distribution key wherever possible. When tables share a common distribution key, Greenplum Database can perform joins, sorts, and aggregations on those columns locally at the segment-level rather than redistributing rows across the system, which results in the most efficient query processing. Local operations are approximately 5 times faster than distributed operations. With a random distribution policy, local operations are not an option.
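To illustrate (hypothetical tables), distributing both sides of a common join on the same key lets each segment join its own rows without any redistribution motion:

```sql
-- Both tables are distributed on cust_id, so the join below executes
-- locally on each segment with no cross-segment data movement.
CREATE TABLE customers (cust_id int, name text)
    DISTRIBUTED BY (cust_id);
CREATE TABLE orders (order_id bigint, cust_id int, total numeric)
    DISTRIBUTED BY (cust_id);

SELECT c.cust_id, count(*) AS order_count, sum(o.total) AS total_spent
FROM   customers c
JOIN   orders o USING (cust_id)
GROUP  BY c.cust_id;
```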
Primary Key: If you need to create a primary key on a table, using a SERIAL column or another sequence generator as the table's primary key is preferred. However, if your application does make use of OIDs to identify specific rows of a table, it is recommended to create a unique constraint on the OID column of that table, to ensure that OIDs in the table will indeed uniquely identify rows even after counter wrap-around.
Unique Constraints: For a unique constraint to be enforced in Greenplum Database, the table must be hash-distributed (not DISTRIBUTED RANDOMLY), and the constraint columns must be the same as (or a superset of) the table’s distribution key columns.
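A short sketch of the rule (illustrative table names):

```sql
-- Works: the UNIQUE columns are a superset of the distribution key (a),
-- so uniqueness can be checked locally on each segment.
CREATE TABLE t1 (a int, b int, UNIQUE (a, b)) DISTRIBUTED BY (a);

-- Rejected: the constraint column does not include the distribution key,
-- so matching rows could land on different segments.
-- CREATE TABLE t2 (a int, b int, UNIQUE (b)) DISTRIBUTED BY (a);
```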
Index/No Index: The general recommendation with Greenplum is to start with no indexes at all. Data distribution is itself a form of indexing, since it places a given row on a specific node/segment; the same principle applies to data partitioning. Together, data distribution and horizontal partitioning let us physically store big logical tables as smaller ones, depending on the number of nodes/segments.
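For instance, a distributed and range-partitioned table (names and dates are illustrative) stores each segment's share of the data in small monthly pieces, which the planner can skip entirely when a query filters on the partition column:

```sql
-- Distribution spreads rows across segments; range partitioning further
-- splits each segment's data into one physical table per month.
CREATE TABLE measurements (
    id       bigint,
    logdate  date,
    reading  numeric
)
DISTRIBUTED BY (id)
PARTITION BY RANGE (logdate)
( START (date '2011-01-01') INCLUSIVE
  END   (date '2012-01-01') EXCLUSIVE
  EVERY (INTERVAL '1 month') );
```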
VACUUM: VACUUM reclaims storage occupied by deleted tuples. In normal GPDB operation, tuples that are deleted or obsoleted by an update are not physically removed from their table. They remain present on disk until a VACUUM is done. Therefore, it is necessary to run VACUUM periodically, especially on frequently updated tables.
ANALYZE: ANALYZE collects statistics about the contents of tables in the database, and stores the results in the system table pg_statistic. Subsequently, the query planner uses these statistics to help determine the most efficient execution plans for queries.
It is a good idea to run ANALYZE periodically, or just after making major changes in the contents of a table. Accurate statistics will help the query planner to choose the most appropriate query plan, and thereby improve the speed of query processing. A common strategy is to run VACUUM and ANALYZE once a day during a low-usage time of day.
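A typical nightly maintenance pass, run during a low-usage window (the table name is illustrative):

```sql
-- Reclaim space from deleted/updated rows, then refresh the planner
-- statistics stored in pg_statistic.
VACUUM sales;
ANALYZE sales;

-- Or, database-wide:
VACUUM;
ANALYZE;
```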
Search Path setting: The search_path parameter specifies the order in which schemas are searched when an object is referenced by a simple name with no schema component. When there are objects of identical names in different schemas, the one found first in the search path is used. The system catalog schema, pg_catalog, is always searched, whether it is mentioned in the path or not. When objects are created without specifying a particular target schema, they are placed in the first schema listed in the search path.
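For example (schema and table names are illustrative):

```sql
-- Unqualified names are resolved against schemas in this order; newly
-- created objects go into the first schema listed (here, reporting).
SET search_path TO reporting, public;
SHOW search_path;

-- Resolves to reporting.daily_totals if it exists there,
-- otherwise to public.daily_totals.
SELECT * FROM daily_totals;
```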
Row-oriented tables: If table data must be updated after it is loaded, choose a row-oriented heap table. If new rows are frequently inserted into the table, consider a row-oriented model. If you typically request all or the majority of columns in the SELECT list or WHERE clause of your queries, consider a row-oriented model. Row-oriented storage is more efficient when many columns are required at the same time, or when the row-size of a table is relatively small.
Column-oriented tables: Column-oriented tables are best suited to queries that aggregate many values of a single column where the WHERE or HAVING predicate is also on the aggregate column. Column-oriented tables are not optimized for write operations, as column values for a row must be written to different places on disk. Column-oriented tables can offer better query performance on wide tables (lots of columns) where you typically only access a small subset of columns in your queries.
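A minimal sketch of a column-oriented table and a column-friendly query (names are illustrative):

```sql
-- Append-only, column-oriented storage: good for scans that read a few
-- columns out of many.
CREATE TABLE metrics_ao (
    metric_id  int,
    ts         timestamp,
    value      double precision
)
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (metric_id);

-- Aggregates one column, with the predicate on that same column, so only
-- the "value" column files need to be scanned.
SELECT avg(value) FROM metrics_ao WHERE value > 100;
```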
Primary key index: There is no need to create an index explicitly for the primary key columns. Greenplum Database automatically creates an index for each unique constraint or primary key constraint to enforce uniqueness.
Foreign Keys: Foreign key constraints are not supported in Greenplum Database.
Avoid VACUUM FULL: Unless you need to return space to the OS so that other tables or other parts of the system can use that space, you should use VACUUM instead of VACUUM FULL. VACUUM FULL is only needed when you have a table that is mostly dead rows, that is, the vast majority of its contents have been deleted. Even then, there is no point using VACUUM FULL unless you urgently need that disk space back for other things or you expect that the table will never again grow to its past size. Do not use it for table optimization or periodic maintenance, as it is counterproductive.
Avoid default first column distribution with many NULL values: Do not let the table distribution default to the first column, which might have lots of NULL or common values. (If a DISTRIBUTED clause is not supplied, then either the PRIMARY KEY, if the table has one, or the first column of the table will be used as the table distribution key.)
Here is how the data moves from *_now to *_history table.
a. *_now is an external table whose data files are stored in $MASTER_DATA_DIRECTORY/gpperfmon/data. Data is stored in *_now during the period between data collection from the Command Center agents and automatic commitment to the *_history table.
b. *_tail is an external table whose data files are stored in $MASTER_DATA_DIRECTORY/gpperfmon/data. This is a transitional table for data that has been cleared from *_now but has not yet been committed to *_history. It typically only contains a few minutes worth of data.
c. *_history is a regular table that stores historical data. It is pre-partitioned into monthly partitions. Partitions are automatically added in two-month increments as needed.
The now and tail data are stored as text files on the master file system and accessed by the Command Center database (gpperfmon) via external tables. This is why the *_now tables are not shown in pgAdmin III or by the \dt command, even though you can still retrieve data from them.
The history tables are regular database tables stored within the Command Center database, so you can see those tables.
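As an illustration, using the standard system_* tables in the gpperfmon database (the one-day window is an arbitrary example):

```sql
-- Recent data can be read from the external "now" table even though
-- \dt does not list it:
SELECT * FROM system_now;

-- Historical data lives in the regular, partitioned history table:
SELECT *
FROM   system_history
WHERE  ctime > now() - interval '1 day';
```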
Greenplum does not provide any support for users of Greenplum Database Community Edition. The only support for the Greenplum Database Community Edition software is via the Greenplum Community Forums: http://www.greenplum.com/community/forums/. If you are an existing Greenplum customer, please contact Greenplum Support as provided under your license.
No. Greenplum Database: Community Edition User Level Agreement prohibits such activity. Section 3.3 – Customer agrees not to: (i) use any application it develops with the Software without securing an appropriate license from EMC; (ii) continue to develop its application without securing an appropriate license from EMC; or (iii) disclose results of any benchmark tests related to the Software without EMC’s prior written consent.
No. The Greenplum Database: Community Edition is only an end-user license and does not cover partner agreements. Please contact an EMC representative to sign a proper partner agreement before you use Greenplum Software to prepare a demo for potential customers.
No. As part of Greenplum’s Community Edition User License Agreement, Section 3.3 – Customer agrees not to: (i) use any application it develops with the Software without securing an appropriate license from EMC; (ii) continue to develop its application without securing an appropriate license from EMC; or (iii) disclose results of any benchmark tests related to the Software without EMC’s prior written consent.
No, Greenplum Database: Community Edition is licensed only for use on one node. Permission or an appropriate license from EMC is needed if the customer wishes to use the software on more than a single node in any environment (including test, development, or production).