Greenplum Database Users FAQ

Do's and don'ts for Greenplum database users

posted Jun 19, 2013, 3:54 PM by Sachchida Ojha   [ updated Jun 19, 2013, 3:56 PM ]

Do’s



  • Data Distribution: For the best possible performance, all of the segments should contain equal portions of data. If the data is unbalanced or skewed, the segments with more data must work harder to perform their share of the query processing. To ensure an even distribution of data, choose a distribution key that is unique for each record, such as the primary key; if that is not possible, use DISTRIBUTED RANDOMLY.
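As a sketch, the two policies look like this in DDL (the table and column names here are hypothetical):

```sql
-- Unique order id (the primary key) makes a good distribution key
CREATE TABLE sales_orders (
    order_id    bigint,
    customer_id int,
    amount      numeric
) DISTRIBUTED BY (order_id);

-- No suitably unique column: fall back to random distribution
CREATE TABLE web_clicks (
    page_url text,
    hit_time timestamp
) DISTRIBUTED RANDOMLY;

-- Spot-check skew: per-segment row counts should be roughly equal
SELECT gp_segment_id, count(*) FROM sales_orders GROUP BY 1 ORDER BY 1;
```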


  • Distribution Key: Where possible, make sure related tables share a common distribution key. When tables share a distribution key, Greenplum Database can perform joins, sorts, and aggregations on those columns locally at the segment level rather than at the system level, which yields the most efficient query processing. Local operations are approximately 5 times faster than distributed operations, which must move data between segments. With a random distribution policy, local operations are not an option.
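For example (hypothetical tables), distributing both sides of a frequent join on the same key keeps the join local to each segment:

```sql
CREATE TABLE customers (customer_id int, name text)
    DISTRIBUTED BY (customer_id);
CREATE TABLE orders (order_id bigint, customer_id int, amount numeric)
    DISTRIBUTED BY (customer_id);

-- Matching rows live on the same segment, so no data motion is needed
SELECT c.name, sum(o.amount)
FROM customers c JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.name;
```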


  • Primary Key: If you need a primary key on a table, a SERIAL column or another sequence generator is the preferred choice. If your application does use OIDs to identify specific rows of a table, it is recommended to create a unique constraint on the OID column of that table, to ensure that OIDs in the table will indeed uniquely identify rows even after counter wrap-around.
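A minimal sketch of a sequence-backed primary key (the table name is hypothetical):

```sql
CREATE TABLE accounts (
    account_id SERIAL PRIMARY KEY,  -- backed by an auto-created sequence
    owner_name text
);
-- With no DISTRIBUTED clause, the primary key also becomes the distribution key
```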


  • Unique Constraints: For a unique constraint to be enforced in Greenplum Database, the table must be hash-distributed (not DISTRIBUTED RANDOMLY), and the constraint columns must be the same as (or a superset of) the table’s distribution key columns.
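For instance (hypothetical table), the constraint columns must cover the distribution key:

```sql
-- Accepted: the unique constraint (a, b) is a superset of the distribution key (a)
CREATE TABLE t_ok (a int, b int, UNIQUE (a, b)) DISTRIBUTED BY (a);

-- Rejected: the unique column (b) does not include the distribution key (a)
-- CREATE TABLE t_bad (a int, b int, UNIQUE (b)) DISTRIBUTED BY (a);
```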


  • Index/No Index: The general recommendation with Greenplum is to start with no indexes at all. Data distribution is itself a form of indexing, since it places specific data on specific nodes/segments; the same principle applies to data partitioning. Together, distribution and horizontal partitioning let us physically store big logical tables as smaller pieces, spread according to the number of nodes/segments.


  • Partitioning: Partitioning improves query performance on very large tables; large fact tables are typical candidates. Table partitioning is a tool to logically divide big tables to improve query performance and facilitate data warehouse maintenance tasks. Partitioning does not change the physical distribution of the table data across the segments.
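A typical shape for a large fact table, sketched with hypothetical names: distribution spreads the rows physically across segments, while range partitioning divides them logically by month:

```sql
CREATE TABLE sales (
    trans_id  int,
    sale_date date,
    amount    numeric
)
DISTRIBUTED BY (trans_id)
PARTITION BY RANGE (sale_date)
( START (date '2013-01-01') INCLUSIVE
  END   (date '2014-01-01') EXCLUSIVE
  EVERY (INTERVAL '1 month') );
```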


  • VACUUM: VACUUM reclaims storage occupied by deleted tuples. In normal GPDB operation, tuples that are deleted or obsoleted by an update are not physically removed from their table. They remain present on disk until a VACUUM is done. Therefore, it is necessary to do VACUUM periodically, especially on frequently updated tables.


  • ANALYZE: ANALYZE collects statistics about the contents of tables in the database, and stores the results in the system table pg_statistic. Subsequently, the query planner uses these statistics to help determine the most efficient execution plans for queries.

    It is a good idea to run ANALYZE periodically, or just after making major changes in the contents of a table. Accurate statistics will help the query planner to choose the most appropriate query plan, and thereby improve the speed of query processing. A common strategy is to run VACUUM and ANALYZE once a day during a low-usage time of day.
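A minimal nightly maintenance pass over a frequently updated table (name hypothetical) might look like:

```sql
VACUUM sales_orders;   -- reclaim space held by dead tuples
ANALYZE sales_orders;  -- refresh planner statistics

-- Or do both in a single pass:
VACUUM ANALYZE sales_orders;
```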


  • Search Path setting: The search_path parameter specifies the order in which schemas are searched when an object is referenced by a simple name with no schema component. When there are objects of identical names in different schemas, the one found first in the search path is used. The system catalog schema, pg_catalog, is always searched, whether it is mentioned in the path or not. When objects are created without specifying a particular target schema, they are placed in the first schema listed in the search path.
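For example (schema and table names hypothetical):

```sql
SHOW search_path;                      -- inspect the current order
SET search_path TO reporting, public;  -- search "reporting" first

-- Unqualified, this table is created in "reporting",
-- the first schema listed in the search path
CREATE TABLE daily_totals (day date, total numeric);
```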


  • Row-oriented tables: If table data must be updated after it is loaded, choose a row-oriented heap table. If new rows are frequently inserted into the table, consider a row-oriented model. If you typically request all or the majority of columns in the SELECT list or WHERE clause of your queries, consider a row-oriented model. Row-oriented storage is more efficient when many columns are required at the same time, or when the row-size of a table is relatively small.


  • Column-oriented tables: Column-oriented tables are best suited to queries that aggregate many values of a single column where the WHERE or HAVING predicate is also on the aggregate column. Column-oriented tables are not optimized for write operations, as column values for a row must be written to different places on disk. Column-oriented tables can offer better query performance on wide tables (lots of columns) where you typically only access a small subset of columns in your queries.
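A sketch of a column-oriented table definition (names hypothetical):

```sql
CREATE TABLE metrics_wide (
    metric_time timestamp,
    sensor_id   int,
    reading     numeric
)
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (sensor_id);

-- A scan that touches one column, e.g. avg(reading), reads only that column's files
```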


  • Compressing append-only tables: When choosing a compression type and level for append-only tables, consider these factors: A) CPU usage, B) compression ratio/disk size, C) speed of compression, and D) speed of decompression/scan rate. Note that column-oriented table storage is only available on append-only tables.
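As an illustration (names and settings hypothetical), zlib at a mid level trades CPU for a better compression ratio, while QUICKLZ compresses and decompresses faster at a lower ratio:

```sql
CREATE TABLE archive_events (
    event_time timestamp,
    payload    text
)
WITH (appendonly=true, orientation=column,
      compresstype=zlib, compresslevel=5)
DISTRIBUTED BY (event_time);
```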



Don’ts


  • OID’s: Avoid assuming that OIDs are unique across tables; if you need a database-wide unique identifier, use the combination of table OID and row OID for the purpose. Using OIDs in new applications is not recommended.


  • User defined data types: Columns of geometric or user-defined data types are not eligible as Greenplum distribution key columns. If a table does not have a column of an eligible data type, the rows are distributed based on a round-robin or random distribution.


  • Primary key index: There is no need to create an index explicitly on the primary key columns. Greenplum Database automatically creates an index for each unique constraint or primary key constraint to enforce uniqueness.

  • Foreign Keys: Foreign key constraints are not supported in Greenplum Database.


  • Avoid unnecessary partitioning:  Do not create partitions if your table is small in size. Partitioned tables are also distributed across Greenplum Database segments as is any non-partitioned table. Table distribution in Greenplum Database physically divides a table across the Greenplum segments to enable parallel query processing.


  • Avoid CTAS for large tables: If you need to create a duplicate copy of a large fact table in another user schema, use multiple transactions to split the task rather than CREATE TABLE AS SELECT (CTAS). A single CTAS over a large table can consume all your available resources and may freeze the Greenplum database.
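One way to split the copy, sketched with hypothetical schema, table, and column names: create the empty target first, then load it in bounded, committed batches:

```sql
-- Empty copy with the same column layout
CREATE TABLE backup.big_fact (LIKE public.big_fact)
    DISTRIBUTED BY (fact_id);

-- Load one month (or key range) per transaction, repeating as needed
BEGIN;
INSERT INTO backup.big_fact
SELECT * FROM public.big_fact
WHERE sale_date >= date '2013-01-01'
  AND sale_date <  date '2013-02-01';
COMMIT;
```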


  • Avoid VACUUM FULL: Unless you need to return space to the OS so that other tables or other parts of the system can use that space, you should use VACUUM instead of VACUUM FULL. VACUUM FULL is only needed when you have a table that is mostly dead rows, that is, the vast majority of its contents have been deleted. Even then, there is no point using VACUUM FULL unless you urgently need that disk space back for other things or you expect that the table will never again grow to its past size. Do not use it for table optimization or periodic maintenance, as it is counterproductive.


  • Avoid default first-column distribution with many NULL values: Do not let the table distribution default to the first column, which might contain many NULL or common values. (If a DISTRIBUTED clause is not supplied, then either the PRIMARY KEY, if the table has one, or the first column of the table is used as the distribution key.)
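A sketch of the pitfall (names hypothetical): without an explicit clause, the low-cardinality first column would become the key, so name a better one:

```sql
CREATE TABLE job_log (
    status   text,       -- mostly 'OK' or NULL: a poor distribution key
    job_id   bigint,
    run_time timestamp
) DISTRIBUTED BY (job_id);  -- distribute on the near-unique column instead
```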

Does DIA support health monitoring and ConnectEMC dial home notifications?

posted Nov 1, 2012, 7:43 AM by Sachi Ojha   [ updated Nov 1, 2012, 7:44 AM ]

How to check database activity?

posted Oct 27, 2012, 4:10 PM by Sachchida Ojha

  • Checking for Active Sessions (Workload)
  • Checking for Locks (Contention)
  • Checking Query Status and System Utilization

    How often is the data in the *_now tables moved to the *_history tables?

    posted Oct 27, 2012, 4:00 PM by Sachchida Ojha

    Here is how the data moves from the *_now tables to the *_history tables.

     a. *_now is an external table whose data files are stored in $MASTER_DATA_DIRECTORY/gpperfmon/data. Data is stored in *_now during the period between data collection from the Command Center agents and automatic commitment to the *_history table.

     b. *_tail is an external table whose data files are stored in $MASTER_DATA_DIRECTORY/gpperfmon/data. This is a transitional table for data that has been cleared from *_now but has not yet been committed to *_history. It typically only contains a few minutes worth of data.

     c. *_history is a regular table that stores historical data. It is pre-partitioned into monthly partitions. Partitions are automatically added in two-month increments as needed.
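    For example, recent samples can be read from the external table and older ones from the history table (assuming the standard gpperfmon system tables and their ctime column):

```sql
-- Last few minutes of samples, straight from the external files
SELECT * FROM system_now ORDER BY ctime DESC LIMIT 10;

-- Committed history, served from the partitioned regular table
SELECT * FROM system_history WHERE ctime > now() - interval '1 day';
```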

    Why can't I see the *_now tables in pgAdmin III or with the \dt command, even though I can retrieve data from them?

    posted Oct 27, 2012, 3:58 PM by Sachchida Ojha

    The now and tail data are stored as text files on the master file system and accessed by the Command Center database (gpperfmon) via external tables. This is why the *_now tables are not shown in pgAdmin III or by the \dt command, even though you can retrieve data from them.
     
    The history tables are regular database tables stored within the Command Center database, so you can see those tables.

    I am having some problems installing and/or using Greenplum Database Community Edition. Does Greenplum provide any support for Greenplum Database Community Edition?

    posted Oct 3, 2012, 6:18 AM by Sachchida Ojha

    Greenplum does not provide any support for users of Greenplum Database Community Edition. The only support for the Greenplum Database Community Edition Software is via the Greenplum Community Forums: http://www.greenplum.com/community/forums/. If you are an existing Greenplum customer, please contact Greenplum support as part of your license.

    Can I benchmark Greenplum Database against other databases or NoSQL Solutions and publish the results on my blog or share it with a third party?

    posted Oct 3, 2012, 6:17 AM by Sachchida Ojha

    No. Greenplum Database: Community Edition User Level Agreement prohibits such activity. Section 3.3 – Customer agrees not to: (i) use any application it develops with the Software without securing an appropriate license from EMC; (ii) continue to develop its application without securing an appropriate license from EMC; or (iii) disclose results of any benchmark tests related to the Software without EMC’s prior written consent.

    I am a Greenplum/EMC partner and I want to demo an application using Greenplum Database. Is that allowed?

    posted Oct 3, 2012, 6:16 AM by Sachchida Ojha

    No. The Greenplum Database: Community Edition is only an end-user license and does not cover partner agreements. Please contact an EMC representative to sign a proper partner agreement before you use Greenplum Software to prepare a demo for potential customers.

    Can I develop an application using Greenplum Database Community Edition and sell it for commercial use?

    posted Oct 3, 2012, 6:15 AM by Sachchida Ojha

    No. As part of Greenplum’s Community Edition User License Agreement, Section 3.3 – Customer agrees not to: (i) use any application it develops with the Software without securing an appropriate license from EMC; (ii) continue to develop its application without securing an appropriate license from EMC; or (iii) disclose results of any benchmark tests related to the Software without EMC’s prior written consent.

    Can I use Greenplum Database: Community Edition on my 100-node test cluster?

    posted Oct 3, 2012, 6:14 AM by Sachchida Ojha

    No, Greenplum Database: Community Edition is licensed for use on a single node only. Permission or an appropriate license from EMC is needed if the customer wishes to use the software on more than a single node in any environment (including test, development, or production).
