Greenplum DBA routine tasks

Routine Maintenance Tasks

1. Vacuum and Analyze tables

2. System Catalog Maintenance

3. Reindex Tables

4. Disk space monitoring  (checking database object sizes and disk space) and cleanup 

5. Monitoring distributions and partitions  

6. Managing Log Files

7. Monitor bloats and disk spills

8. Monitoring long running queries

9. DCA health check

10. Monitoring cron jobs and backups 

11. Monitoring user activity and resource consumption

1. Vacuum and Analyze tables

Greenplum uses the MVCC transaction model. Rows that are deleted or updated are expired but not visible to new transactions. Expired rows will take up physical space in the database until they are permanently deleted using the vacuum command. Vacuum tables periodically to remove expired rows. 

The recommendation is to run vacuum against your tables at least weekly, and more frequently if a lot of updates and deletes occur. Always run vacuum (ideally VACUUM ANALYZE) after new data has been loaded. This cuts down on a performance issue known as table bloat. Expired rows are tracked in the free space map.

Bloated tables kill query performance!!

If the free space map overflows, vacuum cannot reclaim space from expired rows. It is not recommended to run a vacuum full on the database; regular vacuum is usually sufficient. Vacuum with Analyze: Greenplum uses a cost-based query planner that depends on accurate database statistics to run SQL queries optimally.

Tables that are updated frequently end up with stale statistics. The Analyze command collects column-level statistics for the query planner to use.

# vacuum analyze mytable;

Recommended: run vacuum analyze on your largest, most frequently updated tables. 
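To automate the weekly schedule, a cron entry along the following lines can drive the vacuum. This is a sketch only: the database name, schedule, and log path are placeholders to adapt to your site.

```
# Run VACUUM ANALYZE every Sunday at 2 AM as gpadmin; adjust database name and paths
0 2 * * 0 psql -d mydb -c "VACUUM ANALYZE;" >> /home/gpadmin/gpAdminLogs/vacuum_analyze.log 2>&1
```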

Note: vacuumdb is a command-line wrapper around the VACUUM SQL command; it does nothing that VACUUM itself cannot do.

2. System Catalog Maintenance

The system catalog becomes bloated when there are lots of DDL operations such as CREATE and DROP commands. Over time this adversely affects Greenplum performance. If left unchecked, the system catalog may become so bloated that a vacuum full must be performed on it. Run vacuum on the system catalog regularly; for a system that issues many drop commands, run it nightly.

The following example script performs a VACUUM of the Greenplum Database system catalog:

#!/bin/bash
DBNAME="<database_name>"
VCOMMAND="VACUUM ANALYZE"
psql -tc "select '$VCOMMAND' || ' pg_catalog.' || relname || ';' from pg_class a, pg_namespace b where a.relnamespace=b.oid and b.nspname='pg_catalog' and a.relkind='r'" $DBNAME | psql -a $DBNAME

Intensive System Catalog Maintenance

If a system catalog maintenance procedure has not been performed in a long time, the catalog can become bloated with dead space; this causes excessively long wait times for simple metadata operations. A wait of more than two seconds to list user tables, such as with the \d metacommand from within psql, is an indication of catalog bloat.

If you see indications of system catalog bloat, you must perform an intensive system catalog maintenance procedure with VACUUM FULL during a scheduled downtime period. During this period, stop all catalog activity on the system; the FULL system catalog maintenance procedure takes exclusive locks against the system catalog.

Running regular system catalog maintenance procedures can prevent the need for this more costly procedure.
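Between maintenance windows, bloat can be spotted with the gp_toolkit bloat diagnostic view. A quick check, run from psql against the database in question, might look like the following sketch:

```sql
-- List tables the diagnostic view flags as bloated, worst first
SELECT bdinspname  AS schema_name,
       bdirelname  AS table_name,
       bdirelpages AS actual_pages,
       bdiexppages AS expected_pages,
       bdidiag     AS diagnosis
FROM gp_toolkit.gp_bloat_diag
ORDER BY bdirelpages - bdiexppages DESC;
```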

3. Reindex Tables

Rebuild poorly performing indexes using the reindex command, which reclaims wasted space in the index. It is generally faster to drop and re-create an index than to use reindex to resolve index performance issues. Note that bitmap indexes are not updated by Update and Delete operations; rebuild them with reindex after such operations.

Reindexdb utility

# reindex table mytable; rebuilds all of the indexes on a table

# reindex index myindex; rebuilds a particular index

Wrapper around the reindex command

# reindexdb --table <table> --index <index> <database>

Other command-line switches are documented in the Greenplum Utilities Guide

# reindexdb --system <database>

Use to reindex the system catalog periodically as a Best Practice

# reindexdb --all

Use to reindex all indexes in all of the databases. This may run for a long time
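When taking the drop-and-recreate route, it helps to capture the index definitions from the catalog first. A hedged example using the standard pg_indexes view (the schema and table names here are placeholders):

```sql
-- Save the CREATE INDEX statements for a table before dropping its indexes
SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public'
  AND tablename  = 'mytable';
```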

4. Disk Space and Object size monitoring 

a) Checking database object sizes and disk space

The gp_size_* family of views (defined in the gp_toolkit schema) can be used to determine the disk space usage for a distributed Greenplum database, schema, table, or index. The following views calculate the total size of an object across all primary segments (mirrors are not included in the size calculations).

1. gp_size_of_all_table_indexes

2. gp_size_of_database

3. gp_size_of_index

4. gp_size_of_partition_and_indexes_disk

5. gp_size_of_schema_disk

6. gp_size_of_table_and_indexes_disk

7. gp_size_of_table_and_indexes_licensing

8. gp_size_of_table_disk

9. gp_size_of_table_uncompressed

10. gp_disk_free

The table and index sizing views list the relation by object ID (not by name). To check the size of a table or index by name, you must look up the relation name (relname) in the pg_class table. 

For example:

SELECT relname AS name, sotdsize AS size, sotdtoastsize AS toast, sotdadditionalsize AS other
FROM gp_size_of_table_disk AS sotd, pg_class
WHERE sotd.sotdoid = pg_class.oid
ORDER BY relname;

Let's look into the details of each view.

Note: the views are accessible to all users; however, non-superusers will only be able to see relations that they have permission to access.

1. gp_size_of_all_table_indexes: This view shows the total size of all indexes for a table. This view is accessible to all users, however non-superusers will only be able to see relations that they have permission to access.

soatioid=> The object ID of the table

soatisize=> The total size of all table indexes in bytes

soatischemaname=> The schema name

soatitablename=> The table name

2. gp_size_of_database: This view shows the total size of a database. This view is accessible to all users, however non-superusers will only be able to see databases that they have permission to access.

sodddatname=> The name of the database

sodddatsize=> The size of the database in bytes
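For instance, all databases can be ranked by size with a query along these lines (assuming the views live in the gp_toolkit schema, and using pg_size_pretty to make the byte counts readable):

```sql
-- Show each database with a human-readable total size, largest first
SELECT sodddatname               AS database_name,
       pg_size_pretty(sodddatsize) AS database_size
FROM gp_toolkit.gp_size_of_database
ORDER BY sodddatsize DESC;
```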

3. gp_size_of_index: This view shows the total size of an index. This view is accessible to all users, however non-superusers will only be able to see relations that they have permission to access.

soioid=> The object ID of the index

soitableoid=> The object ID of the table to which the index belongs

soisize=> The size of the index in bytes

soiindexschemaname=> The name of the index schema

soiindexname=> The name of the index

soitableschemaname=> The name of the table schema

soitablename=> The name of the table

4. gp_size_of_partition_and_indexes_disk: This view shows the size on disk of partitioned child tables and their indexes. This view is accessible to all users, however non-superusers will only be able to see relations that they have permission to access.

sopaidparentoid=> The object ID of the parent table

sopaidpartitionoid=> The object ID of the partition table

sopaidpartitiontablesize=> The partition table size in bytes

sopaidpartitionindexessize=> The total size of all indexes on this partition

sopaidparentschemaname=> The name of the parent schema

sopaidparenttablename=> The name of the parent table

sopaidpartitionschemaname=> The name of the partition schema

sopaidpartitiontablename=> The name of the partition table

5. gp_size_of_schema_disk: This view shows schema sizes for the schemas in the current database. This view is accessible to all users, however non-superusers will only be able to see schemas that they have permission to access.

sosdnsp=> The name of the schema

sosdschematablesize=> The total size of tables in the schema in bytes

sosdschemaidxsize=> The total size of indexes in the schema in bytes
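A sketch of how this view might be used to find the heaviest schemas in the current database (schema qualification with gp_toolkit assumed):

```sql
-- Total disk footprint (tables plus indexes) per schema, largest first
SELECT sosdnsp                             AS schema_name,
       pg_size_pretty(sosdschematablesize) AS table_size,
       pg_size_pretty(sosdschemaidxsize)   AS index_size
FROM gp_toolkit.gp_size_of_schema_disk
ORDER BY sosdschematablesize + sosdschemaidxsize DESC;
```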

6. gp_size_of_table_and_indexes_disk : This view shows the size on disk of tables and their indexes. This view is accessible to all users, however non-superusers will only be able to see relations that they have permission to access.

sotaidoid=> The object ID of the parent table

sotaidtablesize=> The disk size of the table

sotaididxsize=> The total size of all indexes on the table

sotaidschemaname=> The name of the schema

sotaidtablename=> The name of the table

7. gp_size_of_table_and_indexes_licensing: This view shows the total size of tables and their indexes for licensing purposes. The use of this view requires superuser permissions.

sotailoid=> The object ID of the table

sotailtablesizedisk=> The total disk size of the table

sotailtablesizeuncompressed=> If the table is a compressed append-only table, shows the uncompressed table size in bytes.

sotailindexessize=> The total size of all indexes in the table

sotailschemaname=> The schema name

sotailtablename=> The table name

8. gp_size_of_table_disk : This view shows the size of a table on disk. This view is accessible to all users, however non-superusers will only be able to see tables that they have permission to access.

sotdoid=> The object ID of the table

sotdsize=> The total size of the table in bytes (main relation, plus oversized (toast) attributes, plus additional storage objects for AO tables).

sotdtoastsize=> The size of the TOAST table (oversized attribute storage), if there is one.

sotdadditionalsize=> Reflects the segment and block directory table sizes for append-only (AO) tables.

sotdschemaname=> The schema name

sotdtablename=> The table name

9. gp_size_of_table_uncompressed: This view shows the uncompressed table size for append-only (AO) tables. Otherwise, the table size on disk is shown. The use of this view requires superuser permissions.

sotuoid=> The object ID of the table

sotusize=> The uncompressed size of the table in bytes if it is a compressed AO table. Otherwise, the table size on disk.

sotuschemaname=> The schema name

sotutablename=> The table name

10. gp_disk_free: This external table runs the df (disk free) command on the active segment hosts and reports back the results. Inactive mirrors are not included in the calculation. The use of this external table requires superuser permissions.

dfsegment=> The content id of the segment (only active segments are shown)

dfhostname=> The hostname of the segment host 

dfdevice=> The device name 

dfspace=> Free disk space in the segment file system in kilobytes
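As a superuser, the segment filesystems closest to filling up can be surfaced with a simple ordering on this external table; a minimal sketch:

```sql
-- Segment filesystems with the least free space first (values in kilobytes)
SELECT dfsegment, dfhostname, dfdevice, dfspace AS free_kb
FROM gp_toolkit.gp_disk_free
ORDER BY dfspace;
```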

5. Monitoring distributions and partitions 

a) Checking for Uneven Data Distribution: All tables in Greenplum Database are distributed, meaning their data is divided across all of the segments in the system. If the data is not distributed evenly, then query processing performance may suffer. The following views can help diagnose if a table has uneven data distribution:

1. gp_skew_coefficients

2. gp_skew_idle_fractions

Let's look into these views.

1. gp_skew_coefficients: This view shows data distribution skew by calculating the coefficient of variation (CV) for the data stored on each segment. This view is accessible to all users, however non-superusers will only be able to see tables that they have permission to access.

skcoid=> The object id of the table.

skcnamespace=> The namespace where the table is defined.

skcrelname=> The table name.

skccoeff=> The coefficient of variation (CV) is calculated as the standard deviation divided by the average. It takes into account both the average and variability around the average of a data series. The lower the value, the better. Higher values indicate greater data skew.

2. gp_skew_idle_fractions: This view shows data distribution skew by calculating the percentage of the system that is idle during a table scan, which is an indicator of processing data skew. This view is accessible to all users, however non-superusers will only be able to see tables that they have permission to access.

sifoid=> The object id of the table.

sifnamespace=> The namespace where the table is defined.

sifrelname=> The table name.

siffraction=> The percentage of the system that is idle during a table scan, which is an indicator of uneven data distribution or query processing skew. For example, a value of 0.1 indicates 10% skew, a value of 0.5 indicates 50% skew, and so on. Tables that have more than 10% skew should have their distribution policies evaluated.
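The skew checks above can be sketched as a single query that surfaces the worst-distributed tables first. Note this view scans the underlying tables, so it can be expensive on large systems; the LIMIT here is an arbitrary illustration:

```sql
-- Tables with the worst distribution skew first; investigate high CV values
SELECT skcnamespace AS schema_name,
       skcrelname   AS table_name,
       skccoeff     AS coefficient_of_variation
FROM gp_toolkit.gp_skew_coefficients
ORDER BY skccoeff DESC
LIMIT 20;
```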