Greenplum DBA routine tasks

Routine Maintenance Tasks
1. Vacuum and Analyze tables
2. System Catalog Maintenance
3. Reindex Tables
4. Disk space monitoring  (checking database object sizes and disk space) and cleanup 
5. Monitoring distributions and partitions  
6. Managing Log Files
7. Monitor bloats and disk spills
8. Monitoring long running queries
9. DCA health check
10. Monitoring cron jobs and backups 
11. Monitoring user activity and resource consumption

1. Vacuum and Analyze Tables

Greenplum uses the MVCC transaction model. Rows that are deleted or updated are marked as expired rather than removed, and are not visible to new transactions. Expired rows take up physical space in the database until they are permanently deleted by the VACUUM command. Vacuum tables periodically to remove expired rows.

The recommendation is to run VACUUM against your tables on a weekly basis, and more frequently if a lot of updates and deletes occur. Always vacuum after new data has been loaded; this cuts down on a performance issue known as table bloat. Expired rows are tracked in the free space map, and if the free space map overflows, VACUUM cannot reclaim the space from expired rows.

Bloated tables kill query performance!

VACUUM also collects table statistics. It is not recommended to run VACUUM FULL on the database. Vacuum with Analyze: Greenplum uses a cost-based query planner that depends on accurate database statistics to run SQL queries optimally. Tables that are updated frequently end up with stale statistics. The ANALYZE command collects column-level statistics for the query planner to use:

# vacuum analyze mytable;

Recommended: run vacuum analyze on your largest, most frequently updated tables. 
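To find candidate tables, the gp_toolkit administrative schema ships with a bloat diagnostic view. A minimal sketch, assuming gp_toolkit is installed (it is by default in Greenplum):

```sql
-- List tables the bloat diagnostic flags as significantly bloated.
-- gp_bloat_diag compares actual pages on disk against the expected page
-- count from the table's statistics, so run ANALYZE first for accurate results.
SELECT bdinspname  AS schema_name,
       bdirelname  AS table_name,
       bdirelpages AS actual_pages,
       bdiexppages AS expected_pages,
       bdidiag     AS diagnosis
FROM gp_toolkit.gp_bloat_diag
ORDER BY bdirelpages - bdiexppages DESC;
```

Tables near the top of this report are the first candidates for a VACUUM ANALYZE pass.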

Note: vacuumdb is a command-line wrapper around the VACUUM SQL command.
2. System Catalog Maintenance
The system catalog becomes bloated when there are many database updates from CREATE and DROP commands. Over time this adversely affects Greenplum performance. If left unchecked, the system catalog may become so bloated that a VACUUM FULL must be performed on it. Run VACUUM on the system catalog regularly; for a system that issues many DROP commands, run it nightly.

The following example script performs a VACUUM of the Greenplum Database system catalog:

#!/bin/bash
DBNAME="<database_name>"
VCOMMAND="VACUUM ANALYZE"
psql -tc "SELECT '$VCOMMAND' || ' pg_catalog.' || relname || ';' FROM pg_class a, pg_namespace b WHERE a.relnamespace=b.oid AND b.nspname='pg_catalog' AND a.relkind='r'" $DBNAME | psql -a $DBNAME


Intensive System Catalog Maintenance

If a system catalog maintenance procedure has not been performed in a long time, the catalog can become bloated with dead space; this causes excessively long wait times for simple metadata operations. A wait of more than two seconds to list user tables, such as with the \d metacommand from within psql, is an indication of catalog bloat.

If you see indications of system catalog bloat, you must perform an intensive system catalog maintenance procedure with VACUUM FULL during a scheduled downtime period. During this period, stop all catalog activity on the system; the FULL system catalog maintenance procedure takes exclusive locks against the system catalog.

Running regular system catalog maintenance procedures can prevent the need for this more costly procedure.

3. Reindex Tables
Rebuild poorly performing indexes using the REINDEX command, which reclaims wasted space in the index. Note that it is generally faster to drop and re-create an index than to use REINDEX to resolve index performance issues. Bitmap indexes are not updated by UPDATE and DELETE operations, so after such operations you must rebuild them with REINDEX.

The REINDEX SQL command:

# REINDEX TABLE mytable;  rebuilds all of the indexes on a table
# REINDEX INDEX myindex;  rebuilds a particular index

The reindexdb utility is a command-line wrapper around the REINDEX command:

# reindexdb --table <table> --index <index> <database>

Other command-line switches are documented in the Greenplum Utilities Guide:
# reindexdb --system
Use to reindex the system catalog periodically as a best practice.
# reindexdb --all
Use to reindex all indexes in all of the databases. This may run for a long time.
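As a sketch of the drop-and-recreate approach for a bitmap index after heavy UPDATE/DELETE activity (the table and index names here are hypothetical):

```sql
-- Rebuild a bitmap index by dropping and re-creating it; for large
-- indexes this is often faster than REINDEX.
DROP INDEX IF EXISTS idx_sales_region;
CREATE INDEX idx_sales_region ON sales USING bitmap (region_id);
```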

4. Disk Space and Object Size Monitoring

a) Checking database object sizes and disk space

The gp_size_* family of views in the gp_toolkit administrative schema can be used to determine the disk space usage for a distributed Greenplum database, schema, table, or index. The following views calculate the total size of an object across all primary segments (mirrors are not included in the size calculations).

1. gp_size_of_all_table_indexes
2. gp_size_of_database
3. gp_size_of_index
4. gp_size_of_partition_and_indexes_disk
5. gp_size_of_schema_disk
6. gp_size_of_table_and_indexes_disk
7. gp_size_of_table_and_indexes_licensing
8. gp_size_of_table_disk
9. gp_size_of_table_uncompressed
10. gp_disk_free

The table and index sizing views list the relation by object ID (not by name). To check the size of a table or index by name, you must look up the relation name (relname) in the pg_class table. 

For example:
SELECT relname as name, sotdsize as size, sotdtoastsize as toast, sotdadditionalsize as other 
FROM gp_size_of_table_disk as sotd, pg_class 
WHERE sotd.sotdoid=pg_class.oid ORDER BY relname;

Let's look into the details of each view.

Note: these views are accessible to all users; however, non-superusers will only be able to see relations that they have permission to access.
1. gp_size_of_all_table_indexes: This view shows the total size of all indexes for a table. This view is accessible to all users, however non-superusers will only be able to see relations that they have permission to access.
soatioid=> The object ID of the table
soatisize=> The total size of all table indexes in bytes
soatischemaname=> The schema name
soatitablename=> The table name

2. gp_size_of_database: This view shows the total size of a database. This view is accessible to all users, however non-superusers will only be able to see databases that they have permission to access.
sodddatname=> The name of the database
sodddatsize=> The size of the database in bytes
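For example, a quick per-database size report, largest first (this sketch assumes the view is reached through the gp_toolkit schema):

```sql
-- Report each database's size in human-readable form.
SELECT sodddatname AS database_name,
       pg_size_pretty(sodddatsize) AS size
FROM gp_toolkit.gp_size_of_database
ORDER BY sodddatsize DESC;
```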

3. gp_size_of_index: This view shows the total size of an index. This view is accessible to all users, however non-superusers will only be able to see relations that they have permission to access.
soioid=> The object ID of the index
soitableoid=> The object ID of the table to which the index belongs
soisize=> The size of the index in bytes
soiindexschemaname=> The name of the index schema
soiindexname=> The name of the index
soitableschemaname=> The name of the table schema
soitablename=> The name of the table

4. gp_size_of_partition_and_indexes_disk: This view shows the size on disk of partitioned child tables and their indexes. This view is accessible to all users, however non-superusers will only be able to see relations that they have permission to access.
sopaidparentoid=> The object ID of the parent table
sopaidpartitionoid=> The object ID of the partition table
sopaidpartitiontablesize=> The partition table size in bytes
sopaidpartitionindexessize=> The total size of all indexes on this partition
sopaidparentschemaname=> The name of the parent schema
sopaidparenttablename=> The name of the parent table
sopaidpartitionschemaname=> The name of the partition schema
sopaidpartitiontablename=> The name of the partition table

5. gp_size_of_schema_disk: This view shows schema sizes for the schemas in the current database. This view is accessible to all users, however non-superusers will only be able to see schemas that they have permission to access.
sosdnsp=> The name of the schema
sosdschematablesize=> The total size of tables in the schema in bytes
sosdschemaidxsize=> The total size of indexes in the schema in bytes
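A sketch of a per-schema footprint report in the current database, again assuming the gp_toolkit schema qualification:

```sql
-- Total table and index footprint per schema, largest tables first.
SELECT sosdnsp AS schema_name,
       pg_size_pretty(sosdschematablesize) AS table_size,
       pg_size_pretty(sosdschemaidxsize)   AS index_size
FROM gp_toolkit.gp_size_of_schema_disk
ORDER BY sosdschematablesize DESC;
```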

6. gp_size_of_table_and_indexes_disk : This view shows the size on disk of tables and their indexes. This view is accessible to all users, however non-superusers will only be able to see relations that they have permission to access.
sotaidoid=> The object ID of the parent table
sotaidtablesize=> The disk size of the table
sotaididxsize=> The total size of all indexes on the table
sotaidschemaname=> The name of the schema
sotaidtablename=> The name of the table

7. gp_size_of_table_and_indexes_licensing: This view shows the total size of tables and their indexes for licensing purposes. The use of this view requires superuser permissions.
sotailoid=> The object ID of the table
sotailtablesizedisk=> The total disk size of the table
sotailtablesizeuncompressed=> If the table is a compressed append-only table, shows the uncompressed table size in bytes.
sotailindexessize=> The total size of all indexes in the table
sotailschemaname=> The schema name
sotailtablename=> The table name

8. gp_size_of_table_disk : This view shows the size of a table on disk. This view is accessible to all users, however non-superusers will only be able to see tables that they have permission to access.
sotdoid=> The object ID of the table
sotdsize=> The total size of the table in bytes (main relation, plus oversized (toast) attributes, plus additional storage objects for AO tables).
sotdtoastsize=> The size of the TOAST table (oversized attribute storage), if there is one.
sotdadditionalsize=> Reflects the segment and block directory table sizes for append-only (AO) tables.
sotdschemaname=> The schema name
sotdtablename=> The table name

9. gp_size_of_table_uncompressed: This view shows the uncompressed table size for append-only (AO) tables. Otherwise, the table size on disk is shown. The use of this view requires superuser permissions.
sotuoid=> The object ID of the table
sotusize=> The uncompressed size of the table in bytes if it is a compressed AO table. Otherwise, the table size on disk.
sotuschemaname=> The schema name
sotutablename=> The table name

10. gp_disk_free: This external table runs the df (disk free) command on the active segment hosts and reports back the results. Inactive mirrors are not included in the calculation. The use of this external table requires superuser permissions.
dfsegment=> The content id of the segment (only active segments are shown)
dfhostname=> The hostname of the segment host 
dfdevice=> The device name 
dfspace=> Free disk space in the segment file system in kilobytes
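A minimal check of free space per active segment, assuming superuser access and the gp_toolkit schema qualification:

```sql
-- Free disk space on each active segment, tightest segments first.
SELECT dfsegment, dfhostname, dfdevice, dfspace
FROM gp_toolkit.gp_disk_free
ORDER BY dfspace ASC;
```

Segments at the top of this list are the ones closest to running out of space; a distributed system is only as healthy as its fullest segment.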


5. Monitoring distributions and partitions 
a) Checking for Uneven Data Distribution

All tables in Greenplum Database are distributed, meaning their data is divided across all of the segments in the system. If the data is not distributed evenly, then query processing performance may suffer. The following views can help diagnose if a table has uneven data distribution:

1. gp_skew_coefficients
2. gp_skew_idle_fractions
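Before consulting the views, a quick manual check is to count rows per segment for a single table using the gp_segment_id system column (the table name here is hypothetical):

```sql
-- Rows stored on each segment; large differences between segments
-- indicate distribution skew for this table.
SELECT gp_segment_id, count(*) AS row_count
FROM mytable
GROUP BY gp_segment_id
ORDER BY gp_segment_id;
```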

Let's look into these views.

1. gp_skew_coefficients: This view shows data distribution skew by calculating the coefficient of variation (CV) for the data stored on each segment. This view is accessible to all users, however non-superusers will only be able to see tables that they have permission to access.
skcoid=> The object id of the table.
skcnamespace=> The namespace where the table is defined.
skcrelname=> The table name.
skccoeff=> The coefficient of variation (CV) is calculated as the standard deviation divided by the average. It takes into account both the average and variability around the average of a data series. The lower the value, the better. Higher values indicate greater data skew.
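A sketch of finding the most skewed tables with this view (assumes gp_toolkit schema qualification; note the view scans data, so it can be slow on large systems):

```sql
-- Ten most skewed tables by coefficient of variation (higher = worse).
SELECT skcnamespace, skcrelname, skccoeff
FROM gp_toolkit.gp_skew_coefficients
ORDER BY skccoeff DESC
LIMIT 10;
```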

2. gp_skew_idle_fractions: This view shows data distribution skew by calculating the percentage of the system that is idle during a table scan, which is an indicator of processing data skew. This view is accessible to all users, however non-superusers will only be able to see tables that they have permission to access.

sifoid=> The object id of the table.
sifnamespace=> The namespace where the table is defined.
sifrelname=> The table name.
siffraction=> The percentage of the system that is idle during a table scan, which is an indicator of uneven data distribution or query processing skew. For example, a value of 0.1 indicates 10% skew, a value of 0.5 indicates 50% skew, and so on. Tables that have more than 10% skew should have their distribution policies evaluated.
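Applying the 10% threshold from above, a hedged example query flagging tables whose distribution policies should be evaluated:

```sql
-- Tables with more than 10% processing skew; candidates for a new
-- distribution key.
SELECT sifnamespace, sifrelname, siffraction
FROM gp_toolkit.gp_skew_idle_fractions
WHERE siffraction > 0.1
ORDER BY siffraction DESC;
```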
