Greenplum database maintenance and monitoring

Regular monitoring and maintenance are necessary to keep your database performing well for applications and users. There are a few things you need to monitor on a daily or weekly basis. We will discuss some of the important checks you need to run actively.

Check for FATAL and ERROR log messages from the system

posted Oct 9, 2014, 3:04 PM by Sachchida Ojha

Run SQL

select * from log_alert_history where logseverity in ('FATAL', 'ERROR') and logtime > (now() - interval '15 minutes');

Analyze the alerts. You may want to add additional filters to the query to ignore certain messages of low interest.
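One way to apply such filters outside SQL is to post-process the returned rows in a small script. A minimal sketch; the ignore patterns below are hypothetical site-specific examples, not part of Greenplum:

```python
import re

# Hypothetical examples of low-interest messages; tune to your site.
IGNORE_PATTERNS = [
    re.compile(r'canceling statement due to user request'),
    re.compile(r'relation ".*" does not exist'),
]

def interesting_alerts(rows):
    """Keep only (logseverity, logmessage) pairs from log_alert_history
    that do not match an ignore pattern."""
    return [(sev, msg) for sev, msg in rows
            if not any(p.search(msg) for p in IGNORE_PATTERNS)]
```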

Basic check to see if the master is up and functioning

posted Oct 9, 2014, 2:55 PM by Sachchida Ojha

Run SQL

select count(*) from gp_segment_configuration;

If this fails, the active master may be down. Try again several times and then inspect the active master manually. If the active master is down, reboot or power-cycle it to ensure no processes remain on the active master, and then trigger the activation of the standby master.
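The "try again several times" step can be sketched as a retry wrapper. How run_check connects and runs the probe query (psql, psycopg2, etc.) is up to you, so the callable here is an assumption:

```python
import time

def master_is_up(run_check, attempts=3, delay=5):
    """Retry the probe query several times before declaring the master down.
    run_check is any callable that runs
    `select count(*) from gp_segment_configuration` and returns True on
    success; the connection mechanism is assumed, not prescribed."""
    for attempt in range(attempts):
        try:
            if run_check():
                return True
        except Exception:
            pass  # treat connection errors the same as a failed check
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```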

Check the state of master mirroring

posted Oct 9, 2014, 2:50 PM by Sachchida Ojha   [ updated Oct 9, 2014, 2:53 PM ]

Run SQL

select summary_state from gp_master_mirroring;

The above query tests the state of master mirroring. If the value is "Not Synchronized", check the pg_log on the master and standby master for errors. If there are no unexpected errors and the machines are up, run gpinitstandby to bring the standby back online. This will require a database restart on GPDB 4.2 and earlier.

SELECT procpid, state FROM pg_stat_replication; -- GPDB 4.3 and later

The above SQL tests the state of master mirroring on GPDB 4.3 and later. If the value is not "streaming", check the pg_log on the master and standby master for errors. If there are no unexpected errors and the machines are up, run gpinitstandby to bring the standby back online.
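The two version-specific probes above can be folded into one helper. A minimal sketch, assuming the caller passes the GPDB version and the value returned by the appropriate query:

```python
def standby_in_sync(gpdb_version, value):
    """Interpret the master-mirroring probe result.
    GPDB 4.2 and earlier: value is gp_master_mirroring.summary_state,
    where 'Synchronized' is healthy.
    GPDB 4.3 and later: value is pg_stat_replication.state,
    where 'streaming' is healthy."""
    if gpdb_version < (4, 3):
        return value == 'Synchronized'
    return value.lower() == 'streaming'
```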

Check if distributed query is functional on all segments

posted Oct 9, 2014, 2:47 PM by Sachchida Ojha

Run SQL

select gp_segment_id, count(*) from gp_dist_random('pg_class') group by 1 order by 1;

This query tests that query dispatch is functional on all segments; one row should be returned for each primary segment. If this fails, there is an issue dispatching to some segments in the cluster. This is a rare event. Check the hosts that cannot be dispatched to and ensure there is no hardware or networking issue.
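A sketch of the "one row per primary" check, assuming you know the cluster's primary segment count and have collected the gp_segment_id values returned by the query:

```python
def missing_segments(returned_ids, primary_count):
    """Primary segments are numbered 0 .. primary_count-1; any id missing
    from the query result points at a segment that could not be
    dispatched to."""
    return sorted(set(range(primary_count)) - set(returned_ids))
```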

Check segments that are in a non-optimal role

posted Oct 9, 2014, 2:43 PM by Sachchida Ojha

Run SQL

select * from gp_segment_configuration where preferred_role <> role;

Any rows returned are segments not running in their preferred role, which implies a processing skew across the cluster with an uneven number of primary segments on each host. Wait for a maintenance window and restart the database to bring the segments back into their preferred roles.

Check segments that are currently resyncing

posted Oct 9, 2014, 2:41 PM by Sachchida Ojha

Run SQL

select * from gp_segment_configuration where mode = 'r';

Any rows returned are segments in the process of being resynchronized. If the mode does not change from 'r' to 's', check the pg_log of the primary and mirror of the affected segments for errors.

Check segments that are currently in change tracking mode

posted Oct 9, 2014, 2:38 PM by Sachchida Ojha

Run SQL

select * from gp_segment_configuration where mode = 'c';

Check whether the host the down segments are on is responsive. If it is, check the pg_log of the primary and mirror of the down segments to find the root cause of the segments going down. If there are no unexpected errors, run gprecoverseg to bring the segments back online.

Check if one or more Greenplum segments are down

posted Oct 9, 2014, 2:12 PM by Sachchida Ojha

Run SQL

select * from gp_segment_configuration where status <> 'u';

This query returns segments that are currently down. If more than zero rows are returned, this should generate a warning or alert. Check whether the host the down segments are on is responsive. If it is, check the pg_log of the primary and mirror of the down segments to find the root cause of the segments going down. If there are no unexpected errors, run gprecoverseg to bring the segments back online.
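All of the gp_segment_configuration probes above (down segments, change tracking, resync, non-preferred role) share the same alerting shape: any returned row warrants a warning. A minimal sketch:

```python
def segment_alert(rows, description):
    """Turn the rows returned by a gp_segment_configuration probe into an
    alert string, or None when the check is clean."""
    if rows:
        return "WARNING: {0} segment(s) {1}".format(len(rows), description)
    return None
```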
