How to free space from expired rows / deleted rows in greenplum?

posted Sep 21, 2012, 7:54 AM by Sachi Ojha
Ans: Because of the MVCC transaction concurrency model, data rows that are deleted or updated still occupy physical space on disk even though they are not visible to any new transactions. If you have a database with lots of updates and deletes, you will generate a lot of expired rows. Periodically running the VACUUM command will remove these expired rows. For example:

VACUUM mytable;

The VACUUM command also collects table-level statistics such as number of rows and pages, so it is necessary to vacuum all tables after loading data, including append-only tables.

Vacuuming append-only tables should be instantaneous since there will be no space to reclaim.

We also recommend all greenplum DBA's to perform routine vacuum operations.

Transaction ID Management
Greenplum’s MVCC transaction semantics depend on being able to compare transaction ID (XID) numbers to determine visibility to other transactions. But since transaction IDs have limited size, a Greenplum system that runs for a long time (more than 4 billion transactions) would suffer transaction ID wraparound: the XID counter wraps around to zero, and all of a sudden transactions that were in the past appear to be in the future — which means their outputs become invisible. To avoid this, it is necessary to VACUUM every table in every database at least once every two billion transactions.

System Catalog Maintenance
Numerous database updates with CREATE and DROP commands can cause growth in the size of the system catalog that affects system performance. For example, after a large number of DROP TABLE statements, the overall performance of the system begins to degrade due to excessive data scanning during metadata operations on the catalog tables. Depending on your system, the performance loss may occur between thousands to tens of thousands of DROP TABLE statements.

Greenplum recommends that you regularly run a system catalog maintenance procedure to reclaim the space occupied by deleted objects. If a regular procedure has not been run for a long time, you may need to run a more intensive procedure to clear the system catalog. Both types of procedures are described in this section.

Regular System Catalog Maintenance
Greenplum recommends that you periodically run VACUUM on the system catalog to clear the space occupied by deleted objects. If numerous DROP statements are a part of regular database operations, it is safe and appropriate to run a system catalog maintenance procedure with VACUUM daily at off-peak hours. This can be done while the system is running and available.

The following example script performs a VACUUM of the Greenplum Database system catalog:

psql -tc "select '$VCOMMAND' || ' pg_catalog.' || relname || ';' from pg_class a,pg_namespace b where a.relnamespace=b.oid and b.nspname= 'pg_catalog' and a.relkind='r'" $DBNAME | psql -a $DBNAME

Intensive System Catalog Maintenance
If a system catalog maintenance procedure has not been performed in a long time, the catalog may become bloated with dead space, causing excessively long wait times for simple metadata operations. A wait of more than one or two seconds to list user tables, such as with the \d metacommand from within psql, is an indication of catalog bloat.

If you see indications of system catalog bloat, an intensive system catalog maintenance procedure with VACUUM FULL must be performed during a scheduled downtime period. During this period you must stop all catalog activity on the system due to the exclusive locks taken against the system catalog by the FULL system catalog maintenance procedure.

Running regular system catalog maintenance procedures can prevent the need for the more costly intensive procedure.

Vacuum and Analyze for Query Optimization
Greenplum Database uses a cost-based query planner that relies on database statistics. Accurate statistics allow the query planner to better estimate selectivity and the number of rows retrieved by a query operation in order to choose the most efficient query plan. The ANALYZE command collects column-level statistics needed by the query planner.

Both VACUUM and ANALYZE operations can be run in the same command. For example:
=# VACUUM ANALYZE mytable;