greenplum database new features - Incremental Analyze

Incremental Analyze is much awaited features of the Greenplum database. Analyzing large (partitioned) tables are the most time consuming operation during the regular database maintenance activity. Lets explore this feature a little further.

Incremental Analyze (4.3.5.0)

Greenplum database 4.3.5.0 introduced a new utility called analyzedb. As of 

[gpadmin@localhost gpsne-1]$ analyzedb --help

Usage: analyzedb [options] 

Analyze a database incrementally. 'Incremental' means if a table or partition has not been modified by DML or DDL commands since the last analyzedb run, it will be automatically skipped since its statistics must be up to date. Some restrictions apply:

1. The incremental semantics only applies to append-only tables or partitions. 

2. All heap tables are regarded as having stale stats every time analyzedb is run. This is because we use AO metadata to check for DML or DDL events, which is not available to heap tables. 

 

3. Catalog tables, views and external tables are automatically skipped.

Options:

  --version             show program's version number and exit

  -d <database name>    Database name. Required.

  -s <schema name>      Specify a schema to analyze. All tables in the schema

                        will be analyzed.

  -t <schema name>.<table name>

                        Analyze a single table. Table name needs to be

                        qualified with schema name.

  -i <column1>,<column2>,...

                        Columns to include to be analyzed, separated by comma.

                        All columns will be analyzed if not specified.

  -x <column1>,<column2>,...

                        Columns to exclude to be analyzed, separated by comma.

                        All columns will be analyzed if not specified.

  -f <config_file>, --file=<config_file>

                        Config file that includes a list of tables to be

                        analyzed. Table names must be qualified with schema

                        name. Optionally a list of columns (separated by

                        comma) can be specified using -i or -x.

  -l, --list            List the tables to be analyzed without actually

                        running analyze (dry run).

  -p <parallel level>   Parallel level, i.e. the number of tables to be

                        analyzed in parallel. Valid numbers are between 1 and

                        10. Default value is 5.

  --skip_root_stats     Skip refreshing root partition stats if any of the

                        leaf partitions is analyzed.

  --full                Analyze without using incremental. All tables

                        requested by the user will be analyzed.

  --clean_last          Clean the state files generated by last analyzedb run.

                        All other options except -d will be ignored.

  --clean_all           Clean all the state files generated by analyzedb. All

                        other options except -d will be ignored.

  -h, -?, --help        Show this help message and exit.

  -v, --verbose         Print debug messages.

  -a                    Quiet mode. Do not prompt for user confirmation.

[gpadmin@localhost gpsne-1]$ 

While performing the ANALYZE operations, analyzedb creates a snapshot of table metadata snapshot and stores it on disk on the master host. If you run analyzedb to ANALYZE an append-optimized table, the utility checks the metadata taken during a previous analyzedb operation to determine if the table has been modified. An ANALYZE operation is performed only if the table has been modified. If a table or partition has not been modified since the last time it was analyzed, analyzedb automatically skips the table or partition because it already contains up-to-date statistics.

For a partitioned, append-optimized table, analyzedb checks the partitioned table root partition and leaf partitions. If needed, the utility updates statistics for partitions that have changed and, if necessary, the root partition.

To perform the ANALYZE operations, analyzedb creates concurrent sessions to analyze tables in parallel. For each session,analyzedb issues an ANALYZE command to the database and specifies different table names.

This example command specifies a file that contains a list of tables. The command collects statistics on the tables listed in the file analyze-tables in the database named mytest.

$ analyzedb -d sachi -f analyze-table_list