Understanding data compressions in Greenplum

Note: The pg_compression system catalog table describes the compression methods available

sachi=# \d pg_catalog.pg_compression

Table "pg_catalog.pg_compression"

Column | Type | Modifiers

------------------+---------+-----------

compname | name | not null

compconstructor | regproc | not null

compdestructor | regproc | not null

compcompressor | regproc | not null

compdecompressor | regproc | not null

compvalidator | regproc | not null

compowner | oid | not null

Indexes:

"pg_compression_compname_index" UNIQUE, btree (compname)

"pg_compression_oid_index" UNIQUE, btree (oid)

Note: Storage size optimizations are available in column-oriented data that are not available in row-oriented data. Keep in mind the greater adjacent compression achieved, the more difficult random access can become, as data must be uncompressed to be read.

Compression Considerations

Append Only Tables

Minimizing disk size is one factor, but consider the time and CPU required to compress, uncompress and scan data. Find the optimal settings for efficiently compressing data without causing excessively long compression times or slow scan rates. QUICKLZ compression generally uses less CPU capacity and compresses data faster at a lower compression ratio than ZLIB (1-9). ZLIB provides higher compression ratios at lower speeds. There are four levels(1-4) of RLE compression available. The levels progressively increase the compression ratio, but decrease the compression speed.

Do not use GPDB compression on file system that use compression. For additional information refer to the GPDB Database Administrator Guide.

gp_workfile_compress_algorithm Default value: none

When a hash aggregation or hash join operation spills to disk, specifies the compression algorithm to use on the spill files If using zlib, it must be in your ${PATH} on all segments for user gpadmin

Recommended to compress spill files if the system is I/O bound

Greenplum Database supports several storage models and a mix of storage models. When you create a table, you choose how to store its data.

This presentation explains the options for table storage and how to choose the best storage model for your workload.

Using Compression (Append-Only Tables Only)

Checking Compression and Distribution of an Append-Only Table

When choosing a compression type and level for append-only tables, consider these factors:

CPU usage. Your segment systems must have the available CPU power to compress and uncompress the data.

Compression ratio/disk size. Minimizing disk size is one factor, but also consider the time and CPU capacity required to compress and scan data. Find the optimal settings for efficiently compressing data without causing excessively long compression times or slow scan rates.

Speed of compression. QuickLZ compression generally uses less CPU capacity and compresses data faster at a lower compression ratio than zlib. zlib provides higher compression ratios at lower speeds. 

For example, at compression level 1 (compresslevel=1), QuickLZ and zlib have comparable compression ratios, though at different speeds. Using zlib with compresslevel=6 can significantly increase the compression ratio compared to QuickLZ, though with lower compression speed.

Speed of decompression/scan rate. Performance with compressed append-only tables depends on hardware, query tuning settings, and other factors. Perform comparison testing to determine the actual performance in your environment.

Do not use compressed append-only tables on file systems that use compression. If the file system on which your segment data directory resides is a compressed file system, your append-only table must not use compression.

Note: Do not use compressed append-only tables on file systems that use compression. If the file system on which your segment data directory resides is a compressed file system, your append-only table must not use compression.

Performance with compressed append-only tables depends on hardware, query tuning settings, and other factors. Greenplum recommends performing comparison testing to determine the actual performance in your environment.

Note: QuickLZ compression level can only be set to level 1; no other options are available. Compression level with zlib can be set at any value between 1 and 9.

Note: When an ENCODING clause conflicts with a WITH clause, the ENCODING clause has higher precedence than the WITH clause.

The WITH clause of the CREATE TABLE command declares the table storage options. Tables that use compression must be append-only tables. For example, to create an append-only table with zlib compression at a compression level of 5:

CREATE TABLE mytable (a int, b char(20))

WITH (appendonly=true, compresstype=zlib, compresslevel=5) DISTRIBUTED BY (a);

Greenplum provides built-in functions to check the compression ratio and the distribution of an append-only table. The functions take either the object ID or a table name. You can qualify the table name with a schema name.

The compression ratio is returned as a common ratio. For example, a returned value of 3.19, or 3.19:1, means that the uncompressed table is slightly larger than three times the size of the compressed table.

sachi=# select * from pg_catalog.pg_compression;

compname | compconstructor | compdestructor | compcompressor | compdecompressor | co

mpvalidator | compowner

----------+----------------------------------+---------------------------------+-------------------------------+---------------------------------+-----------

---------------------+-----------

zlib | gp_zlib_constructor | gp_zlib_destructor | gp_zlib_compress | gp_zlib_decompress | gp_zlib_va

lidator | 10

quicklz | gp_quicklz_constructor | gp_quicklz_destructor | gp_quicklz_compress | gp_quicklz_decompress | gp_quicklz

_validator | 10

rle_type | gp_rle_type_constructor | gp_rle_type_destructor | gp_rle_type_compress | gp_rle_type_decompress | gp_rle_typ

e_validator | 10

none | gp_dummy_compression_constructor | gp_dummy_compression_destructor | gp_dummy_compression_compress | gp_dummy_compression_decompress | gp_dummy_c

ompression_validator | 10

(4 rows)

Sachi=# select * from gp_toolkit.gp_size_of_table_uncompressed where sotusize > 1000000 limit 10; Sachi=# select * from gp_toolkit.gp_size_of_table_disk where sotdtablename = 'sales';

Q. What is compression ratio like for greenplum column oriented table

Ans: Greenplum row compression is only for Append Only and is about 35% - 40% reduction in volume. Row compression is done at the partition level, and once, compressed it is not updatable. However, Pivotal Greenplum is working on future feature release to have it updatable. The compression using compresstype=quicklz with default=1 is about 6 times the compression ratio. Eg. for 2 TB the table compressed will be around 360GB using compress_type "quicklz" with default=1. This are approximation, so the conclusion is 6 times compressed ratio. The following built-in function is provided to check the compression ratio of an append-only compressed table, and it takes table name or object ID. The function calculates the compression ratio: 

get_ao_compression_ratio(name)

get_ao_compression_ratio(oid)

3|uncompressed|public.foo_tab|Fri Nov 14 14:30:00 EST 2014|1415993400|1415993536|136|100000000|2602065248|2482 MB

3|zlib_5|public.foo_tab|Fri Nov 14 14:32:23 EST 2014|1415993543|1415993635|92|100000000|621586400|593 MB

3|zlib_7|public.foo_tab|Fri Nov 14 14:34:05 EST 2014|1415993645|1415993753|108|100000000|619200880|591 MB

3|quicklz|public.foo_tab|Fri Nov 14 14:36:03 EST 2014|1415993763|1415993810|47|100000000|835938248|797 MB

My colleague RAM has created a table foo_tab and it has 100 million rows, and you can see from the above that actual table is 3 times larger than the quicklz compressed table, and 4 times larger than the zlib level 5 compressed table. The procedure returns similar compression ratio for quicklz and zlib level 5 compressed tables:

gpdb=# select get_ao_compression_ratio('public.foo_tab_quicklz');

get_ao_compression_ratio

--------------------------

3.11

(1 row)

gpdb=# select get_ao_compression_ratio('public.foo_tab_zlib5');

get_ao_compression_ratio

--------------------------

4.19

(1 row)

There is little difference in compressed table size and compression ratio between zlib level 5 and level 7 append-only tables.