Fixing size bloat in Greenplum tables
Post date: Sep 11, 2014 2:46:21 AM
I noticed recently that some queries on a table were running very slow. Simple counts were taking longer on what appeared to be smaller tables. I say appeared because the tables had 1/5 the number of rows as other tables but queries were slower. The raw data files contained about 14 GB of data. To see what the Greenplum system had for table size I ran this query:
The answer was 130GB! Clearly there were problems. This table does get reloaded every month with new data but I truncate the table before reloading it so you aren't supposed to run into issues. Anyway there turned out to be a couple of solutions. One was to run a vacuum full on the table. After running that the table size was reported as 13.635 MB. I did try a vacuum on the table but it had no impact on size. Another solution is to redistribute the data randomly then redistribute by the table key.
ALTER TABLE schema.table_name SET DISTRIBUTED RANDOMLY;
ALTER TABLE schema.table_name SET DISTRIBUTED BY (table_key);
The first statement will not actually redistribute the data. To make the system actually redistribute the data randomly run this:
ALTER TABLE schema.table_name SET WITH (REORGANIZE=true);
Redistributing by a column name will actually physically redistribute the data.