Post date: Oct 05, 2012 10:29:39 PM
The Postgres/Greenplum optimizer is cost based: it selects the execution plan with the lowest estimated cost. In Greenplum, the cost of one scattered (random) page fetch is controlled by the random_page_cost parameter, which is 100 by default. Compared with PostgreSQL's default of 4, this is quite large; the high value deliberately biases the planner toward sequential scans in most cases.
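You can check the value in effect for your session with SHOW (a minimal sketch; the output below assumes the Greenplum default of 100 is still in place):

gpadmin=# show random_page_cost;
 random_page_cost
------------------
 100
(1 row)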
In some special cases, lowering random_page_cost can be used to manually steer the planner toward a better execution plan.
Example:
gpadmin=# create table t as select generate_series x from generate_series(1,10000);
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'x' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
SELECT 10000
gpadmin=# create index f_idx on t( lower(x));
CREATE INDEX
gpadmin=# explain select * from t where lower(x) = '1';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Gather Motion 8:1  (slice1; segments: 8)  (cost=0.00..191.00 rows=2 width=4)
   ->  Seq Scan on t  (cost=0.00..191.00 rows=2 width=4)
         Filter: lower(x::text) = '1'::text
(3 rows)
gpadmin=# set random_page_cost to 1;
SET
gpadmin=# explain select * from t where lower(x) = '1';
                                 QUERY PLAN
----------------------------------------------------------------------------
 Gather Motion 8:1  (slice1; segments: 8)  (cost=0.00..9.43 rows=2 width=4)
   ->  Index Scan using f_idx on t  (cost=0.00..9.43 rows=2 width=4)
         Index Cond: lower(x::text) = '1'::text
 Settings:  random_page_cost=1
(4 rows)
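With random_page_cost lowered to 1, the estimated cost of the index scan (9.43) falls below that of the seq scan (191.00), so the planner switches to the index. Note that SET only changes the parameter for the current session; RESET restores the configured default (a minimal sketch using the standard commands):

gpadmin=# reset random_page_cost;
RESET
gpadmin=# show random_page_cost;
 random_page_cost
------------------
 100
(1 row)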