greenplum partition optimization

Question

On greenplum, I have a big table called fact_table which is partitioned by RANGE(day_bucket). Why is so slow for the following query:

select max(day_bucket) from fact_table where day_bucket >= '2011-09-11 00:00:00' and day_bucket < '2011-12-14'.

I suppose it should just look at the head of each partition and return the result immediately, since each partition of the same day_bucket column. But greenplum did a FULL scan to calculate the result. Anybody can explain to me the reason?

Update:

Thanks for answering my question, but it doesn't help with your tip. Greenplum always do a full scan, even though I create the table with PARTITION BY LIST(day_bucket):

CREATE TABLE fact_table (
    id character varying(25) NOT NULL,
    day_bucket timestamp without time zone NOT NULL,
)
WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=6) DISTRIBUTED BY (user_id) PARTITION BY LIST(day_bucket) 
          (
          PARTITION p20120101 VALUES ('2012-01-01 00:00:00'::timestamp without time zone) WITH (tablename='fact_table_1_prt_p20120101', appendonly=true, orientation=column, compresstype=zlib, compresslevel=6 ), 
          PARTITION p20120102 VALUES ('2012-01-02 00:00:00'::timestamp without time zone) WITH (tablename='fact_table_1_prt_p20120102', appendonly=true, orientation=column, compresstype=zlib, compresslevel=6 ), 
          PARTITION p20120103 VALUES ('2012-01-03 00:00:00'::timestamp without time zone) WITH (tablename='fact_table_1_prt_p20120103', appendonly=true, orientation=column, compresstype=zlib, compresslevel=6 ), 
          PARTITION p20120104 VALUES ('2012-01-04 00:00:00'::timestamp without time zone) WITH (tablename='fact_table_1_prt_p20120104', appendonly=true, orientation=column, compresstype=zlib, compresslevel=6 ), 
       .....

Explain command shows that it always do a full scan:

-> Append-only Columnar Scan on mytestlist_1_prt_p20120102 mytestlist (cost=0.00..34.95 rows=1 width=8) Filter: day_bucket >= '2012-01-02 00:00:00'::timestamp without time zone AND day_bucket Append-only Columnar Scan on mytestlist_1_prt_p20120103 mytestlist (cost=0.00..39.61 rows=1 width=8) Filter: day_bucket >= '2012-01-02 00:00:00'::timestamp without time zone AND day_bucket

Are there any indices on the table? How many rows, on average, does each partition have? — user554546
In the future please edit your question to add more information. And, I have merged the extra account you appear to have accidentally created. — jjnguy
What version of GPDB are you using? Can you post a full explain plan? — Bart K

Yury Kochubeev Yury Kochubeev · Accepted Answer · 2011-12-16T09:03:07

You should pay attention to constraints applied to your partitions. To allow optimizer correctly exclude some partitions from scanning, you should help him. In your case, you should use explicit typecasting: (GP could not automatically understand on the planning stage that stings like 'yyyy-mm-dd' is actually timestamp)

select max(day_bucket) 
from fact_table 
where day_bucket >= '2011-09-11 00:00:00'::timestamp 
  and day_bucket <  '2011-12-14'::timestamp

greenplum partition optimization

1 Answers