I have a bucket in S3 with parquet files and partitioned by dates.
With the following query:
select
count(1)
from logs.logs_prod
where partition_1 = '2019' and partition_2 = '03'
Running that query in Athena directly, it executes in less than 10 seconds. But when I run the same query in Redshift, it is taking over 3 minutes. Both of them return the same correct value and, in this case, there are less than 80 thousand rows in that partition.
I'm using AWS Glue as Metadata Store for both Athena and Redshift.
The query plan for Redshift is the following:
QUERY PLAN
XN Limit (cost=250000037.51..250000037.51 rows=1 width=8)
-> XN Aggregate (cost=250000037.51..250000037.51 rows=1 width=8)
-> XN Partition Loop (cost=250000000.00..250000035.00 rows=1000 width=8)
-> XN Seq Scan PartitionInfo of logs.logs_prod (cost=0.00..15.00 rows=1 width=0)
Filter: (((partition_1)::text = '2019'::text) AND ((partition_2)::text = '03'::text))
-> XN S3 Query Scan logs_prod (cost=125000000.00..125000010.00 rows=1000 width=8)
-> S3 Aggregate (cost=125000000.00..125000000.00 rows=1000 width=0)
-> S3 Seq Scan logs.logs_prod location:"s3://logs-prod/" format:PARQUET (cost=0.00..100000000.00 rows=10000000000 width=0)
Is this issue a Redshift Spectrum configuration problem? Is it possible that the query in Redshift won't perform close to Athena?
SELECT uuid FROM spectrum.telemetry_event WHERE "tag" = 'my.test.tag' AND date = '2020-05-06';Query plan:XN Partition Loop \n -> XN Seq Scan \n Filter \n -> XN S3 Query Scan \n -> S3 Seq Scan- chas