1
votes

I have a bucket in S3 with parquet files and partitioned by dates.

With the following query:

select
    count(1)
from logs.logs_prod
where partition_1 = '2019' and partition_2 = '03'

Running that query in Athena directly, it executes in less than 10 seconds. But when I run the same query in Redshift, it is taking over 3 minutes. Both of them return the same correct value and, in this case, there are less than 80 thousand rows in that partition.

I'm using AWS Glue as Metadata Store for both Athena and Redshift.

The query plan for Redshift is the following:

QUERY PLAN
XN Limit  (cost=250000037.51..250000037.51 rows=1 width=8)
  ->  XN Aggregate  (cost=250000037.51..250000037.51 rows=1 width=8)
        ->  XN Partition Loop  (cost=250000000.00..250000035.00 rows=1000 width=8)
              ->  XN Seq Scan PartitionInfo of logs.logs_prod  (cost=0.00..15.00 rows=1 width=0)
                    Filter: (((partition_1)::text = '2019'::text) AND ((partition_2)::text = '03'::text))
              ->  XN S3 Query Scan logs_prod  (cost=125000000.00..125000010.00 rows=1000 width=8)
                    ->  S3 Aggregate  (cost=125000000.00..125000000.00 rows=1000 width=0)
                          ->  S3 Seq Scan logs.logs_prod location:"s3://logs-prod/" format:PARQUET  (cost=0.00..100000000.00 rows=10000000000 width=0)

Is this issue a Redshift Spectrum configuration problem? Is it possible that the query in Redshift won't perform close to Athena?

1
I'm running into what looks like a similar issue. I have a very simple query on an external table with parquet files partitioned by a varchar and a date. I write a query targeting a single partition with a single parquet file, and it takes over a minute to return. I don't want to answer with a question, so I apologize for formatting... Query: SELECT uuid FROM spectrum.telemetry_event WHERE "tag" = 'my.test.tag' AND date = '2020-05-06'; Query plan: XN Partition Loop \n -> XN Seq Scan \n Filter \n -> XN S3 Query Scan \n -> S3 Seq Scan - chas

1 Answers

0
votes

I don't think you should put too much weight to this test. From the plan it looks like it's not taking advantage of the fact that Parquet files contain metadata about the number of rows in each file – which is something I believe Athena/Parquet can do.

The actual real world performance of Athena vs. Redshift Spectrum is difficult to measure since with Athena you don't know how much capacity you get (but it's a lot) and in Redshift Spectrum you get a dedicated capacity that is dependent on your cluster size. For Redshift clusters with ~20 CPUs I've found Athena to perform better for most queries, but larger Redshift clusters may get better performance.