3
votes

Given a data source of 1.4 TB of Parquet data on S3, partitioned by a timestamp field (so the partitions are year / month / day), I am querying a specific day of data (2.6 GB) and retrieving all available fields in the Parquet files via Redshift Spectrum with this query:

SELECT *
FROM my_external_schema.my_external_table
WHERE year = '2020' and month = '01' and day = '01'

The table is made available via a Glue Crawler that points at the top-level "folder" in S3; this creates a Database, which I then link to a new external schema with this command:

create external schema my_external_schema from data catalog
database 'my_external_schema'
iam_role 'arn:aws:iam::123456789:role/my_role'
region 'my-region-9';
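
For reference, the partitions that the crawler registered in the catalog can be listed from Redshift with something along these lines (a sketch against the SVV_EXTERNAL_PARTITIONS system view, using the schema and table names from above):

-- List the partitions registered for the external table
SELECT *
FROM svv_external_partitions
WHERE schemaname = 'my_external_schema'
  AND tablename = 'my_external_table';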

Analysing the table in my IDE, I can see it was generated by this statement:

create external table my_external_schema.my_external_table
    (
    id string,
    my_value string,
    my_nice_value string
    )
partitioned by (year string, month string, day string)
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
with serdeproperties ('serialization.format'='1')
stored as
inputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
location 's3://my-bucket/my/location/'
table properties ('CrawlerSchemaDeserializerVersion'='1.0', 'CrawlerSchemaSerializerVersion'='1.0', 'UPDATED_BY_CRAWLER'='my_crawler');

When I analyse the query in Redshift, I see that ~86 GB of data was scanned instead.

How is that possible? It is a concern because Redshift Spectrum bills based on the amount of data scanned, and it looks like the service is scanning more than 30 times the actual amount of data in that partition.

I also tried executing the same query in Athena, where only 2.55 GB of data was scanned (definitely more reasonable).

I can't give too many details on the cluster size, but assume that those 86 GB of scanned data would fit in the cluster's memory.
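
As a side note, partition pruning itself can be sanity-checked with something along these lines against the SVL_S3PARTITION system view (a sketch, using a hypothetical query-id placeholder):

-- Compare how many partitions qualified for the query against the total available
SELECT query,
       sum(total_partitions) AS total_partitions,
       sum(qualified_partitions) AS qualified_partitions
FROM svl_s3partition
WHERE query = <my-query-id>
GROUP BY query;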

1
What is the total size of the data (not partitioned)? Are you sure that the data is partitioned in the correct manner? Could you show us your CREATE EXTERNAL TABLE command, and an example of the path names used for the data? - John Rotenstein
Hey @JohnRotenstein, I updated the question to try to answer all your questions. Does it give you all the information you asked for? Any other doubts? Thanks for your attention anyway ;) - Vzzarr
Would you be able to run the same query in Athena to compare how much data is scanned there? It would also be good if you could show the create table statement. - Philipp Johannis
Hi @PhilippJohannis, as I wrote in the question, the table is automatically created by the Glue Crawler. Is there anything else you'd like to know? - Vzzarr
Well, you can still get the create table statement by running SHOW CREATE TABLE .. - at least in Athena. Or you can also share a screenshot of the definition in Glue. - Philipp Johannis

1 Answer

3
votes

The problem seems to be in the AWS Redshift Console.

If we analyse the query from "query details" in the Redshift console, I can see that "Total data scanned" reports 86 GB. As Vzzarr mentioned, I ran the same query on Athena to compare the performance. The execution time was basically the same, but the amount of data scanned was completely different: 2.55 GB.

I did the same comparison with other queries on the S3 external schema, with and without using the partition columns: the total GB scanned differed in every test, sometimes by a lot (320 MB in Redshift Spectrum vs 20 GB in Athena).

I decided to look at the system tables in Redshift in order to understand how the query on the external schema was working. I did a very simple test using SVL_S3QUERY:

SELECT (cast(s3_scanned_bytes as double precision) / 1024 / 1024 / 1024) as gb_scanned,
       s3_scanned_rows,
       query
FROM SVL_S3QUERY
WHERE query = '<my-query-id>'
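
If the check is run in the same session, right after the Spectrum query, the query ID can be obtained with pg_last_query_id():

-- Returns the ID of the most recently executed query in this session
SELECT pg_last_query_id();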

The result was completely different from what the AWS Redshift console says for the same query. Not only was the gb_scanned figure different, but s3_scanned_rows was too. The query returned a total of 2.55 GB of data scanned, exactly the same as what Athena reported.

To confirm the numbers in SVL_S3QUERY, I used AWS Cost Explorer to cross-check the total GB scanned in a day against how much we paid for Redshift Spectrum: the numbers were basically the same.
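
For reference, the daily total used for that comparison can be pulled with a query along these lines (a sketch; it assumes SVL_S3QUERY also exposes a starttime column alongside s3_scanned_bytes):

-- Total GB scanned by Spectrum per day, to compare with the Cost Explorer figures
SELECT trunc(starttime) AS scan_day,
       sum(s3_scanned_bytes) / 1024.0 / 1024 / 1024 AS gb_scanned
FROM svl_s3query
GROUP BY trunc(starttime)
ORDER BY scan_day;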

At this point, I don't know where, or from which table, the AWS Redshift console takes the query details, but they seem to be completely wrong.