I've been reading this AWS blog article, and it made sense to me up until the part about partitions. The query it uses to create the table looks like this:
CREATE EXTERNAL TABLE IF NOT EXISTS elb_logs_raw_native_part (
  request_timestamp string,
  elb_name string,
  request_ip string,
  request_port int,
  backend_ip string,
  backend_port int,
  request_processing_time double,
  backend_processing_time double,
  client_response_time double,
  elb_response_code string,
  backend_response_code string,
  received_bytes bigint,
  sent_bytes bigint,
  request_verb string,
  url string,
  protocol string,
  user_agent string,
  ssl_cipher string,
  ssl_protocol string )
PARTITIONED BY (year string, month string, day string) -- Where does Athena get this data?
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:\-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$' )
LOCATION 's3://athena-examples/elb/raw/';
What's confusing to me is that it says "partition by year" (among other things), but nowhere else in that SQL does it specify which part of the data is the year. Also, none of these columns has a date type. So how does Athena know how to partition this data when you haven't told it which part of the data is the year, month, or day?
In the context of the blog article, it says the year comes from the file name, but there was no step that tells Athena that. The article says the logs use this pre-defined format: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html#access-log-entry-format but that format doesn't have a year field that I can see.
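To show what I mean: I would have expected some explicit step that maps each partition to a location, something like the statement below. (I'm guessing at this; the article never shows such a step at this point, and the partition values and path here are made up by me.)

ALTER TABLE elb_logs_raw_native_part
ADD PARTITION (year = '2015', month = '01', day = '01')
LOCATION 's3://athena-examples/elb/raw/2015/01/01/';

Without something like that, I don't see where the year/month/day values could come from.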
Edit: The article was not very explicit about this, but I think it might be saying that every PARTITIONED BY column corresponds to a sub-directory inside the S3 bucket. In other words, the first element in the PARTITIONED BY clause (year in this case) is the first sub-directory under the table's location, and so on.
That only partially makes sense to me, because the same article says, "You can partition your data across multiple dimensions—e.g., month, week, day, hour, or customer ID—or all of them together." I don't understand how you could partition across all of those dimensions at once if each one comes from a sub-directory, unless you had a ton of duplication in your bucket.
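To make the concern concrete: if "all of them together" just means nested sub-directories, I'd expect a layout like the one below (these paths are entirely made up by me, not from the article):

s3://my-bucket/logs/year=2015/month=01/day=01/hour=00/customer=42/file1.log
s3://my-bucket/logs/year=2015/month=01/day=01/hour=00/customer=43/file2.log

That handles one fixed nesting order, but if I later wanted to query primarily by customer ID rather than by date, it seems like I'd need a second copy of the data laid out as customer=.../year=.../..., which is the duplication I'm worried about.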