AWS Athena - Query over large external table generated from Glue crawler?

Question

I have a large set of history log files on aws s3 that sum billions of lines,

I used a glue crawler with a grok deserializer to generate an external table on Athena, but querying it has proven to be unfeasible.

My queries have timed out and I am trying to find another way of handling this data.

From what I understand, through Athena, external tables are not actual database tables, but rather, representations of the data in the files, and queries are run over the files themselves, not the database tables.

How can I turn this large dataset into a query friendly structure?

Edit 1: For clarification, I am not interested in reshaping the hereon log files, those are taken care of. Rather, I want a way to work with the current file base I have on s3. I need to query these old logs and at its current state it's impossible. I am looking for a way to either convert these files into an optimal format or to take advantage of the current external table to make my queries.

Right now, by default of the crawler, the external tables are only partitined by day and instance, my grok pattern explodes the formatted logs into a couple more columns that I would love to repartition on, if possible, which I believe would make my queries easier to run.

try using parquet files to store it. and then try to upload it to athena? — Yatish Kadam
that would be awesome if possible, I am trying to run on historic data, and the files are not very well formatted — Guga Figueiredo
I mean they have run for the whole 30 minutes and have timed out. Didn't complete the query — Guga Figueiredo

Sandeep Fatangare Sandeep Fatangare · Accepted Answer · 2019-09-26T03:42:24

Your where condition should be on partitions (at-least one condition). By sending support ticket, you may increase athena timeout. Alternatively, you may use Redshift Spectrum

But you may seriously thing to optimize query. Athena query timeout is 30min. It means your query ran for 30mins before timed out.

AWS Athena - Query over large external table generated from Glue crawler?

2 Answers