0
votes

I have a large set of history log files on aws s3 that sum billions of lines,

I used a glue crawler with a grok deserializer to generate an external table on Athena, but querying it has proven to be unfeasible.

My queries have timed out and I am trying to find another way of handling this data.

From what I understand, through Athena, external tables are not actual database tables, but rather, representations of the data in the files, and queries are run over the files themselves, not the database tables.

How can I turn this large dataset into a query friendly structure?

Edit 1: For clarification, I am not interested in reshaping the hereon log files, those are taken care of. Rather, I want a way to work with the current file base I have on s3. I need to query these old logs and at its current state it's impossible. I am looking for a way to either convert these files into an optimal format or to take advantage of the current external table to make my queries.

Right now, by default of the crawler, the external tables are only partitined by day and instance, my grok pattern explodes the formatted logs into a couple more columns that I would love to repartition on, if possible, which I believe would make my queries easier to run.

2
try using parquet files to store it. and then try to upload it to athena?Yatish Kadam
that would be awesome if possible, I am trying to run on historic data, and the files are not very well formattedGuga Figueiredo
"My queries have timed out" -- what do you mean by this?Piotr Findeisen
I mean they have run for the whole 30 minutes and have timed out. Didn't complete the queryGuga Figueiredo

2 Answers

0
votes

Your where condition should be on partitions (at-least one condition). By sending support ticket, you may increase athena timeout. Alternatively, you may use Redshift Spectrum

But you may seriously thing to optimize query. Athena query timeout is 30min. It means your query ran for 30mins before timed out.

0
votes

By default athena times out after 30 minutes. This timeout period can be increased but raising a support ticket with AWS team. However, you should first optimize your data and query as 30 minutes is good time for executing most of the queries.

Here are few tips to optimize the data that will give major boost to athena performance:

  • Use columnar formats like orc/parquet with compression to store your data.
  • Partition your data. In your case you can partition your logs based on year -> month -> day.
  • Create larger and lesser number of files per partition instead of small and more number of files.

The following AWS article gives detailed information for performance tuning in amazon athena

Top 10 performance tuning tips for amazon-athena