I have ORC data in S3 that looks like this:
s3://bucket/orc/clientId=client-1/year=2017/month=3/day=16/hour=20/
s3://bucket/orc/clientId=client-2/year=2017/month=3/day=16/hour=21/
s3://bucket/orc/clientId=client-3/year=2017/month=3/day=16/hour=22/
Every hour I run an EMR job that converts raw JSON in S3 to ORC and writes it out using the partitioned path convention above for Athena ingestion. After the EMR job completes, I run msck repair table so Athena can pick up the new partitions.
I have 3 related questions:
- Does running msck repair table in this scenario cost me money in AWS?
- The AWS docs say msck repair table can time out. Is there a way I can make a step in data pipeline that keeps rerunning this command until it completes successfully?
- I would prefer to add the partitions manually to Athena (since I know the year, month, day, and hour I'm working on). However, I do not know the clientId values, because there could be 1 to X of them and I don't know which ones exist at the time the EMR job runs. Is there a best-practice way to solve this problem (using Hive or something else)? I could make an S3 API call to list s3://bucket/orc/ and write code to iterate over the list and add the partitions manually. I'm hoping there is an easier way...
Note: when I say "add partitions manually" I mean doing something like this:
ALTER TABLE <athena table>
ADD PARTITION (clientId='client-1',year=2017,month=3,day=16,hour=20)
location 's3://bucket/orc/clientId=client-1/year=2017/month=3/day=16/hour=20/';
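
To make the S3-listing fallback concrete, here is a rough Python sketch of what I have in mind. It assumes boto3 is available and that the clientId folders sit directly under orc/. The function names and the table name are made up for illustration, and it does not check whether a given clientId actually has data for the hour being processed.

```python
def client_ids_from_prefixes(prefixes):
    """Pull clientId values out of S3 common prefixes like 'orc/clientId=client-1/'."""
    ids = []
    for prefix in prefixes:
        last = prefix.rstrip("/").split("/")[-1]
        if last.startswith("clientId="):
            ids.append(last[len("clientId="):])
    return ids


def add_partition_sql(table, bucket, client_id, year, month, day, hour):
    """Build one ALTER TABLE ... ADD PARTITION statement in the form shown above."""
    location = (
        f"s3://{bucket}/orc/clientId={client_id}/"
        f"year={year}/month={month}/day={day}/hour={hour}/"
    )
    return (
        f"ALTER TABLE {table} "
        f"ADD PARTITION (clientId='{client_id}',year={year},month={month},"
        f"day={day},hour={hour}) "
        f"location '{location}';"
    )


def list_client_prefixes(bucket, root="orc/"):
    """List the first-level 'folders' under root using a delimiter listing."""
    import boto3  # imported here so the helpers above work without AWS access

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    prefixes = []
    for page in paginator.paginate(Bucket=bucket, Prefix=root, Delimiter="/"):
        for entry in page.get("CommonPrefixes", []):
            prefixes.append(entry["Prefix"])
    return prefixes
```

The hourly step would then call list_client_prefixes("bucket"), pass the result through client_ids_from_prefixes, and submit one add_partition_sql(...) statement per clientId to Athena. That is the part I was hoping to avoid writing myself.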