I am trying query AWS S3 Invetory List using Athena. I can do this if I have only one source bucket. I am not sure how to configure this to work with multiple source buckets.
we are using all the default configuration options with data format as CSV. S3 Inventory destination bucket name pattern for hive is like this:
destination-prefix/source-bucket/config-ID/hive/dt=YYYY-MM-DD-HH-MM/symlink.txt
So when I am creating an Athena table I have to use static hive path.
CREATE EXTERNAL TABLE your_table_name(
//column names
)
PARTITIONED BY (dt string)
//options ignored
LOCATION 's3://destination-prefix/source-bucket/config-ID/hive/';
So if I want to query inventory data for multiple source buckets, it seems like I have to create a table for each "source-bucket".
Alternatively, with out using Athena, I am trying to do this with AWS CLI
aws s3 ls s3://our-bucket-name/prefix/abc --recursive | awk ‘$1 > “2019-04-01”
But this gets every single file first as there is no option to set --include or --exclude with "S3 ls"
finally, the questions are:
Can I configure AWS Inventory to generate inventory for multiple S3 buckets to that it puts everything into the same "hive" directory (i.e. ignore "source-bucket" prefix while generating Invetory)?
Is it possible to configure Athena read from multiple hive locations? But with the possibility of new buckets getting created and dropped, I guess this gets ugly.
Is there any alternative way to query inventory list instead of Athena or AWS CLI or writing a custom code to use manifest.json file to get these csv files.