1 vote

I am trying to query an AWS S3 Inventory list using Athena. I can do this if I have only one source bucket. I am not sure how to configure this to work with multiple source buckets.

We are using all the default configuration options with data format as CSV. The S3 Inventory destination key pattern for hive output looks like this:

 destination-prefix/source-bucket/config-ID/hive/dt=YYYY-MM-DD-HH-MM/symlink.txt

So when I am creating an Athena table I have to use a static hive path.

CREATE EXTERNAL TABLE your_table_name(
  -- column names
)
PARTITIONED BY (dt string)
-- options omitted
LOCATION 's3://destination-prefix/source-bucket/config-ID/hive/';

So if I want to query inventory data for multiple source buckets, it seems like I have to create a table for each "source-bucket".

Alternatively, without using Athena, I am trying to do this with the AWS CLI:

aws s3 ls s3://our-bucket-name/prefix/abc --recursive | awk '$1 > "2019-04-01"'

But this lists every single file first, as there is no option to set --include or --exclude with "s3 ls".

Finally, the questions are:

  1. Can I configure S3 Inventory to generate inventories for multiple S3 buckets so that it puts everything into the same "hive" directory (i.e. ignore the "source-bucket" prefix while generating the inventory)?

  2. Is it possible to configure Athena to read from multiple hive locations? But with the possibility of new buckets getting created and dropped, I guess this gets ugly.

  3. Is there any alternative way to query the inventory list, other than Athena, the AWS CLI, or writing custom code that uses the manifest.json file to find these CSV files?


1 Answer

2 votes

You can't make S3 Inventory create one inventory for multiple buckets, unfortunately. You can, however, splice the inventories together into one table.

The guide you link to says to run MSCK REPAIR TABLE … to load your inventories. I would recommend against that: it creates a table whose partitions each represent the inventory at some point in time. That is useful if you want to compare what's in a bucket from day to day or week to week, but most of the time you want to know what's in the bucket right now. Getting multiple inventories into the same table is another reason not to run that command.

First you change how you create the table slightly:

CREATE EXTERNAL TABLE your_table_name(
  -- column names
)
PARTITIONED BY (bucket_name string)
-- options omitted
LOCATION 's3://destination-prefix/source-bucket/config-ID/hive/';

Notice that I changed the partitioning from dt string to bucket_name string.

Next you add the partitions manually:

ALTER TABLE your_table_name ADD
  PARTITION (bucket_name = 'some-bucket') LOCATION 's3://destination-prefix/some-bucket/config-ID1/hive/dt=YYYY-MM-DD/'
  PARTITION (bucket_name = 'another-bucket') LOCATION 's3://destination-prefix/another-bucket/config-ID2/hive/dt=YYYY-MM-DD/';

The locations should be the S3 URIs to the latest dates under the "hive" prefix of the inventory for each bucket.
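Once those partitions exist, a single query covers all buckets. For example (assuming your inventory configuration includes the optional size field; the table and column names here just follow the examples above):

```sql
-- Object count and total size per source bucket, across all inventories
SELECT bucket_name,
       COUNT(*)  AS object_count,
       SUM(size) AS total_bytes
FROM your_table_name
GROUP BY bucket_name;
```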

The downside of this is that when new inventories are delivered you will need to update the table to point to these new locations. You can do this by first dropping the partitions:

ALTER TABLE your_table_name DROP
  PARTITION (bucket_name = 'some-bucket'),
  PARTITION (bucket_name = 'another-bucket');

and then adding them again using the same SQL as above, but with new S3 URIs.
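This drop-and-re-add cycle can be scripted. Below is a rough shell sketch; everything in it is hypothetical (the table name, destination-prefix, and config IDs come from the examples above), and the AWS calls are left commented out so the helpers themselves are self-contained:

```shell
#!/bin/sh
# Sketch: rebuild the ADD PARTITION statement for one bucket's inventory.
# Assumes inventories live under s3://destination-prefix/<bucket>/<config-id>/hive/
# with partition directories named dt=YYYY-MM-DD-HH-MM.

# Build the ALTER TABLE ... ADD PARTITION SQL for a bucket and dt directory.
build_add_partition() {
  bucket="$1"; config_id="$2"; dt="$3"
  printf "ALTER TABLE your_table_name ADD PARTITION (bucket_name = '%s') LOCATION 's3://destination-prefix/%s/%s/hive/%s/';" \
    "$bucket" "$bucket" "$config_id" "$dt"
}

# Find the newest dt= prefix for a bucket's inventory (needs the AWS CLI).
latest_dt() {
  aws s3 ls "s3://destination-prefix/$1/$2/hive/" \
    | awk '{print $2}' | grep '^dt=' | sort | tail -1 | tr -d '/'
}

# Example wiring (commented out so the sketch runs without AWS access):
# dt=$(latest_dt some-bucket config-ID1)
# sql=$(build_add_partition some-bucket config-ID1 "$dt")
# aws athena start-query-execution --query-string "$sql" \
#   --result-configuration OutputLocation=s3://your-query-results-bucket/
```

You would run the same drop statement as above first, then loop over your buckets calling these helpers, which also handles buckets being created and dropped over time.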