2 votes

I've followed various published documentation on integrating Apache Hive 2.1.1 with AWS S3 via the s3a:// scheme, configuring fs.s3a.access.key and fs.s3a.secret.key in hadoop/etc/hadoop/core-site.xml and hive/conf/hive-site.xml.
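
The relevant snippet in both files looks roughly like the following (placeholder values; the real keys are redacted):

<configuration>
  <!-- placeholder credentials; real values redacted -->
  <property>
    <name>fs.s3a.access.key</name>
    <value>MY_AWS_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>MY_AWS_SECRET_ACCESS_KEY</value>
  </property>
</configuration>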

I am at the point where hdfs dfs -ls s3a://[bucket-name]/ works properly (it returns the S3 listing of that bucket), so I know my creds, bucket access, and overall Hadoop setup are valid.

hdfs dfs -ls s3a://[bucket-name]/

drwxrwxrwx   - hdfs hdfs          0 2017-06-27 22:43 s3a://[bucket-name]/files
...etc. 

hdfs dfs -ls s3a://[bucket-name]/files

drwxrwxrwx   - hdfs hdfs          0 2017-06-27 22:43 s3a://[bucket-name]/files/my-csv.csv

However, when I attempt to access the same S3 resources from Hive, e.g. run any CREATE SCHEMA or CREATE EXTERNAL TABLE statement using LOCATION 's3a://[bucket-name]/files/', it fails.

For example:

CREATE EXTERNAL TABLE IF NOT EXISTS mydb.my_table (
  my_table_id string,
  my_tstamp timestamp,
  my_sig bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3a://[bucket-name]/files/';

I keep getting this error:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: java.nio.file.AccessDeniedException s3a://[bucket-name]/files: getFileStatus on s3a://[bucket-name]/files: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: C9CF3F9C50EF08D1), S3 Extended Request ID: T2xZ87REKvhkvzf+hdPTOh7CA7paRpIp6IrMWnDqNFfDWerkZuAIgBpvxilv6USD0RSxM9ymM6I=)

This makes no sense: as the hdfs test above shows, I have access to the bucket, and I've added the proper creds to hive-site.xml.

NOTE: Using the same creds, I have this working for 's3n://'. It just fails for 's3a://'.

Anyone have any idea what's missing from this equation?

It sounds as if the aws-access-key-id lacks a permission it needs. Turning on logging for the bucket to determine exactly what request S3 is denying might help. - Michael - sqlbot
The access key id has all the perms it needs. It wouldn't work for hadoop if it didn't. This is some Hive peculiarity or bug. - axbo
The same access key id works when 's3n://' is used. It just fails on 's3a://' without proper error messaging despite all logging turned on. I expect that either there is yet another badly documented 's3a://' configuration setting that I need to use, or that the integration simply does not work as expected when specifying access key and secret key creds. - axbo
It could be either. But it also could be that s3a makes an unnecessary request that the others don't (e.g. get-object-acl) or doesn't make correct assumptions in cases where the data in S3 is not data that it created itself, with whatever quirks that might have. It's obviously contacting the service, so I'd still say look at the S3 logs. What requests are made? Are they actually being made with the provided credentials, or are they for some reason anonymous requests, etc. - Michael - sqlbot
We try our best with documentation, but as an OSS project, we welcome improvements. Are you using the fs.s3a.secret.key and fs.s3a.access.key names? They are different from the s3n ones. (This is just a HEAD request failing BTW) - stevel
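
As an aside for anyone hitting a similar failure: one way to see exactly which request s3a is sending, and with which credentials, as suggested in the comments above, is to turn the s3a client's logging up to DEBUG. A rough sketch, assuming the stock logging configs that ship with Hadoop and Hive 2.x:

# hadoop/etc/hadoop/log4j.properties (log4j 1.x syntax)
log4j.logger.org.apache.hadoop.fs.s3a=DEBUG

# hive/conf/hive-log4j2.properties (log4j2 syntax; the logger id "s3a" is arbitrary)
logger.s3a.name = org.apache.hadoop.fs.s3a
logger.s3a.level = DEBUG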

1 Answer

0 votes

Are you using EMR for your Hive environment? If so, EMR doesn't support s3a.
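
If it is EMR, the usual route is to point the table at the bucket through EMR's native s3:// filesystem (EMRFS) instead. A sketch using the same hypothetical table and bucket as in the question, with only the scheme changed:

CREATE EXTERNAL TABLE IF NOT EXISTS mydb.my_table (
  my_table_id string,
  my_tstamp timestamp,
  my_sig bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://[bucket-name]/files/';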