2 votes

Using Beeline connected to SparkSQL 1.3, I am trying to create a table that uses S3 data (using the s3a protocol):

CREATE EXTERNAL TABLE mytable (...) STORED AS PARQUET LOCATION 's3a://mybucket/mydata';

I get the following error:

Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: AmazonClientException Unable to load AWS credentials from any provider in the chain (state=,code=0)

I have the following environment variables set in spark-env.sh:

AWS_ACCESS_KEY_ID=<my_access_key>
AWS_SECRET_ACCESS_KEY=<my_secret_key>
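
These are plain assignments; in case spark-env.sh needs an explicit export for the variables to reach the server process (the file is sourced as a shell script), the exported form would be:

export AWS_ACCESS_KEY_ID=<my_access_key>
export AWS_SECRET_ACCESS_KEY=<my_secret_key>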

I know this environment is being picked up, because the classpath is also set there and it pulls in the Hadoop tools lib (which contains the S3 connector). However, when I show the variables in Beeline, they come back as undefined:

0: jdbc:hive2://localhost:10000> set env:AWS_ACCESS_KEY_ID;
+------------------------------------+
|                                    |
+------------------------------------+
| env:AWS_ACCESS_KEY_ID=<undefined>  |
+------------------------------------+
1 row selected (0.112 seconds)
0: jdbc:hive2://localhost:10000> set env:AWS_SECRET_ACCESS_KEY;
+----------------------------------------+
|                                        |
+----------------------------------------+
| env:AWS_SECRET_ACCESS_KEY=<undefined>  |
+----------------------------------------+
1 row selected (0.009 seconds)

Setting fs.s3a.access.key and fs.s3a.secret.key also fails to have any effect:

0: jdbc:hive2://localhost:10000> set fs.s3a.access.key=<my_access_key>;
0: jdbc:hive2://localhost:10000> set fs.s3a.secret.key=<my_secret_key>;

Is there somewhere else I need to set these credentials?

FWIW, I can successfully use hadoop fs -ls s3a://mybucket/mydata to list the files.

UPDATE:

I added the following to hive-site.xml:

<property>
  <name>fs.s3a.access.key</name>
  <value>my_access_key</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>my_secret_key</value>
</property>
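
For what it's worth, the same two keys can also be supplied through Spark's own configuration by prefixing them with spark.hadoop. in $SPARK_HOME/conf/spark-defaults.conf (a general Spark mechanism for passing Hadoop properties; I have not verified it in this particular setup):

spark.hadoop.fs.s3a.access.key    <my_access_key>
spark.hadoop.fs.s3a.secret.key    <my_secret_key>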

I can now create the table without error, but any attempt to query it results in this error:

Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 0.0 failed 1 times, most recent failure: 
Lost task 0.0 in stage 0.0 (TID 0, localhost): com.amazonaws.AmazonClientException: 
Unable to load AWS credentials from any provider in the chain

1 Answer

4 votes

The solution was to copy my hdfs-site.xml file (which contains the fs.s3a.access.key and fs.s3a.secret.key values) into $SPARK_HOME/conf. Then it magically worked.
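
For reference, the relevant hdfs-site.xml entries look like this (values are placeholders):

<property>
  <name>fs.s3a.access.key</name>
  <value>my_access_key</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>my_secret_key</value>
</property>

And the copy itself; I'm assuming HADOOP_CONF_DIR points at the directory that holds hdfs-site.xml, so adjust the path to your layout:

cp $HADOOP_CONF_DIR/hdfs-site.xml $SPARK_HOME/conf/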