This answer covers how to access files over S3A from Spark 2.0.1 on Hadoop 2.7.3.
Copy the AWS jars (`hadoop-aws-2.7.3.jar` and `aws-java-sdk-1.7.4.jar`), which ship with Hadoop by default, into the Spark classpath directory that holds all the Spark jars.
Hint: I cannot point to the exact location here (it has to go into the property file), because I want to keep this answer generic across distributions and Linux flavors. The Spark classpath directory can be located with the find command below:
find / -name spark-core*.jar
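For example, assuming Hadoop is installed under /usr/local/hadoop and the find command reported the Spark jars directory as /usr/local/spark/jars (both paths are placeholders; substitute your own), the copy step could look like this:

# copy the S3A connector and its matching AWS SDK next to the other Spark jars
cp /usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.7.3.jar /usr/local/spark/jars/
cp /usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar /usr/local/spark/jars/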
In spark-defaults.conf (hint: it is usually located at /etc/spark/conf/spark-defaults.conf), add:
#make sure jars are added to CLASSPATH
spark.yarn.jars=file://{spark/home/dir}/jars/*.jar,file://{hadoop/install/dir}/share/hadoop/tools/lib/*.jar
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key={s3a.access.key}
spark.hadoop.fs.s3a.secret.key={s3a.secret.key}
#the three fs.s3a properties above can also be set at the Hadoop level in `core-site.xml` by dropping the spark.hadoop prefix.
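For reference, a sketch of the equivalent `core-site.xml` entries (these are the standard Hadoop S3A property names; the key and secret values stay as placeholders):

<!-- same S3A settings as above, without the spark.hadoop prefix -->
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>{s3a.access.key}</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>{s3a.secret.key}</value>
</property>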
In spark-submit, include the jars (`aws-java-sdk` and `hadoop-aws`) on --driver-class-path if needed. Note that spark-submit keeps only the last --driver-class-path value, so join multiple jars with a colon:
spark-submit --master yarn \
--driver-class-path {spark/jars/home/dir}/aws-java-sdk-1.7.4.jar:{spark/jars/home/dir}/hadoop-aws-2.7.3.jar \
other options
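Filling in the placeholders (the Spark jars directory, application class, jar, and bucket below are hypothetical, purely for illustration), a complete invocation could look like:

spark-submit --master yarn \
--driver-class-path /usr/local/spark/jars/aws-java-sdk-1.7.4.jar:/usr/local/spark/jars/hadoop-aws-2.7.3.jar \
--class com.example.WordCount \
/path/to/word-count.jar s3a://my-bucket/input/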
Note:
Make sure the Linux user has read privileges before running the `find` command, to avoid Permission denied errors.
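If some directories are not readable by the current user, run the search with elevated privileges or discard the error messages, for example:

# search as root, or keep the current user and hide unreadable paths
sudo find / -name "spark-core*.jar"
find / -name "spark-core*.jar" 2>/dev/null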