
I am using PySpark to read S3 files in PyCharm. It returns the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o26.partitions. org.apache.hadoop.security.AccessControlException: Permission denied: s3n://2017/01/22/20/firenam:

The code looks like this:

# set the S3 credentials on the Hadoop configuration backing the SparkContext
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3n.awsAccessKeyId", "myaccesskey")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "MySecretKey")
temp = sc.textFile("s3n://2017/01/22/filename")
temp.count()

Downloading the same file from S3 with Boto3 in Python succeeds.

Changing "s3n" to "s3a" still fails, with a different exception:

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/GlobalStorageStatistics$StorageStatisticsProvider

I've also tried exporting the following environment variables:

AWS_ACCESS_KEY_ID=myaccesskey
AWS_SECRET_ACCESS_KEY=mysecretkey

and setting them explicitly in os.environ; both failed.

My environment is:

OS: Mac Sierra 10.12.6
Spark: 2.2.0
Python: 3.6.1

I have the following submit parameters in my code:

SUBMIT_ARGS = "--master local[*] --jars /ExternalJar/aws-java-sdk-1.7.4.jar,/ExternalJar/hadoop-aws-2.7.3.jar pyspark-shell"

The job is run directly in the PyCharm IDE.
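For completeness, this is roughly how the submit parameter is wired up in the script (a sketch; the actual SparkContext creation is commented out here, since the point is only that the environment variables must be set first):

```python
import os

# PYSPARK_SUBMIT_ARGS must be exported before pyspark is imported,
# otherwise the extra jars never reach the driver classpath.
SUBMIT_ARGS = (
    "--master local[*] "
    "--jars /ExternalJar/aws-java-sdk-1.7.4.jar,"
    "/ExternalJar/hadoop-aws-2.7.3.jar pyspark-shell"
)
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS

# the credentials I also tried exporting:
os.environ["AWS_ACCESS_KEY_ID"] = "myaccesskey"
os.environ["AWS_SECRET_ACCESS_KEY"] = "mysecretkey"

# only after the env vars are in place:
# from pyspark import SparkContext
# sc = SparkContext()
```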

Does anyone have any clues?


1 Answer


It looks like you didn't set a bucket name in s3n://2017/01/22/filename. A valid path should look like s3n://bucket_name/path/to/file.
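To see why: Hadoop treats the "host" portion of an s3n:// (or s3a://) URI as the bucket name, so your path asks for a bucket literally named 2017, which you don't have permission to access. A quick sketch with the standard library shows the parsing (my-bucket is a placeholder for your real bucket):

```python
from urllib.parse import urlparse

# The host part of the URI is what Hadoop takes as the bucket name.
bad = urlparse("s3n://2017/01/22/filename")
print(bad.netloc)   # -> "2017": Spark looks for a bucket literally named 2017

good = urlparse("s3n://my-bucket/2017/01/22/filename")
print(good.netloc)  # -> "my-bucket"
print(good.path)    # -> "/2017/01/22/filename"
```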