I am using PySpark in PyCharm to read files from S3, and I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o26.partitions. org.apache.hadoop.security.AccessControlException: Permission denied: s3n://2017/01/22/20/firenam:
The code looks like this:
# sc is an existing SparkContext; pass the AWS credentials to the
# underlying Hadoop S3 connector
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3n.awsAccessKeyId", "myaccesskey")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "MySecretKey")
# Read the file and force an action so the S3 access actually happens
temp = sc.textFile("s3n://2017/01/22/filename")
temp.count()
Downloading the same file from S3 with Boto3 in plain Python succeeds, so the credentials themselves appear to be valid.
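For comparison, the Boto3 call that works is roughly the following sketch (the bucket and local file names here are placeholders for the real ones):

import boto3

# Same credentials as in the Spark job; download the object to a local file
s3 = boto3.client(
    "s3",
    aws_access_key_id="myaccesskey",
    aws_secret_access_key="MySecretKey",
)
s3.download_file("mybucket", "2017/01/22/filename", "/tmp/filename")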
Change "s3n" to "s3a" still failed, with a different exception:
error returned : java.lang.NoClassDefFoundError: org/apache/hadoop/fs/GlobalStorageStatistics$StorageStatisticsProvider
I've also tried exporting the following environment variables:
AWS_ACCESS_KEY_ID=myaccesskey
AWS_SECRET_ACCESS_KEY=mysecretkey
and setting them explicitly in os.environ (sketched below); both attempts failed.
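The os.environ variant was essentially this minimal sketch (assuming the variables are set before the SparkContext is created):

import os

# Same credentials as above, placed in the process environment so the
# AWS SDK can pick them up
os.environ["AWS_ACCESS_KEY_ID"] = "myaccesskey"
os.environ["AWS_SECRET_ACCESS_KEY"] = "mysecretkey"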
My environment is:
OS: Mac Sierra 10.12.6
Spark: 2.2.0
Python: 3.6.1
I set the following submit parameters in the code:
SUBMIT_ARGS = "--master local[*] --jars /ExternalJar/aws-java-sdk-1.7.4.jar,/ExternalJar/hadoop-aws-2.7.3.jar pyspark-shell"
The job is run directly from the PyCharm IDE.
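For completeness, these args are wired up roughly as follows before the SparkContext is created (the appName is arbitrary); PYSPARK_SUBMIT_ARGS is the standard hook PySpark reads at startup:

import os
from pyspark import SparkContext

os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
# SparkContext picks up the --master and --jars settings from the env var
sc = SparkContext(appName="s3-read-test")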
Does anyone have any ideas?