2
votes

I am trying to read from and write to S3 buckets using PySpark with the help of these two libraries from Maven: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.7 and https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk/1.7.4, which are really old. I have tried different combinations of hadoop-aws and aws-java-sdk, but none of them work with PySpark 2.4.4. Does anyone know which versions of hadoop-aws and aws-java-sdk are compatible with Spark 2.4.4? A rough sketch of my setup is below.
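For context, this is roughly how I am pulling the libraries in and what fails for me (the bucket name and file names are placeholders; the versions are the ones I have been experimenting with):

from pyspark.sql import SparkSession

# pull the two connectors from Maven; these are the versions I have been trying
spark = (SparkSession.builder
         .appName("s3-test")
         .config("spark.jars.packages",
                 "org.apache.hadoop:hadoop-aws:2.7.7,com.amazonaws:aws-java-sdk:1.7.4")
         .getOrCreate())

# the kind of read/write that is not working for me
df = spark.read.csv("s3a://my-bucket/input.csv", header=True)
df.write.csv("s3a://my-bucket/output", mode="overwrite")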


1 Answer

2
votes

I am using the following:

Spark: 2.4.4
Hadoop: 2.7.3
Hadoop-AWS: hadoop-aws-2.7.3.jar
AWS-Java-SDK: aws-java-sdk-1.7.3.jar
Scala: 2.11

This works for me; use s3a://bucket-name/ paths for the bucket.
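For reference, a minimal sketch of how I wire it up in PySpark (the jar paths, credentials, and bucket name below are placeholders):

from pyspark.sql import SparkSession

# the two jars listed above, downloaded locally; paths are placeholders
spark = (SparkSession.builder
         .appName("s3a-example")
         .config("spark.jars", "/path/to/hadoop-aws-2.7.3.jar,/path/to/aws-java-sdk-1.7.4.jar")
         .getOrCreate())

# S3 credentials go into the Hadoop configuration used by the s3a connector
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "MY_ACCESS_KEY")  # placeholder
hadoop_conf.set("fs.s3a.secret.key", "MY_SECRET_KEY")  # placeholder

# read and write through the s3a:// scheme
df = spark.read.csv("s3a://bucket-name/input.csv", header=True)
df.write.csv(path="s3a://bucket-name/output", mode="overwrite", compression="none")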

(Note: for PySpark I used aws-java-sdk-1.7.4.jar instead, because otherwise I wasn't able to use

df.write.csv(path=path, mode="overwrite", compression="None")
)