I am a newbie to Spark. I'm trying to read a local csv file within an EMR cluster. The file is located in: /home/hadoop/. The script that I'm using is this one:
spark = SparkSession \
.builder \
.appName("Protob Conversion to Parquet") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()\
df = spark.read.csv('/home/hadoop/observations_temp.csv, header=True)
When I run the script raises the following error message:
pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://ip-172-31-39-54.eu-west-1.compute.internal:8020/home/hadoop/observations_temp.csv
Then, I found out that I have to add file:// in the file path so it can read the file locally:
df = spark.read.csv('file:///home/hadoop/observations_temp.csv, header=True)
But this time, the above approach raised a different error:
Lost task 0.3 in stage 0.0 (TID 3,
ip-172-31-41-81.eu-west-1.compute.internal, executor 1): java.io.FileNotFoundException: File file:/home/hadoop/observations_temp.csv does not exist
I think is because the file// extension just read the file locally and it does not distribute the file across the other nodes.
Do you know how can I read the csv file and make it available to all the other nodes?