
Hey all, I have a standalone Spark cluster on AWS with 1 master and 1 slave node. I have a folder in my home directory called ~/Notebooks. This is where I launch Jupyter notebooks and connect to Jupyter in my browser. I also have a file in there called people.json (a simple JSON file).

I try running this code:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

# Connect to the standalone cluster's master
conf = SparkConf().setAppName('Practice').setMaster('spark://ip-172-31-2-186:7077')
sc = SparkContext(conf=conf)

sqlContext = SQLContext(sc)

# Relative path, resolved against the local filesystem of whichever node runs the task
df = sqlContext.read.json("people.json")

I get this error when I run that last line. I don't get it, the file is right there... Any ideas?

Py4JJavaError: An error occurred while calling o238.json. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 4.0 failed 4 times, most recent failure: Lost task 1.3 in stage 4.0 (TID 37, ip-172-31-7-160.us-west-2.compute.internal): java.io.FileNotFoundException: File file:/home/ubuntu/Notebooks/people.json does not exist

Are you sure this file is on all worker nodes too? - Aravind Yarram
Oh crap, I didn't realize it needs to be on the worker nodes... Does it even need to be on the master node then? - Neil

1 Answer


Make sure the file is available on all worker nodes. The best way is to use a shared file system (NFS, HDFS). See the External Datasets documentation.
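
For example, once people.json lives on storage every node can reach, point the reader at the shared URI instead of a node-local path. A minimal sketch, assuming a hypothetical HDFS namenode at ip-172-31-2-186:9000 and a hypothetical S3 bucket named my-bucket (substitute your own addresses and paths):

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

conf = SparkConf().setAppName('Practice').setMaster('spark://ip-172-31-2-186:7077')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Read from HDFS; the namenode host/port here are assumptions, use your own
df = sqlContext.read.json("hdfs://ip-172-31-2-186:9000/data/people.json")

# Or from S3, since the cluster runs on AWS (bucket name is hypothetical;
# needs the hadoop-aws jars on the classpath and AWS credentials configured):
# df = sqlContext.read.json("s3n://my-bucket/data/people.json")

df.show()

Alternatively, for a quick test without shared storage, copying people.json to the same absolute path on the master and every worker also works, since local file paths are resolved on each node individually.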