I have created an AWS EMR cluster and uploaded the following file:
sparkify_log_small.json
I then created an EMR Jupyter notebook with the code below, expecting it to read the file from the hadoop user's home directory:
sparkify_log_data = "sparkify_log_small.json"
df = spark.read.json(sparkify_log_data)
df.persist()
df.head(5)
But when I submit the code, I get the error below:
'Path does not exist: hdfs://ip-172-31-50-58.us-west-2.compute.internal:8020/user/livy/sparkify_log_small.json;'
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 274, in json
return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Path does not exist: hdfs://ip-172-31-50-58.us-west-2.compute.internal:8020/user/livy/sparkify_log_small.json;'
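As far as I can tell from the error, the bare file name is being treated as a relative path and resolved against an HDFS home directory for the session user, not against a local directory on the master node. Here is a minimal sketch of that resolution, assuming the usual Hadoop convention that relative paths land under /user/<username> on the default filesystem. `resolve_relative` is a hypothetical helper written only for illustration; it is not a Spark or Hadoop API.

```python
import posixpath

# Default filesystem copied from the error message above.
DEFAULT_FS = "hdfs://ip-172-31-50-58.us-west-2.compute.internal:8020"


def resolve_relative(path, user, default_fs=DEFAULT_FS):
    """Hypothetical helper: mimic how a scheme-less path gets qualified.

    Assumption: relative paths resolve under /user/<user> on the default
    filesystem; absolute paths keep their location on that filesystem.
    """
    if "://" in path:
        return path  # already fully qualified, nothing to resolve
    if not path.startswith("/"):
        path = posixpath.join("/user", user, path)
    return default_fs + path


# Reproduces the path from the error when the session user is livy.
print(resolve_relative("sparkify_log_small.json", "livy"))
# What I expected instead, if the session ran as hadoop.
print(resolve_relative("sparkify_log_small.json", "hadoop"))
```

If this reading is right, the first printed path matches the one in the AnalysisException, which would explain why nothing under a local home directory is ever looked at.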
From googling, I learned that the default YARN user is livy. How can I change the user in the Jupyter notebook from livy to hadoop, or otherwise point the read at the right directory?
I also tried creating a folder and copying the file from /home/hadoop/sparkify_log_small.json to /home/livy/sparkify_log_small.json,
but that did not work.
Basically, I am trying to read a file that sits on the EC2 master node from the notebook.
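For reference, these are the fully qualified forms I could pass to the reader instead of a bare file name. This is only a sketch under assumptions: `candidate_paths` is a hypothetical helper, `my-bucket` is a placeholder bucket name, and the HDFS variant presumes the file has first been copied into HDFS (for example with `hdfs dfs -put`), which I have not done yet.

```python
def candidate_paths(filename, user="hadoop", bucket="my-bucket"):
    """Hypothetical helper: build explicit URIs for the places the file could live.

    Assumptions: `bucket` is a placeholder, and `user` is the owner of the
    HDFS and local home directories being tried.
    """
    return {
        # HDFS home directory on the cluster's default namenode.
        "hdfs": f"hdfs:///user/{user}/{filename}",
        # Local filesystem; the file would need to exist on every node that reads it.
        "local": f"file:///home/{user}/{filename}",
        # S3 location, readable from EMR.
        "s3": f"s3://{bucket}/{filename}",
    }


paths = candidate_paths("sparkify_log_small.json")
for kind, uri in paths.items():
    print(kind, uri)

# In the notebook, one of these would then replace the bare file name, e.g.:
# df = spark.read.json(paths["hdfs"])
```

Using an explicit scheme at least makes it unambiguous which filesystem the notebook is reading from, regardless of which user the session runs as.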