
I have installed both Hadoop and Spark locally on a Windows machine.

I can access HDFS files in Hadoop; e.g.,

hdfs dfs -tail hdfs:/out/part-r-00000

works as expected. However, if I try to access the same file from the Spark shell, e.g.,

val f = sc.textFile("hdfs:/out/part-r-00000")

I get an error that the file does not exist. Spark can access files in the Windows file system using the file:/... syntax, though.

I have set the HADOOP_HOME environment variable to c:\hadoop, which is the folder containing the Hadoop install (in particular, winutils.exe, which seems to be necessary for Spark, is in c:\hadoop\bin).

Since the HDFS data seems to be stored in the c:\tmp folder, I was wondering whether there is a way to let Spark know about this location.

Any help would be greatly appreciated. Thank you.

I just realized I posted this in Data Science - I think it rather belongs on Stack Overflow - sorry about that. - wawrzeniec
Spark needs to know about your hadoop-env.sh, core-site.xml, and maybe hdfs-site.xml files - OneCricketeer
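
For illustration, here is a minimal sketch of what this comment suggests: loading Hadoop's own config files into the Spark context so that HDFS paths resolve against the right NameNode rather than the local file system. The config directory c:\hadoop\etc\hadoop is only an assumption about where this install keeps core-site.xml and hdfs-site.xml; adjust it to your layout.

// Run inside spark-shell, where `sc` already exists.
import org.apache.hadoop.fs.Path

// Assumed config location: c:\hadoop\etc\hadoop (adjust to your install).
sc.hadoopConfiguration.addResource(new Path("c:/hadoop/etc/hadoop/core-site.xml"))
sc.hadoopConfiguration.addResource(new Path("c:/hadoop/etc/hadoop/hdfs-site.xml"))

// Subsequent reads now use the fs.defaultFS declared in those files.
val f = sc.textFile("hdfs:/out/part-r-00000")
f.take(5).foreach(println)

If those files are already on the classpath (for example via the HADOOP_CONF_DIR environment variable), the shell should pick them up without this step.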

1 Answer


If you are getting a "file does not exist" error, it means your Spark application (the code snippet) is able to connect to HDFS; it is the HDFS file path you are using that seems to be wrong.

This should solve your issue:

val f = sc.textFile("hdfs://localhost:8020/out/part-r-00000")
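
Note that 8020 is the classic default NameNode RPC port, but the host and port that actually work are whatever fs.defaultFS is set to in your core-site.xml (single-node tutorials often use hdfs://localhost:9000), so treat the address above as an assumption to verify. If you prefer to keep the short hdfs:/... paths from the question, a minimal sketch, assuming that same address, is to set the default file system in the shell so that authority-less paths resolve against it:

// Assumption: this address must match fs.defaultFS in core-site.xml.
sc.hadoopConfiguration.set("fs.defaultFS", "hdfs://localhost:8020")

// With the default file system pointing at HDFS, the original path resolves:
val f = sc.textFile("hdfs:/out/part-r-00000")
f.take(5).foreach(println)

The same effect can be had at startup with spark-shell --conf spark.hadoop.fs.defaultFS=hdfs://localhost:8020, since spark.hadoop.* properties are copied into the Hadoop configuration.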