Spark yarn-cluster mode - read file passed with --files

Question

I'm running my spark application using yarn-cluster master.

What does the app do?

External service generates a jsonFile based on HTTP request to a RESTService
Spark needs to read this file and do some work after parsing the json

Simplest solution that came to mind was to use --files to load that file. In yarn-cluster mode reading a file means it must be available on hdfs (if I'm right?) and my file is being copied to path like this:

/hadoop_user_path/.sparkStaging/spark_applicationId/myFile.json

Where I can of course read it, but I cannot find a way to get this path from any configuration / SparkEnv object. And hardcoding .sparkStaging in spark code seamed like a bad idea.

Why simple:

val jsonStringData = spark.textFile(myFileName)
sqlContext.read.json(jsonStringData)

cannot read file passed with --files and throws FileNotFoundException? Why is spark looking for files in hadoop_user_folder only?

My solution which works for now:

Just before running spark, I copy file to proper hdfs folder, pass the filename as Spark argument, process the file from a known path and after the job is done I delete the file form hdfs.

I thought passing the file as --files would let me forget about saving and deleting this file. Something like pass-process-andforget.

How do you read a file passed with --files then? The only solution is with creating path by hand, hardcoding ".sparkStaging" folder path?

YoYo YoYo · Accepted Answer · 2015-11-20T16:19:45

The question is written very ambiguously. However, from what I seem to get is that you want to read a file from any location of your Local OS File System, and not just from HDFS.

Spark uses URI's to identify paths, and in the availability of a valid Hadoop/HDFS Environment, it will default to HDFS. In that case, to point to your Local OS FileSystem, in the case of for example UNIX/LINUX, you can use something like:

file:///home/user/my_file.txt

If you are using an RDD to read from this file, you run in yarn-cluster mode, or the file is accessed within a task, you will need to take care of copying and distributing that file manually to all nodes in your cluster, using the same path. That is what it makes it easy of first putting it on hfs, or that is what the --files option is supposed to do for you.

See more info on Spark, External Datasets.

For any files that were added through the --files option, or were added through SparkContext.addFile, you can get information about their location using the SparkFiles helper class.

Spark yarn-cluster mode - read file passed with --files

My solution which works for now:

4 Answers