I'm running my Spark application with the yarn-cluster master.
What does the app do?
- An external service generates a JSON file in response to an HTTP request to a REST service
- Spark needs to read this file and do some work after parsing the JSON
The simplest solution that came to mind was to use --files to ship that file. In yarn-cluster mode, reading a file means it must be available on HDFS (if I'm right?), and my file gets copied to a path like this:
/hadoop_user_path/.sparkStaging/spark_applicationId/myFile.json
I can of course read it there, but I cannot find a way to obtain this path from any configuration / SparkEnv object, and hardcoding .sparkStaging in Spark code seemed like a bad idea.
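For reference, this is roughly the hardcoding I would end up with if I went down that road (just a sketch, with sc being my SparkContext; I'm assuming sc.applicationId returns the running application's id, and /hadoop_user_path/ is the placeholder from above):

    val appId = sc.applicationId                        // id of the running YARN application
    val stagedPath = s"/hadoop_user_path/.sparkStaging/$appId/myFile.json"   // hand-built staging path
    val jsonStringData = sc.textFile(stagedPath)        // read the staged copy directly
    val parsed = sqlContext.read.json(jsonStringData)   // then parse the JSON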
Why does the simple approach:

    val jsonStringData = spark.textFile(myFileName)
    sqlContext.read.json(jsonStringData)

fail to read a file passed with --files, throwing a FileNotFoundException? Why does Spark look for the file in hadoop_user_folder only?
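My (possibly wrong) understanding of the failure: a path without a scheme is resolved against the default filesystem and, when relative, against my HDFS home folder, so the call above effectively ends up looking for something like this (placeholders as above):

    // my guess at what the unqualified name effectively resolves to:
    sqlContext.read.json(sc.textFile("/hadoop_user_folder/myFile.json"))   // -> FileNotFoundException
    // ...while the staged copy actually lives under:
    //   /hadoop_user_path/.sparkStaging/spark_applicationId/myFile.json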
My solution, which works for now:
Just before running Spark, I copy the file to the proper HDFS folder, pass the filename as a Spark argument, process the file from that known path, and after the job is done I delete the file from HDFS (a rough sketch of the driver side is below).
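The driver-side part looks roughly like this (a sketch only; args(0) and the folder layout are my own conventions, not anything Spark mandates):

    import org.apache.hadoop.fs.{FileSystem, Path}

    // The file was already copied to a known HDFS folder before spark-submit,
    // and its path is passed in as the first application argument.
    val jsonPath = args(0)                              // e.g. "/some/known/folder/myFile.json"
    val df = sqlContext.read.json(jsonPath)             // file is on HDFS, so this works
    // ... do the actual work with df ...

    // Delete the temporary file from HDFS once the job is done.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    fs.delete(new Path(jsonPath), false)                // false = non-recursive delete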
I thought passing the file with --files would let me forget about copying and deleting it myself. Something like pass-process-and-forget.
So how do you read a file passed with --files? Is the only solution to build the path by hand, hardcoding the ".sparkStaging" folder path?