
I am trying to run my Spark program using spark-submit on a YARN cluster. I am reading an external config file that is stored in HDFS, and I am running the job as:

./spark-submit --class com.sample.samplepack.AnalyticsBatch --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 --driver-java-options "-Dext.properties.dir=hdfs://namenode:8020/tmp/some.conf" PocSpark-1.0-SNAPSHOT-job.jar 10

But it is unable to read the file from HDFS. I have also tried running the job in local mode with the conf file given as an HDFS path, and I get:

java.io.FileNotFoundException: hdfs:/namenode:8020/tmp/some.conf (No such file or directory)

Here the forward slash after the hdfs: protocol is missing (hdfs:/ instead of hdfs://). Any help will be appreciated.
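The collapsed slash in the error message is what java.io.File does to any string it is given: it treats the argument as a local pathname and normalizes duplicate separators, so an hdfs:// URL cannot survive a trip through the local file API. A minimal sketch (class name hypothetical) reproducing the effect:

```java
import java.io.File;

public class SlashDemo {
    public static void main(String[] args) {
        // java.io.File treats its argument as a local pathname and
        // collapses repeated separators, so "hdfs://" becomes "hdfs:/"
        File f = new File("hdfs://namenode:8020/tmp/some.conf");
        System.out.println(f.getPath()); // on Linux prints hdfs:/namenode:8020/tmp/some.conf
    }
}
```

This suggests the property file is being opened with the local file API rather than the Hadoop FileSystem API, which would explain why local mode fails the same way.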

Can you see this file using the hadoop utility? hadoop fs -ls /tmp/ - Nikita
Yes, the file is available, but in my opinion spark-submit is unable to read the HDFS file path. - Y0gesh Gupta
Do you have the environment variable HADOOP_CONF_DIR set? Type echo $HADOOP_CONF_DIR in the console to check. - Nikita
Hi, thanks for the reply, but yes, HADOOP_CONF_DIR is set. The issue is that spark-submit is collapsing the '//' in hdfs://namenode:8020/tmp/some.conf to hdfs:/namenode:8020/tmp/some.conf and so cannot reach the HDFS path. - Y0gesh Gupta

2 Answers


You have to set the HADOOP_CONF_DIR environment variable. It must point to the directory containing core-site.xml (it may be something like ../hadoop-2.6.0/etc/hadoop_dir), and core-site.xml must contain:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://yourHost:54310</value>
    </property>
</configuration>

Hope this will help!
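As a side note (not part of the original answer): once the path is handled as a URI rather than a local file, the double slash survives, because it introduces the authority component (host:port). A quick sketch with java.net.URI:

```java
import java.net.URI;

public class UriDemo {
    public static void main(String[] args) {
        URI u = URI.create("hdfs://namenode:8020/tmp/some.conf");
        // The "//" delimits the authority component, so it is preserved
        System.out.println(u.getScheme());    // hdfs
        System.out.println(u.getAuthority()); // namenode:8020
        System.out.println(u.getPath());      // /tmp/some.conf
    }
}
```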


Setting the spark-submit parameter as follows may fix the issue:

hdfs:///namenode:8020//tmp//some.conf