
I am using Java 8 and Spark 2.4.1 to write my Spark job, in which I am using Typesafe Config to load a property file, i.e. application.properties, located in the "resources" folder, whose contents are as below:

dev.deploymentMaster=local[8]
dev.spark.eventLog.enabled=true
dev.spark.dynamicAllocation.enabled=false
dev.spark.executor.memory=8g

In the program I load it as below, passing the "environment" variable as "dev" while submitting the Spark job via spark-submit:

    public static Config loadEnvProperties(String environment) {
        Config appConf = ConfigFactory.load();      // loads application.properties from the classpath
        return appConf.getConfig(environment);      // e.g. "dev" -> all keys under the dev.* prefix
    }
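
For reference, the returned Config is then consumed along these lines (a minimal sketch; the SparkConf wiring below is an assumption for illustration, not part of the question):

    // Minimal sketch, assuming org.apache.spark.SparkConf and the method above.
    // getConfig("dev") strips the dev. prefix, so keys are addressed without it.
    Config conf = loadEnvProperties("dev");

    SparkConf sparkConf = new SparkConf()
            .setMaster(conf.getString("deploymentMaster"))                        // dev.deploymentMaster
            .set("spark.eventLog.enabled", conf.getString("spark.eventLog.enabled"))
            .set("spark.executor.memory", conf.getString("spark.executor.memory"));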

The above works fine, but this "application.properties" file is inside the "resources" folder.

How can I pass the "application.properties" file path while submitting the job via spark-submit? What changes do I need to make in my code using Typesafe Config? Can you please provide a sample, if possible in Java?

In Spring Boot we have something called profiles, like application-dev.properties, application-qa.properties and application-prod.properties etc., to load the properties for a specific environment. Is something like that possible in Spark while submitting the job? If so, can you please provide some details or a snippet on how to achieve it?


1 Answer


but this "application.properties" file is inside the "resources" folder.

How can I pass the "application.properties" file path while submitting from spark-submit job ?

1) Prepare a distribution structure with the Maven Assembly or Shade plugin (or an sbt distribution): bin for shell scripts, lib for libraries or the uber jar, and conf for all configuration files such as application.properties or application.conf.

Example distribution structure:

.
└── yourproject
    ├── bin   // all shell scripts and spark-submit wrappers
    ├── conf  // your property files, one per environment
    │   ├── application.conf
    │   └── log4j.properties
    └── lib   // your jars or the uber jar

2) Prepare a shell script which accepts the environment parameter (referred to as $env below) and builds the spark-submit command as follows.

Deploy-mode cluster:

spark-submit --master yarn --deploy-mode cluster \
  --num-executors 4 --driver-memory 6g --executor-memory 20g --executor-cores 4 \
  --files conf/application_$env.conf \
  --class yourclass lib/yourjar.jar

Deploy-mode client:

spark-submit --master yarn --deploy-mode client \
  --num-executors 4 --driver-memory 6g --executor-memory 20g --executor-cores 4 \
  --files conf/application_$env.conf \
  --conf "spark.driver.extraJavaOptions=-Dconfig.file=conf/application_$env.conf" \
  --conf "spark.executor.extraJavaOptions=-Dconfig.file=conf/application_$env.conf" \
  --class yourclass lib/yourjar.jar

Your ConfigFactory.load() will pick up the file from this -Dconfig.file=conf/application_$env.conf, but in deploy-mode cluster it may not load from this system property, since your driver is not the local machine; it is one of the nodes in your cluster.

--files will ship your file to a temporary directory under HDFS, and with --files you can then refer to the file by name only, without any path. You may need to use ConfigFactory.parseFile(configFile) instead of ConfigFactory.load() in cluster mode, since in my case I observed load() picking up /etc/spark/conf/spark-defaults.conf instead.
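
A minimal sketch of the loader adjusted for both modes (the file-existence fallback and the exact file name are assumptions for illustration, not something the spark-submit commands above mandate):

    import java.io.File;
    import com.typesafe.config.Config;
    import com.typesafe.config.ConfigFactory;

    public static Config loadEnvProperties(String environment) {
        // In cluster mode --files places application_<env>.conf in the working directory,
        // so it is referenced by file name only, without any path.
        File configFile = new File("application_" + environment + ".conf");
        Config appConf = configFile.exists()
                ? ConfigFactory.parseFile(configFile).resolve()   // cluster mode: parse the shipped file
                : ConfigFactory.load();                           // local/client mode: classpath or -Dconfig.file
        // keeps the question's <env>.* prefix layout; drop getConfig() if the per-env file is flat
        return appConf.getConfig(environment);
    }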