
I have a Cloudera CDH 5.2 Hadoop cluster with Apache Spark 1.5.0.

Can I run my app from IntelliJ IDEA or from my local PC, using the cluster's YARN, Spark, and HDFS?

Or should I upload the jar to the master node via FTP and run it with spark-submit?


1 Answer


Yes, you can run your job directly from the IDE if you follow these steps:

  1. Add the spark-yarn package to your project dependencies (it can be marked as provided)
  2. Add the directory with the Hadoop configuration (HADOOP_CONF_DIR) to the project classpath
  3. Copy the Spark assembly jar to HDFS (a sketch follows this list)
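
For step 3, a one-off hdfs dfs -put is enough, but here is a minimal Java sketch using Hadoop's FileSystem API; the class name and both paths are illustrative (adjust them to your CDH layout), not part of the original answer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadAssembly {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml/hdfs-site.xml from HADOOP_CONF_DIR on the classpath (step 2)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Equivalent to: hdfs dfs -put <local assembly jar> <hdfs path>
        fs.copyFromLocalFile(
            new Path("/opt/cloudera/parcels/CDH/lib/spark/lib/spark-assembly.jar"),
            new Path("/user/spark/share/lib/spark-assembly.jar"));
        fs.close();
    }
}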

Then configure the Spark context in your application with a config like this:

SparkConf sparkConfig = new SparkConf()
    .setMaster("yarn-client")                                    // driver runs locally, executors on YARN
    .set("spark.yarn.queue", "if_you_are_using_scheduler")       // only needed if you use a scheduler queue
    .set("spark.yarn.jar", "hdfs:///path/to/assembly/on/hdfs");  // the assembly jar from step 3

If your Hadoop deployment is secured (Kerberos), you also need to:

  • switch to a JRE with the JCE unlimited-strength policy files installed
  • add krb5.conf to the JVM parameters (-Djava.security.krb5.conf=/path/to/local/krb5.conf)
  • call kinit inside your environment (or log in from code, as sketched below)
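
For the Kerberos part, a minimal sketch of doing the login from code instead of the shell; UserGroupInformation is Hadoop's actual login API, but the principal and paths are placeholders:

import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLogin {
    public static void main(String[] args) throws Exception {
        // Same effect as passing -Djava.security.krb5.conf=... on the command line
        System.setProperty("java.security.krb5.conf", "/path/to/local/krb5.conf");

        // Programmatic alternative to running kinit before starting the app
        UserGroupInformation.loginUserFromKeytab(
            "user@EXAMPLE.COM", "/path/to/user.keytab");
    }
}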

I tested this solution some time ago on Spark 1.2.0 on CDH as well, but it should also work on 1.5. Remember that this approach makes your local machine the Spark driver, so be aware of firewall issues between the driver and the executors: your local machine must be reachable from the Hadoop nodes.
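
If the executors cannot connect back, one option is to pin the driver's address and port and open them in your firewall. spark.driver.host and spark.driver.port are standard Spark settings; the values below are placeholders you would add to the sparkConfig shown above:

sparkConfig
    .set("spark.driver.host", "192.0.2.10")  // an address of your PC that cluster nodes can reach
    .set("spark.driver.port", "40000");      // fixed port to allow through the firewall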