
I am trying to access HDFS files from Spark. Everything works fine when I run Spark in local mode, i.e.

SparkSession.master("local")

and access HDFS files via

hdfs://localhost:9000/$FILE_PATH
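
For context, here is a minimal, self-contained sketch of what this local-mode setup looks like in Java (the app name and FILE_PATH are placeholders, not my actual values):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LocalHdfsRead {
    public static void main(String[] args) {
        // Local mode: driver and executors run in a single JVM and share one classpath
        SparkSession spark = SparkSession.builder()
                .appName("local-hdfs-read")   // placeholder app name
                .master("local")
                .getOrCreate();

        // Read a text file straight from the HDFS instance running on localhost
        Dataset<Row> lines = spark.read().text("hdfs://localhost:9000/FILE_PATH");
        System.out.println("Line count: " + lines.count());

        spark.stop();
    }
}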

But when I try to run Spark in standalone cluster mode, i.e.

SparkSession.master("spark://$SPARK_MASTER_HOST:7077")

the following error is thrown:

 java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.fun$1 of type org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1

So far I have only run start-dfs.sh in Hadoop and have not really configured anything in Spark. Do I need to run Spark with the YARN cluster manager instead, so that Spark and Hadoop use the same cluster manager and can therefore access HDFS files?

I have tried configuring yarn-site.xml in Hadoop following the tutorialspoint guide https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm, and specified HADOOP_CONF_DIR in spark-env.sh, but it does not seem to work and the same error is thrown. Am I missing some other configuration?
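
For reference, the spark-env.sh entry looks roughly like this (the Hadoop installation path is a placeholder for wherever core-site.xml, hdfs-site.xml, and yarn-site.xml live):

export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop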

Thanks!

EDIT

The initial Hadoop version is 2.8.0 and the Spark version is 2.1.1 with Hadoop 2.7. I tried downloading hadoop-2.7.4, but the same error still occurs.

The question here suggests this is a Java syntax issue rather than a Spark/HDFS issue. I will try that approach and see if it solves the error here.

This doesn't look HDFS-related; it seems more like a Scala versioning issue. – Yuval Itzchakov
@YuvalItzchakov Thanks for such a quick response! I will double-check my Scala version. Just to clarify: do you mean a version mismatch between Spark and Scala, or between Spark's Scala and Hadoop's Scala? I downloaded spark-2.1.1-bin-hadoop2.7 and hadoop-2.8.0; should I try Hadoop 2.7.0 instead? – JWC ToT
I would go with Hadoop 2.7. Make sure the Scala version is 2.11 (which is what Spark is compiled with). – Yuval Itzchakov

1 Answer


Inspired by the post here, I solved the problem myself.

This map-reduce job depends on a Serializable class, so when running in Spark local mode, that class can be found and the job executes correctly.
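
For illustration only, such a dependency could be as small as a Function implementation like the hypothetical class below. org.apache.spark.api.java.function.Function extends java.io.Serializable, so instances of it are serialized and shipped to the executors, which then need the class itself on their classpath in order to deserialize them:

import org.apache.spark.api.java.function.Function;

// Hypothetical example of such a Serializable dependency: instances of this
// class are serialized as part of the job, so the executors must be able to
// load the class to deserialize them.
public class AddOne implements Function<Integer, Integer> {
    @Override
    public Integer call(Integer value) {
        return value + 1;
    }
}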

When running in Spark standalone cluster mode, it is best to submit the application through spark-submit rather than running it from an IDE. I packaged everything into a jar and submitted the jar with spark-submit, and it works like a charm!
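
A rough sketch of the submit step (the main class, jar path, and master host below are placeholders):

spark-submit \
  --class com.example.MyApp \
  --master spark://$SPARK_MASTER_HOST:7077 \
  target/my-app.jar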