
I am trying to submit the following job to my cluster, which runs Spark 3.0.0 and Mesos 1.9.

./bin/spark-submit \
        --name test2 \
        --master mesos://master:7077 \
        --deploy-mode cluster \
        --class org.apache.spark.examples.SparkPi \
        --conf spark.master.rest.enabled=true \
        ./examples/jars/spark-examples_2.12-3.0.0.jar 100

However, I have received the following error message.

I0916 21:26:23.155861 8587 fetcher.cpp:562] Fetcher Info: {"cache_directory":"/tmp/mesos/fetch/root","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"/spark-3.0.0-bin-SparkFHE/examples/jars/spark-examples_2.12-3.0.0.jar"}}],"sandbox_directory":"/var/lib/mesos/slaves/b61fd963-8537-48f0-9eb6-e26f3aa97265-S0/frameworks/92ca9c69-72c9-43d1-828e-ecc8bac62eff-0000/executors/driver-20200916212624-0041/runs/46a1e00e-0c01-47b5-82f5-a46ba5237321","stall_timeout":{"nanoseconds":60000000000},"user":"root"}
I0916 21:26:23.165118 8587 fetcher.cpp:459] Fetching URI '/spark-3.0.0-bin-SparkFHE/examples/jars/spark-examples_2.12-3.0.0.jar'
I0916 21:26:23.165141 8587 fetcher.cpp:290] Fetching '/spark-3.0.0-bin-SparkFHE/examples/jars/spark-examples_2.12-3.0.0.jar' directly into the sandbox directory
W0916 21:26:23.168915 8587 fetcher.cpp:332] Copying instead of extracting resource from URI with 'extract' flag, because it does not seem to be an archive: /spark-3.0.0-bin-SparkFHE/examples/jars/spark-examples_2.12-3.0.0.jar
I0916 21:26:23.168941 8587 fetcher.cpp:618] Fetched '/spark-3.0.0-bin-SparkFHE/examples/jars/spark-examples_2.12-3.0.0.jar' to '/var/lib/mesos/slaves/b61fd963-8537-48f0-9eb6-e26f3aa97265-S0/frameworks/92ca9c69-72c9-43d1-828e-ecc8bac62eff-0000/executors/driver-20200916212624-0041/runs/46a1e00e-0c01-47b5-82f5-a46ba5237321/spark-examples_2.12-3.0.0.jar'
I0916 21:26:23.168957 8587 fetcher.cpp:623] Successfully fetched all URIs into '/var/lib/mesos/slaves/b61fd963-8537-48f0-9eb6-e26f3aa97265-S0/frameworks/92ca9c69-72c9-43d1-828e-ecc8bac62eff-0000/executors/driver-20200916212624-0041/runs/46a1e00e-0c01-47b5-82f5-a46ba5237321'
I0916 21:26:23.374958 8598 exec.cpp:164] Version: 1.9.0
I0916 21:26:23.387948 8614 exec.cpp:237] Executor registered on agent b61fd963-8537-48f0-9eb6-e26f3aa97265-S0
I0916 21:26:23.390528 8604 executor.cpp:190] Received SUBSCRIBED event
I0916 21:26:23.391326 8604 executor.cpp:194] Subscribed executor on worker4
I0916 21:26:23.391512 8604 executor.cpp:190] Received LAUNCH event
I0916 21:26:23.392763 8604 executor.cpp:722] Starting task driver-20200916212624-0041
I0916 21:26:23.409191 8604 executor.cpp:738] Forked command at 8622
20/09/16 21:26:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/09/16 21:26:25 WARN DependencyUtils: Local jar /var/lib/mesos/slaves/b61fd963-8537-48f0-9eb6-e26f3aa97265-S0/frameworks/92ca9c69-72c9-43d1-828e-ecc8bac62eff-0000/executors/driver-20200916212624-0041/runs/46a1e00e-0c01-47b5-82f5-a46ba5237321/spark.driver.supervise=false does not exist, skipping.
Error: Failed to load class org.apache.spark.examples.SparkPi.
20/09/16 21:26:25 INFO ShutdownHookManager: Shutdown hook called
20/09/16 21:26:25 INFO ShutdownHookManager: Deleting directory /tmp/spark-0c04f617-9daf-4a4b-8efe-e7d48e1eb06f
I0916 21:26:25.802945 8601 executor.cpp:1039] Command exited with status 101 (pid: 8622)
I0916 21:26:26.809671 8619 process.cpp:935] Stopped the socket accept loop

In the error message above, I noticed that spark.driver.supervise=false is appended to the executor sandbox path when the jar file is being loaded.

20/09/16 21:26:25 WARN DependencyUtils: Local jar /var/lib/mesos/slaves/b61fd963-8537-48f0-9eb6-e26f3aa97265-S0/frameworks/92ca9c69-72c9-43d1-828e-ecc8bac62eff-0000/executors/driver-20200916212624-0041/runs/46a1e00e-0c01-47b5-82f5-a46ba5237321/spark.driver.supervise=false does not exist, skipping.

I think the failure to load the class is caused by this incorrect reference.

Any suggestions?

Looking into the debug message of spark-submit, I found the following.

Spark config:
(spark.jars,file:/spark-3.0.0-bin-SparkFHE/examples/jars/spark-examples_2.12-3.0.0.jar)
(spark.driver.supervise,false)
(spark.app.name,test2)
**(spark.submit.pyFiles,)**
(spark.master.rest.enabled,true)
(spark.submit.deployMode,cluster)
(spark.master,mesos://master:7077)
Classpath elements:

I noticed that (spark.submit.pyFiles,) is empty. I didn't plan to use Python, so I am not sure why this option is set at all.

Furthermore, I tried debugging the function def doSubmit(args: Array[String]) in SparkSubmit.scala.

I tried to print the args array.

for (arg <- args) { logWarning(s"doSubmit: $arg") }

Somehow --py-files is included in the arguments without any value.
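To see why a dangling --py-files is harmful, here is a minimal sketch (my own illustration, not Spark's actual parser) of how an option that always consumes the next token shifts the rest of the command line, so a later token gets mistaken for the application jar:

```scala
// Illustration only: a naive parser where every "--option" consumes the
// following token as its value, and the first non-option token is treated
// as the primary resource (the application jar).
object ArgShiftDemo {
  def primaryResource(args: List[String]): Option[String] = args match {
    // an option swallows the next token, whatever it is
    case opt :: _ :: rest if opt.startsWith("--") => primaryResource(rest)
    // first bare token is taken as the jar
    case resource :: _ => Some(resource)
    case Nil => None
  }

  def main(args: Array[String]): Unit = {
    val good = List("--class", "SparkPi", "app.jar", "100")
    val bad  = List("--class", "SparkPi", "--py-files", "app.jar", "100")
    println(primaryResource(good)) // Some(app.jar)
    println(primaryResource(bad))  // --py-files swallowed app.jar: Some(100)
  }
}
```

This mirrors the warning above: with an injected, valueless --py-files, the next argument is consumed in its place, which is how a token like spark.driver.supervise=false can end up being treated as the local jar path.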

1 Answer


It's a bug in 3.0.0 that is also present in 3.0.1.

https://issues.apache.org/jira/browse/SPARK-32675

It's supposed to be fixed in 3.1.0, which has an RC targeted for Jan 2021: https://spark.apache.org/versioning-policy.html

This issue happens because --py-files gets appended even if you don't explicitly pass it to spark-submit. I applied the PR to the 3.0.1 source, rebuilt the distribution, and that fixed the issue for me.

Here's what I did:

  1. Edit resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala

Replace line 538:

options ++= Seq("--py-files", formattedFiles)

With:

if (!formattedFiles.equals("")) {
  options ++= Seq("--py-files", formattedFiles)
}
  2. Edit resource-managers/mesos/src/test/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterManagerSuite.scala

Replace line 595:

 "--driver-cores 1.0 --driver-memory 1000M --class Main --py-files  " +

With:

"--driver-cores 1.0 --driver-memory 1000M --class Main " +
  3. Rebuild using: ./dev/make-distribution.sh --tgz -Pmesos
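The effect of the guard in step 1 can be sketched in isolation (the object and method names here are illustrative, not Spark's actual code): the flag is only appended when the formatted file list is non-empty.

```scala
// Illustrative sketch of the patched option building: --py-files is only
// appended when the formatted file list is non-empty.
object PyFilesGuardDemo {
  def buildOptions(formattedFiles: String): Seq[String] = {
    var options = Seq("--class", "Main")
    // the guard added by the patch: skip the flag for an empty value
    if (!formattedFiles.equals("")) {
      options ++= Seq("--py-files", formattedFiles)
    }
    options
  }

  def main(args: Array[String]): Unit = {
    println(buildOptions(""))          // List(--class, Main)
    println(buildOptions("a.py,b.py")) // List(--class, Main, --py-files, a.py,b.py)
  }
}
```

With the guard in place, a driver command built for a job with no Python files contains no --py-files token at all, so the argument shift described in the question can no longer happen.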