How to spark-submit a job to YARN on another cluster?

Question

I have a Docker container with Spark installed, and I am trying to submit a job to YARN on another cluster using Marathon. The container exports YARN_CONF_DIR and HADOOP_CONF_DIR, and the YARN config file contains the correct IP address of the EMR master, but I am not sure where it is getting localhost from.

ENV YARN_CONF_DIR="/opt/yarn-site.xml"
ENV HADOOP_CONF_DIR="/opt/spark-2.2.0-bin-hadoop2.6"

yarn-site.xml:

<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>xx.xxx.x.xx</value>
</property>

Command:

  "cmd": "/opt/spark-2.2.0-bin-hadoop2.6/bin/spark-submit --verbose \\\n --name emr_external_mpv_streaming \\\n --deploy-mode client \\\n --master yarn\\\n --conf spark.executor.instances=4 \\\n --conf spark.executor.cores=1 \\\n --conf spark.executor.memory=1g \\\n --conf spark.driver.memory=1g \\\n --conf spark.cores.max=4 \\\n --conf spark.executorEnv.EXT_WH_HOST=$EXT_WH_HOST \\\n --conf spark.executorEnv.EXT_WH_PASSWORD=$EXT_WH_PASSWORD \\\n --conf spark.executorEnv.KAFKA_BROKER_LIST=$_KAFKA_BROKER_LIST \\\n --conf spark.executorEnv.SCHEMA_REGISTRY_URL=$SCHEMA_REGISTRY_URL \\\n --conf spark.executorEnv.AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \\\n --conf spark.executorEnv.AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \\\n --conf spark.executorEnv.STAGING_S3_BUCKET=$STAGING_S3_BUCKET \\\n --conf spark.executorEnv.KAFKA_GROUP_ID=$KAFKA_GROUP_ID \\\n --conf spark.executorEnv.MAX_RATE=$MAX_RATE \\\n --conf spark.executorEnv.KAFKA_MAX_POLL_MS=$KAFKA_MAX_POLL_MS \\\n --conf spark.executorEnv.KAFKA_MAX_POLL_RECORDS=$KAFKA_MAX_POLL_RECORDS \\\n --class com.ticketnetwork.edwstream.external.MapPageView \\\n /opt/edw-stream-external-mpv_2.11-2-SNAPSHOT.jar",

I also tried --deploy-mode cluster with --master yarn and got the same error.

Error:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/09/10 20:41:24 INFO SparkContext: Running Spark version 2.2.0
18/09/10 20:41:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/09/10 20:41:25 INFO SparkContext: Submitted application: edw-stream-ext-mpv-emr-prod
18/09/10 20:41:25 INFO SecurityManager: Changing view acls to: root
18/09/10 20:41:25 INFO SecurityManager: Changing modify acls to: root
18/09/10 20:41:25 INFO SecurityManager: Changing view acls groups to: 
18/09/10 20:41:25 INFO SecurityManager: Changing modify acls groups to: 
18/09/10 20:41:25 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
18/09/10 20:41:25 INFO Utils: Successfully started service 'sparkDriver' on port 35868.
18/09/10 20:41:25 INFO SparkEnv: Registering MapOutputTracker
18/09/10 20:41:25 INFO SparkEnv: Registering BlockManagerMaster
18/09/10 20:41:25 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/09/10 20:41:25 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/09/10 20:41:25 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-5526b967-2be9-44bf-a86f-79ef72f2ac0f
18/09/10 20:41:25 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
18/09/10 20:41:26 INFO SparkEnv: Registering OutputCommitCoordinator
18/09/10 20:41:26 INFO Utils: Successfully started service 'SparkUI' on port 4040.
18/09/10 20:41:26 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.150.4.45:4040
18/09/10 20:41:26 INFO SparkContext: Added JAR file:/opt/edw-stream-external-mpv_2.11-2-SNAPSHOT.jar at spark://10.150.4.45:35868/jars/edw-stream-external-mpv_2.11-2-SNAPSHOT.jar with timestamp 1536612086416
18/09/10 20:41:26 INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/09/10 20:41:27 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/09/10 20:41:28 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/09/10 20:41:29 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

If the output says /0.0.0.0:8032, then something in your XML also says that, so it needs to point at the correct IP/DNS address. Also, your HADOOP_CONF_DIR needs to be Spark's conf folder, not the base folder. - OneCricketeer

OneCricketeer · Accepted Answer · 2018-09-10T23:32:16

0.0.0.0 is the default value of the yarn.resourcemanager.hostname property, and 8032 is the default ResourceManager port.
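To see whether the file the client reads actually overrides that default, you can look the property up directly. This is only a rough sketch: get_hadoop_property is a hypothetical helper name, and it assumes each <name> and <value> sits on its own line with the value directly after the name, as in the snippet from the question. It is not a real XML parser.

```shell
#!/bin/sh
# Hypothetical helper: print the <value> that directly follows a given
# <name> line in a Hadoop-style XML config file.
# Assumption: <name> and <value> are each on their own line.
get_hadoop_property() {
    file="$1"
    name="$2"
    # -F matches the property name literally (dots are not wildcards);
    # -A1 also prints the line after the match, which should hold <value>.
    grep -F -A1 "<name>$name</name>" "$file" \
        | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# Example usage (the path is an assumption, not taken from the question):
#   get_hadoop_property /opt/hadoop-conf/yarn-site.xml yarn.resourcemanager.hostname
```

If this prints nothing for the file in your conf directory, or the file is not in a directory the client reads at all, the client would be expected to fall back to 0.0.0.0:8032.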

One reason you would get the defaults is that neither Hadoop environment variable is set correctly. Your HADOOP_CONF_DIR needs to be Spark's (or Hadoop's) conf folder, not the base folder of the Spark extraction. This directory must contain core-site.xml, yarn-site.xml, and hdfs-site.xml, plus hive-site.xml if you use HiveContext.

If yarn-site.xml is in that directory, you don't need YARN_CONF_DIR. If you do set it, it must point to an actual directory, not directly to the file.
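The directory requirement can be checked before calling spark-submit. Below is a minimal sketch, assuming a hypothetical helper named check_hadoop_conf_dir and a hypothetical directory /opt/hadoop-conf that holds the client configs copied from the cluster; neither comes from the question.

```shell
#!/bin/sh
# Hypothetical helper: verify that a path is usable as HADOOP_CONF_DIR.
# Returns 0 if the path is a directory containing the core client configs,
# non-zero otherwise (hive-site.xml is not checked, since it is optional).
check_hadoop_conf_dir() {
    conf_dir="$1"
    if [ ! -d "$conf_dir" ]; then
        echo "not a directory: $conf_dir" >&2
        return 1
    fi
    for f in core-site.xml yarn-site.xml hdfs-site.xml; do
        if [ ! -f "$conf_dir/$f" ]; then
            echo "missing $conf_dir/$f" >&2
            return 1
        fi
    done
    return 0
}

# Example usage (the path is an assumption, not taken from the question):
#   check_hadoop_conf_dir /opt/hadoop-conf && export HADOOP_CONF_DIR=/opt/hadoop-conf
```

With the values from the question, this check would reject both settings: /opt/yarn-site.xml is a file rather than a directory, and the Spark base folder would not normally contain the site XML files at its top level.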

Additionally, you'll probably need to set more than just one hostname. For example, a production-grade YARN cluster would have two ResourceManagers for fault tolerance. If Kerberos is enabled, keytabs and principals may also need to be set.

If you already have Mesos/Marathon, though, I'm not sure why you'd want to use YARN.
