There is a Spark master installed on a host. Spark runs in standalone mode with workers on separate nodes. All of the Spark infrastructure runs without Docker. In addition, there is a Docker container for Airflow running on the Spark master host. The container is started like this:

 docker run -d --network host \
   -v /usr/share/javazi-1.8/:/usr/share/javazi-1.8 \
   -v /home/airflow/dags/:/usr/local/airflow/dags \
   -v /home/spark-2.3.3/:/home/spark-2.3.3 \
   -v /usr/local/hadoop/:/usr/local/hadoop \
   -v /usr/lib/jvm/java/:/usr/lib/jvm/java \
   -v /usr/local/opt/:/usr/local/opt \
   airflow

So the Spark distribution (and therefore spark-submit) is mounted as a volume, and the container uses the host network.
I am trying to submit my Spark job from the Docker container like this:

/home/spark-2.3.3/bin/spark-submit \
  --master=spark://spark-master.net:7077 \
  --class=com.mysparkjob.Main \
  --driver-memory=4G \
  --executor-cores=6 \
  --total-executor-cores=12 \
  --executor-memory=10G \
  /home/spark/my-job.jar

but execution hangs at these logs:

2020-07-06 20:34:21 INFO  SparkContext:54 - Running Spark version 2.3.3
2020-07-06 20:34:21 WARN  SparkConf:66 - In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
2020-07-06 20:34:21 INFO  SparkContext:54 - Submitted application: My app
2020-07-06 20:34:21 INFO  SecurityManager:54 - Changing view acls to: root
2020-07-06 20:34:21 INFO  SecurityManager:54 - Changing modify acls to: root
2020-07-06 20:34:21 INFO  SecurityManager:54 - Changing view acls groups to: 
2020-07-06 20:34:21 INFO  SecurityManager:54 - Changing modify acls groups to: 
2020-07-06 20:34:21 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2020-07-06 20:34:21 INFO  Utils:54 - Successfully started service 'sparkDriver' on port 46677.
2020-07-06 20:34:21 INFO  SparkEnv:54 - Registering MapOutputTracker
2020-07-06 20:34:21 INFO  SparkEnv:54 - Registering BlockManagerMaster
2020-07-06 20:34:21 INFO  BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2020-07-06 20:34:21 INFO  BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2020-07-06 20:34:21 INFO  DiskBlockManager:54 - Created local directory at /home/sparkdata/blockmgr-3b52d93a-149e-49a2-9664-ce19fc12e76e
2020-07-06 20:34:21 INFO  MemoryStore:54 - MemoryStore started with capacity 2004.6 MB
2020-07-06 20:34:21 INFO  SparkEnv:54 - Registering OutputCommitCoordinator
2020-07-06 20:34:21 INFO  log:192 - Logging initialized @83360ms
2020-07-06 20:34:21 INFO  Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
2020-07-06 20:34:21 INFO  Server:419 - Started @83405ms
2020-07-06 20:34:21 INFO  AbstractConnector:278 - Started ServerConnector@240a2619{HTTP/1.1,[http/1.1]}{my_ip:4040}
2020-07-06 20:34:21 INFO  Utils:54 - Successfully started service 'SparkUI' on port 4040.
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3bd08435{/jobs,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@65859b44{/jobs/json,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@d9f5fce{/jobs/job,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@45b7c97f{/jobs/job/json,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@c212536{/stages,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7b377a53{/stages/json,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1b0e031b{/stages/stage,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@25214797{/stages/stage/json,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4e5c8ef3{/stages/pool,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@60928a61{/stages/pool/json,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@27358a19{/storage,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@8077c97{/storage/json,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@22865072{/storage/rdd,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@563317c1{/storage/rdd/json,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5d5d3a5c{/environment,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6e0d16a4{/environment/json,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7e18ced7{/executors,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@305b43ca{/executors/json,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4601047{/executors/threadDump,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@25e8e59{/executors/threadDump/json,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3a0896b3{/static,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@635ff2a5{/,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@55adcf9e{/api,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@58601e7a{/jobs/job/kill,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@62735b13{/stages/stage/kill,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  SparkUI:54 - Bound SparkUI to my_ip, and started at http://my_ip:4040
2020-07-06 20:34:21 INFO  SparkContext:54 - Added JAR file:/home/spark/my-job.jar at spark://my_ip:46677/jars/my-job.jar with timestamp 1594067661464
2020-07-06 20:34:21 WARN  FairSchedulableBuilder:66 - Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
2020-07-06 20:34:21 INFO  FairSchedulableBuilder:54 - Created default pool: default, schedulingMode: FIFO, minShare: 0, weight: 1
2020-07-06 20:34:21 INFO  StandaloneAppClient$ClientEndpoint:54 - Connecting to master spark://spark-master.net:7077...
2020-07-06 20:34:21 INFO  TransportClientFactory:267 - Successfully created connection to spark-master.net/my_ip:7077 after 14 ms (0 ms spent in bootstraps)
2020-07-06 20:34:21 INFO  StandaloneSchedulerBackend:54 - Connected to Spark cluster with app ID app-20200706223421-1147
2020-07-06 20:34:21 INFO  Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33659.
2020-07-06 20:34:21 INFO  NettyBlockTransferService:54 - Server created on my_ip:33659
2020-07-06 20:34:21 INFO  BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2020-07-06 20:34:21 INFO  BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, my_ip, 33659, None)
2020-07-06 20:34:21 INFO  BlockManagerMasterEndpoint:54 - Registering block manager my_ip:33659 with 2004.6 MB RAM, BlockManagerId(driver, my_ip, 33659, None)
2020-07-06 20:34:21 INFO  BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, my_ip, 33659, None)
2020-07-06 20:34:21 INFO  BlockManager:54 - external shuffle service port = 8888
2020-07-06 20:34:21 INFO  BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, my_ip, 33659, None)
2020-07-06 20:34:21 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2bc16fe2{/metrics/json,null,AVAILABLE,@Spark}
2020-07-06 20:34:21 INFO  EventLoggingListener:54 - Logging events to hdfs://my_hdfs_ip:54310/sparkEventLogs/app-20200706223421-1147
2020-07-06 20:34:21 INFO  Utils:54 - Using initial executors = 0, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
2020-07-06 20:34:21 INFO  StandaloneSchedulerBackend:54 - SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0

The Spark job does not execute any further. It looks like some kind of network issue. Maybe the workers can't reach the Spark master when the job is submitted from a container? I would be glad to get any advice or help. Thanks.

1 Answer

Most probably the executors can't reach the driver running inside the container. You need to look at the spark.driver.host option and set it to an IP of the container that is visible from the outside; otherwise Spark inside the container will advertise an internal Docker network address. You also need to set spark.driver.bindAddress to an address local to the container, so that Spark is able to perform the bind.
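
For example (a sketch only, not tested against this setup: HOST_IP is a placeholder for the host's externally reachable address, and since the container uses --network host the bind address can simply be 0.0.0.0 or that same address):

 # HOST_IP is a placeholder for the host's externally reachable IP;
 # spark.driver.host is what gets advertised to the executors,
 # spark.driver.bindAddress is what the driver binds to inside the container.
 /home/spark-2.3.3/bin/spark-submit \
   --master=spark://spark-master.net:7077 \
   --conf spark.driver.host=HOST_IP \
   --conf spark.driver.bindAddress=0.0.0.0 \
   --class=com.mysparkjob.Main --driver-memory=4G --executor-cores=6 \
   --total-executor-cores=12 --executor-memory=10G /home/spark/my-job.jar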

From the documentation:

spark.driver.bindAddress - It also allows a different address from the local one to be advertised to executors or external systems. This is useful, for example, when running containers with bridged networking. For this to properly work, the different ports used by the driver (RPC, block manager and UI) need to be forwarded from the container's host.
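
As an illustration of the bridged-networking case the quote describes (a sketch under assumptions: the port numbers are arbitrary examples, and spark.driver.port, spark.blockManager.port and spark.ui.port are the standard options for pinning the driver's ports so they can be published):

 # start the container with the driver's ports published
 # ("..." stands for the volume mounts from the question; port values are examples)
 docker run -d -p 35000:35000 -p 36000:36000 -p 4040:4040 ... airflow

 # submit with fixed ports so the -p mappings above line up
 /home/spark-2.3.3/bin/spark-submit \
   --master=spark://spark-master.net:7077 \
   --conf spark.driver.host=HOST_IP \
   --conf spark.driver.bindAddress=0.0.0.0 \
   --conf spark.driver.port=35000 \
   --conf spark.blockManager.port=36000 \
   --conf spark.ui.port=4040 \
   --class=com.mysparkjob.Main /home/spark/my-job.jar

In the setup from the question, --network host should already make the driver's ports reachable from the workers, so the key settings are spark.driver.host and spark.driver.bindAddress rather than the port mappings.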