In my spark-env.sh I have these settings:

SPARK_LOCAL_IP=127.0.0.1
SPARK_MASTER_HOST=127.0.0.1
SPARK_WORKER_INSTANCES=2
SPARK_WORKER_MEMORY=1000m
SPARK_WORKER_CORES=1

I start the master using start-master.sh and then start the workers using start-slave.sh spark://localhost:7077.
The master web UI comes up fine, but it shows only ONE worker. This is the log of the first worker (which is working fine):

Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/bin/java -cp /media/ahmedn1/Ahmedn12/spark/conf/:/media/ahmedn1/Ahmedn12/spark/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://localhost:7077

17/08/30 12:19:31 INFO Worker: Started daemon with process name: 28769@ahmedn1-Inspiron-5555
17/08/30 12:19:31 INFO SignalUtils: Registered signal handler for TERM
17/08/30 12:19:31 INFO SignalUtils: Registered signal handler for HUP
17/08/30 12:19:31 INFO SignalUtils: Registered signal handler for INT
17/08/30 12:19:33 INFO SecurityManager: Changing view acls to: ahmedn1
17/08/30 12:19:33 INFO SecurityManager: Changing modify acls to: ahmedn1
17/08/30 12:19:33 INFO SecurityManager: Changing view acls groups to:
17/08/30 12:19:33 INFO SecurityManager: Changing modify acls groups to:
17/08/30 12:19:33 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ahmedn1); groups with view permissions: Set(); users with modify permissions: Set(ahmedn1); groups with modify permissions: Set()
17/08/30 12:19:34 INFO Utils: Successfully started service 'sparkWorker' on port 43479.
17/08/30 12:19:35 INFO Worker: Starting Spark worker 127.0.0.1:43479 with 2 cores, 1000.0 MB RAM
17/08/30 12:19:35 INFO Worker: Running Spark version 2.2.0
17/08/30 12:19:35 INFO Worker: Spark home: /media/ahmedn1/Ahmedn12/spark
17/08/30 12:19:35 INFO ExternalShuffleService: Starting shuffle service on port 7337 (auth enabled = false)
17/08/30 12:19:35 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
17/08/30 12:19:35 INFO WorkerWebUI: Bound WorkerWebUI to 127.0.0.1, and started at http://127.0.0.1:8081
17/08/30 12:19:35 INFO Worker: Connecting to master localhost:7077...
17/08/30 12:19:36 INFO TransportClientFactory: Successfully created connection to localhost/127.0.0.1:7077 after 309 ms (0 ms spent in bootstraps)
17/08/30 12:19:37 INFO Worker: Successfully registered with master spark://127.0.0.1:7077

and this is the log of the second worker which apparently failed to start:

Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/bin/java -cp /media/ahmedn1/Ahmedn12/spark/conf/:/media/ahmedn1/Ahmedn12/spark/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8082 spark://localhost:7077

17/08/30 12:19:34 INFO Worker: Started daemon with process name: 28819@ahmedn1-Inspiron-5555
17/08/30 12:19:34 INFO SignalUtils: Registered signal handler for TERM
17/08/30 12:19:34 INFO SignalUtils: Registered signal handler for HUP
17/08/30 12:19:34 INFO SignalUtils: Registered signal handler for INT
17/08/30 12:19:36 INFO SecurityManager: Changing view acls to: ahmedn1
17/08/30 12:19:36 INFO SecurityManager: Changing modify acls to: ahmedn1
17/08/30 12:19:36 INFO SecurityManager: Changing view acls groups to:
17/08/30 12:19:36 INFO SecurityManager: Changing modify acls groups to:
17/08/30 12:19:36 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ahmedn1); groups with view permissions: Set(); users with modify permissions: Set(ahmedn1); groups with modify permissions: Set()
17/08/30 12:19:37 INFO Utils: Successfully started service 'sparkWorker' on port 46067.
17/08/30 12:19:38 INFO Worker: Starting Spark worker 127.0.0.1:46067 with 2 cores, 1000.0 MB RAM
17/08/30 12:19:38 INFO Worker: Running Spark version 2.2.0
17/08/30 12:19:38 INFO Worker: Spark home: /media/ahmedn1/Ahmedn12/spark
17/08/30 12:19:38 INFO ExternalShuffleService: Starting shuffle service on port 7337 (auth enabled = false)
17/08/30 12:19:38 ERROR Inbox: Ignoring error
java.net.BindException: Address already in use
    at sun.nio.ch.Net.bind0(Native Method)
    at sun.nio.ch.Net.bind(Net.java:433)
    at sun.nio.ch.Net.bind(Net.java:425)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
    at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:127)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:501)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
    at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:496)
    at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:481)
    at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
    at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:210)
    at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:353)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:748)

So the problem is in address binding, which is probably port-related. But isn't the worker supposed to pick a free port automatically?

1 Answer


So, I noticed that the exception happens while the second worker is trying to start the External Shuffle Service.
After digging into the source code, I found that only one external shuffle service can run per host:

// With external shuffle service enabled, if we request to launch multiple workers on one host,
// we can only successfully launch the first worker and the rest fails, because with the port
// bound, we may launch no more than one external shuffle service on each host.
// When this happens, we should give explicit reason of failure instead of fail silently. For
// more detail see SPARK-20989.
val externalShuffleServiceEnabled = conf.getBoolean("spark.shuffle.service.enabled", false)
val sparkWorkerInstances = scala.sys.env.getOrElse("SPARK_WORKER_INSTANCES", "1").toInt
require(externalShuffleServiceEnabled == false || sparkWorkerInstances <= 1,
  "Starting multiple workers on one host is failed because we may launch no more than one " +
    "external shuffle service on each host, please set spark.shuffle.service.enabled to " +
    "false or set SPARK_WORKER_INSTANCES to 1 to resolve the conflict.")
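The BindException itself is easy to reproduce outside Spark: the worker's own RPC port is chosen automatically (43479 and 46067 in the logs above), but the shuffle service uses the fixed port 7337, and a fixed port can only be bound once per host. A minimal sketch of that conflict, using only the JDK:

```scala
import java.net.{BindException, ServerSocket}

// A port that is already bound cannot be bound a second time on the same
// host. This is what the second worker's shuffle service hits on port 7337.
val first = new ServerSocket(0)      // port 0: let the OS pick a free port
val port  = first.getLocalPort

val conflict =
  try {
    new ServerSocket(port).close()   // same fixed port -> BindException
    false
  } catch {
    case _: BindException => true
  }

first.close()
println(s"second bind failed: $conflict")
```

Spark avoids this for the worker's RPC and web UI ports by retrying on the next port, but the shuffle service deliberately stays on one well-known port so executors can find it, hence the hard failure.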

So, in my case I should either run only one worker per host or turn off dynamic allocation and the external shuffle service by adding these lines to conf/spark-defaults.conf:

spark.dynamicAllocation.enabled false
spark.shuffle.service.enabled   false

When I did this, it solved the problem.
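Alternatively, following the error message's other suggestion, one could keep the shuffle service and consolidate into a single worker. A sketch of the equivalent spark-env.sh settings, giving that one worker the same total resources as the original two-worker setup:

```
SPARK_WORKER_INSTANCES=1
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2000m
```

With a single worker per host there is only one shuffle service to start, so port 7337 is bound exactly once.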