14
votes

I have a general question about Apache Spark:

We have some Spark Streaming scripts that consume Kafka messages. Problem: they fail randomly without any specific error...

Some scripts do nothing (even though they work when I run them manually), and one fails with this message:

ERROR SparkUI: Failed to bind SparkUI java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries!

So I'm wondering if there is a specific way to run the scripts in parallel?

They are all in the same jar and I run them with Supervisor. Spark is installed via Cloudera Manager 5.4 and runs on YARN.

Here is how I launch a script:

sudo -u spark spark-submit --class org.soprism.kafka.connector.reader.TwitterPostsMessageWriter /home/soprism/sparkmigration/data-migration-assembly-1.0.jar --master yarn-cluster --deploy-mode client

Thanks for your help!

Update: I changed the command and now run this (it stops with no specific error message):

root@ns6512097:~# sudo -u spark spark-submit --class org.soprism.kafka.connector.reader.TwitterPostsMessageWriter --master yarn --deploy-mode client /home/soprism/sparkmigration/data-migration-assembly-1.0.jar
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/jars/avro-tools-1.7.6-cdh5.4.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/09/28 16:14:21 INFO Remoting: Starting remoting
15/09/28 16:14:21 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:52748]
15/09/28 16:14:21 INFO Remoting: Remoting now listens on addresses: [akka.tcp://[email protected]:52748]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/jars/avro-tools-1.7.6-cdh5.4.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
3
You should not worry about the errors binding the SparkUI address, since Spark will automatically increment the port number that the SparkUI is bound to. Do you have any other clue? Maybe share the complete logs? – mehmetminanc
Unfortunately, I have no more clues; the logs are normal except for the error I pasted :( That's why I'm here asking whether what we've done is correct... and it seems to be the case? – Taoma_k
Well, one problem with your submit command is that --master ... and --deploy-mode come after the jar, so they will be ignored. Can you try sudo -u spark spark-submit --class org.soprism.kafka.connector.reader.TwitterPostsMessageWriter --master yarn-cluster --deploy-mode client /home/soprism/sparkmigration/data-migration-assembly-1.0.jar – mehmetminanc
I updated my post to add your suggestion and the result :) – Taoma_k
BTW, are you sure the Unix spark user has read access to the jar? It's located in another Unix user's home directory. And these are not nearly all the logs that should have been produced. – mehmetminanc

3 Answers

12
votes

This issue occurs if multiple users try to start a Spark session at the same time, or if an existing Spark session was not properly closed.

There are two ways to fix this issue.

  • Start a new Spark session on a different port, as follows:

    spark-submit --conf spark.ui.port=5051 <other arguments>
    spark-shell --conf spark.ui.port=5051
    
  • Find all Spark sessions using ports from 4040 to 4056 and kill those processes. The netstat and kill commands can be used to find the process occupying a port and to kill it, respectively. Here's the usage:

    sudo netstat -tunalp | grep LISTEN | grep 4040
    

The above command will produce output like the line below; the last column is the process ID, in this case 32028:

tcp        0      0 :::4040    :::*         LISTEN      32028/java

Once you find the process ID (PID), you can kill the Spark process (spark-shell or spark-submit) using the command below:

sudo kill -9 32028
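
For the setup described in the question (several streaming jobs from the same jar running in parallel), the first option could look like the sketch below; the port value is just an arbitrary free port, and each additional job would get its own value:

# sketch: give each parallel streaming job its own UI port (4050 is an arbitrary example)
sudo -u spark spark-submit --conf spark.ui.port=4050 \
  --class org.soprism.kafka.connector.reader.TwitterPostsMessageWriter \
  --master yarn --deploy-mode client \
  /home/soprism/sparkmigration/data-migration-assembly-1.0.jar
# repeat with a different spark.ui.port (4051, 4052, ...) for each job launched in parallel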
5
votes

You could also bump up the value set for spark.port.maxRetries.

As per the docs:

Maximum number of retries when binding to a port before giving up. When a port is given a specific value (non 0), each subsequent retry will increment the port used in the previous attempt by 1 before retrying. This essentially allows it to try a range of ports from the start port specified to port + maxRetries.
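
For example, the retry budget can be raised on the command line when submitting (a sketch; 50 is an arbitrary value, the default is 16):

# allow up to 50 port increments before giving up (example value; the default is 16)
spark-submit --conf spark.port.maxRetries=50 <other arguments>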

2
votes

The above answers are correct. However, we should not try to change the spark.port.maxRetries value, as it will increase load on the same server, which in turn will degrade cluster performance and can push the node into a deadlock situation. Load can be checked with the uptime command in your session.

The root cause of this issue is trying to run all Spark applications with --deploy-mode client.

If you have distributed capacity in your cluster, the best approach is to run with --deploy-mode cluster.

This way, the driver of each Spark application is launched on a (potentially) different node every time, mitigating the port-binding issues on a single node.
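
Applied to the command from the question, that would look roughly like this (a sketch only; depending on your Spark version you may need --master yarn-cluster instead, and in cluster mode the driver logs appear in the YARN application logs rather than in your terminal):

# driver runs inside YARN, so each submission can land on a different node
sudo -u spark spark-submit \
  --class org.soprism.kafka.connector.reader.TwitterPostsMessageWriter \
  --master yarn --deploy-mode cluster \
  /home/soprism/sparkmigration/data-migration-assembly-1.0.jar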