
I am having trouble getting my program to run on my Spark cluster. I set the cluster up with 1 master and 4 slaves. I started the master, then the slaves, and they show up in the master's web UI.

I then run a small Python script to check whether jobs can be executed:

from pyspark import SparkConf
from pyspark.sql import SparkSession

if __name__ == "__main__":

    appName = 'SparkExample'
    masterUrl = 'spark://10.0.2.55:7077'

    conf = SparkConf()
    conf.setAppName(appName)
    conf.setMaster(masterUrl)

    # driver and executor resource limits
    conf.set("spark.driver.cores", "1")
    conf.set("spark.driver.memory", "1g")
    conf.set("spark.executor.cores", "1")
    conf.set("spark.executor.memory", "4g")
    conf.set("spark.python.worker.memory", "256m")

    # cap the total number of cores the application may claim on the cluster
    conf.set("spark.cores.max", "4")

    # dynamic allocation; in standalone mode this also requires the
    # external shuffle service to be running on each worker
    conf.set("spark.shuffle.service.enabled", "true")
    conf.set("spark.dynamicAllocation.enabled", "true")
    conf.set("spark.dynamicAllocation.maxExecutors", "1")

    # print the effective configuration for debugging
    for k, v in conf.getAll():
        print(k + ":" + v)
    
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    #spark = SparkSession.builder.master(masterUrl).appName(appName).config("spark.executor.memory","1g").getOrCreate()

    # trivial test job: build a tiny DataFrame and collect it back to the driver
    l = [('Alice', 1)]
    spark.createDataFrame(l).collect()
    spark.createDataFrame(l, ['name', 'age']).collect()


    print("#############")
    print("Test finished")
    print("#############")

But as soon as something has to be sent back to the driver (the first collect() call, spark.createDataFrame(l).collect()), Spark seems to hang. After a while, I see the message:

"WARN TaskSchedulerImpl: Initial job has not accepted any resources: check your cluster UI to ensure that workers are registered and have sufficient resources"

So I check the cluster UI:

Worker Id                              Address          State  Cores        Memory
worker-20171027105227-xx.x.x.x6-35309  10.0.2.56:35309  ALIVE  4 (0 Used)   6.8 GB (0.0 B Used)
worker-20171027110202-xx.x.x.x0-43433  10.0.2.10:43433  ALIVE  16 (1 Used)  30.4 GB (4.0 GB Used)
worker-20171027110746-xx.x.x.x5-45126  10.0.2.65:45126  ALIVE  8 (0 Used)   30.4 GB (0.0 B Used)
worker-20171027110939-xx.x.x.x4-42477  10.0.2.64:42477  ALIVE  16 (0 Used)  30.4 GB (0.0 B Used)

It looks like there are plenty of resources for the small job I created. I can also see the application actually running there. When I click on it, I see that it was launched on 5 executors, and all but one EXITED. When I open the log of one of the exited executors, I see the following error message:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/10/27 16:45:23 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 14443@CODA
17/10/27 16:45:23 INFO SignalUtils: Registered signal handler for TERM
17/10/27 16:45:23 INFO SignalUtils: Registered signal handler for HUP
17/10/27 16:45:23 INFO SignalUtils: Registered signal handler for INT
17/10/27 16:45:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/10/27 16:45:24 INFO SecurityManager: Changing view acls to: root,geissler
17/10/27 16:45:24 INFO SecurityManager: Changing modify acls to: root,geissler
17/10/27 16:45:24 INFO SecurityManager: Changing view acls groups to: 
17/10/27 16:45:24 INFO SecurityManager: Changing modify acls groups to: 
17/10/27 16:45:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root, geissler); groups with view permissions: Set(); users  with modify permissions: Set(root, geissler); groups with modify permissions: Set()
17/10/27 16:47:25 ERROR RpcOutboxMessage: Ask timeout before connecting successfully
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
	at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
	at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
	at scala.util.Try$.apply(Try.scala:192)
	at scala.util.Failure.recover(Try.scala:216)
	at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
	at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
	at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
	at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
	at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
	at scala.concurrent.Promise$class.complete(Promise.scala:55)
	at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
	at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
	at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
	at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
	at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
	at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
	at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
	at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
	at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
	at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
	at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
	at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
	at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:205)
	at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:239)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds
	... 8 more

To me this looks as if the slaves cannot deliver their results back to the master, but I don't know what to do at this point. The slaves are on the same network segment as the master, but on different virtual machines (not Docker containers). Is there a way to check whether they can reach the master server? Are there any configuration settings I overlooked when setting up the cluster?
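One way to at least verify basic TCP reachability (a minimal sketch, assuming nc is available on the slave machines; the host and port come from the master URL above) is to probe the master's RPC port from each slave:

nc -zv 10.0.2.55 7077

Note that in standalone mode the executors must also be able to connect back to the driver machine, so reachability is needed in both directions.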

Spark version: 2.1.2 (on master, nodes and pyspark)

1 Answer


The error here was that the Python script was executed locally, as a plain Python program. Always launch your Spark scripts through spark-submit; never just run them as a normal program. The same is true for Java Spark programs.
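A minimal sketch of such a launch (assuming the script above is saved as spark_example.py, which is a placeholder name; the master URL is the one from the question):

spark-submit \
  --master spark://10.0.2.55:7077 \
  spark_example.py

Since the script already sets the master in its SparkConf, the --master flag is redundant here (values set directly on the SparkConf take precedence over spark-submit flags), but passing it on the command line makes the intended cluster explicit.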