
I've installed a Spark cluster on 4 different machines. Each machine has 7.7 GB of memory and an 8-core i7 processor. I'm using PySpark and trying to load 5 numpy arrays (2.9 GB each) into the cluster; they are all parts of a bigger 14 GB numpy array that I generated on a separate machine. To make sure the cluster is working properly, I'm trying to run a simple count on the first RDD. I get the following warnings on execution (and eventually interrupt the job):

>>> import numpy as np
>>> gen1 = sc.parallelize(np.load('/home/hduser/gen1.npy'),512)
>>> gen1.count()
[Stage 0:>                                                        (0 + 0) / 512]
17/01/28 13:07:07 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
17/01/28 13:07:22 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
17/01/28 13:07:37 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
[Stage 0:>                                                        (0 + 0) / 512]
17/01/28 13:07:52 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
^C
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/python/pyspark/rdd.py", line 1008, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/opt/spark/python/pyspark/rdd.py", line 999, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/opt/spark/python/pyspark/rdd.py", line 873, in fold
    vals = self.mapPartitions(func).collect()
  File "/opt/spark/python/pyspark/rdd.py", line 776, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/opt/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 931, in __call__
  File "/opt/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 695, in send_command
  File "/opt/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 828, in send_command
  File "/home/hduser/anaconda2/lib/python2.7/socket.py", line 451, in readline
    data = self._sock.recv(self._rbufsize)
  File "/opt/spark/python/pyspark/context.py", line 223, in signal_handler
    raise KeyboardInterrupt()
KeyboardInterrupt

When I check my cluster UI, it shows 3 registered workers, but only 1 executor (the driver, associated with my master's IP). I'm assuming this is a configuration issue.
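In case it matters, this is roughly how I launch the shell; the explicit resource flags shown here are only illustrative values I could pass, not settings I know to be correct:

pyspark --master spark://lebron:7077 \
        --executor-memory 4g \
        --total-executor-cores 12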

My settings in spark-env.sh (master):

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_MASTER_IP=192.168.1.2

These settings are identical on each of the worker machines.
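(For completeness, I understand spark-env.sh can also limit what each worker offers to Spark; I have not set these, and the values below are only placeholders:

export SPARK_WORKER_CORES=8      # cores this worker offers to Spark (placeholder)
export SPARK_WORKER_MEMORY=6g    # memory this worker offers to Spark (placeholder)

)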

My settings in spark-defaults.conf (master):

spark.master    spark://lebron:7077
spark.serializer    org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled   true
spark.kryoserializer.buffer.max 128m

Each worker has only the spark.master and spark.serializer options set, with the same values as above.

I also need to figure out how to tune memory management: before this issue came up, I was getting Java heap space OOM exceptions left and right when I should have had plenty of memory. But maybe I'll save that for a different question.
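If it helps, these are the kinds of memory knobs I expect to have to experiment with for that; the values below are guesses, not anything I've tested:

# per-executor JVM heap (guess)
spark.executor.memory   4g
# fraction of the heap usable for execution and storage (the Spark 2.x default)
spark.memory.fraction   0.6
# memory per Python worker process before it spills to disk (guess)
spark.python.worker.memory   1g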

Please help!


1 Answer


If you can see the Spark workers in the web UI but they are not accepting jobs, there is a high chance that a firewall is blocking the communication between the driver and the workers.

You can run a test as described in my other answer: Apache Spark on Mesos: Initial job has not accepted any resources
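As a quick sketch of that kind of test (assuming the standalone default ports and ufw as the firewall; adjust hosts and ports to your setup):

# from a worker machine, check that the master port is reachable
nc -zv 192.168.1.2 7077

# executors also have to connect back to the driver, which listens on a
# random port unless pinned (e.g. via spark.driver.port); if a firewall is
# running on the driver machine, open the relevant ports there, for example:
sudo ufw allow 7077/tcp
sudo ufw allow 51000/tcp   # whatever spark.driver.port is set to (example value)

If nc reports the connection refused or timing out, the firewall (or the bind address) is the problem.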