2 votes

I am trying to launch two separate PySpark applications at the same time from the driver machine, so both applications are running in the same JVM. Although each one creates its own SparkContext object, one of the jobs fails with "Failed to get broadcast_1".

    16/12/06 08:18:40 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
    16/12/06 08:18:55 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
    16/12/06 08:18:59 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (172.26.7.195:44690) with ID 52
    16/12/06 08:18:59 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 172.26.7.195, partition 0, ANY, 7307 bytes)
    16/12/06 08:18:59 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 0 on executor id: 52 hostname: 172.26.7.195.
    16/12/06 08:19:00 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (172.26.7.192:38343) with ID 53
    16/12/06 08:19:02 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.26.7.195): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
            at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1260)
            at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:174)
            at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:65)
            at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:65)
            at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:89)
            at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
            at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:67)
            at org.apache.spark.scheduler.Task.run(Task.scala:85)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:745)

I searched a lot on Google and on Stack Overflow, and found that running multiple SparkContext objects in the same JVM is not recommended, and for Python it is not supported at all.

My queries are:

  1. In my application, I need to run multiple PySpark applications at the same time on a schedule. Is there any way to run multiple PySpark applications at once from the Spark driver machine, with each one creating its own separate SparkContext object?

  2. If the answer to the first question is no, then can I run, for example, one application from the driver machine and another from an executor machine, both at the same time?

  3. Finally, is there any better suggestion in terms of configuration or best practices for running Spark applications in parallel on the same Spark cluster?

My Setup:

  1. Hadoop: 2.7.1
  2. Spark: 2.0.0
  3. Python: 3.4
  4. MongoDB: 3.2.10

Configuration:

  - VM-1: Hadoop primary node, Spark driver & executor, MongoDB
  - VM-2: Hadoop data node, Spark executor

The PySpark applications are launched from normal crontab entries on VM-1.


3 Answers

1 vote

Do you mean two Spark applications, or one Spark application with two Spark contexts? Two Spark applications, each with its own driver and SparkContext, should be achievable, unless your requirement forces them to share something.

When you have two Spark applications, they behave like any other pair of applications: the cluster's resources have to be shared between them.
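
As a rough sketch of that setup (the script name, app name, and master URL below are placeholders, not from the question): each application is its own script submitted as its own OS process, for example with spark-submit or from separate cron entries, so each gets its own driver JVM and its own SparkContext, and the two never share broadcast blocks.

    # app_a.py -- hypothetical example; submit it as its own process, e.g.:
    #   spark-submit --master spark://<master-host>:7077 app_a.py
    # A second, independent script submitted the same way gets a separate
    # driver JVM and a separate SparkContext.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("app_a")
    sc = SparkContext(conf=conf)

    # Trivial job just to confirm this context works on its own.
    print(sc.parallelize(range(1000)).sum())

    sc.stop()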

2 votes

I was also trying to do something similar and got a block manager registration error. I was trying to launch two different PySpark shells from the same node; after a lot of searching I realized that both PySpark shells were probably using the same driver JVM, and since one shell occupied the BlockManager, the other started throwing exceptions.

So I decided to use another approach: launch the driver programs from different nodes and link both programs to the same master using

pyspark --master <spark-master url> --total-executor-cores <num of cores to use>  

Now I am no longer getting the error.
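
For reference, in standalone mode the --total-executor-cores flag corresponds to the spark.cores.max property, so roughly the same cap can also be set from inside the program. A minimal sketch, where the app name, master URL, and core count are placeholders:

    from pyspark import SparkConf, SparkContext

    # Roughly equivalent to "pyspark --total-executor-cores 4" in standalone
    # mode: spark.cores.max caps how many cores this application may take,
    # leaving the remaining cores free for a second application.
    conf = (SparkConf()
            .setAppName("capped_app")          # placeholder app name
            .setMaster("spark://master:7077")  # placeholder master URL
            .set("spark.cores.max", "4"))

    sc = SparkContext(conf=conf)
    # ... run jobs as usual ...
    sc.stop()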

Hope this helps, and do tell if you find a reason or a solution for launching more than one Spark shell from the same driver.

0 votes

"WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources"

The driver is allocated resources in order to run, and the remaining resources are less than those specified for your application executors.

E.g.:

  - Node: 4 cores x 16 GB RAM
  - Driver configuration: Spark driver cores = 1, Spark driver memory = 8 GB
  - Executor configuration: Spark executor cores = 4, Spark executor memory = 10 GB

This will result in the error above.

The driver resources + executor resources cannot exceed the limit of the node (as determined by either the physical hardware or the spark-env settings).

In the above example:
  - Driver configured to use 1 CPU core / 8 GB RAM
  - The executor configuration therefore cannot exceed 3 CPU cores / 8 GB RAM

Note that the total executor resources on a node will be (Spark executor cores / executor memory) x the number of executors running on that node.
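
To see the arithmetic above as configuration, a hedged sketch of a sizing that fits the example (the app name and exact values are illustrative; note that in client mode driver memory must actually be set before the JVM starts, e.g. via spark-submit --driver-memory or spark-defaults.conf):

    from pyspark import SparkConf, SparkContext

    # Node: 4 cores / 16 GB RAM. The driver takes 1 core / 8 GB, so the
    # executor request has to fit in the remaining 3 cores / ~8 GB
    # (kept slightly under 8 GB here to leave room for OS/JVM overhead).
    conf = (SparkConf()
            .setAppName("sized_app")            # placeholder name
            .set("spark.driver.cores", "1")
            .set("spark.driver.memory", "8g")   # in practice set via --driver-memory
            .set("spark.executor.cores", "3")
            .set("spark.executor.memory", "7g"))

    sc = SparkContext(conf=conf)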