I am trying to launch two separate PySpark applications at the same time from the driver machine, so both applications end up running in the same JVM. Although each application creates its own SparkContext object, one of the jobs fails with "Failed to get broadcast_1":
16/12/06 08:18:40 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and
have sufficient resources
16/12/06 08:18:55 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and
have sufficient resources
16/12/06 08:18:59 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (172.26.7.195:44690) with ID 52
16/12/06 08:18:59 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 172.26.7.195, partition 0, ANY, 7307 bytes)
16/12/06 08:18:59 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 0 on executor id: 52 hostname: 172.26.7.195.
16/12/06 08:19:00 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (172.26.7.192:38343) with ID 53
16/12/06 08:19:02 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.26.7.195): java.io.IOException: org.apache.spark.SparkException: Failed
to get broadcast_1_piece0 of broadcast_1
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1260)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:174)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:65)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:65)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:89)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:67)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
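For context, each of the two applications creates its own SparkContext at startup, roughly like this (a minimal sketch, not the actual code; the app name, master URL, and job logic are placeholders):

```python
from pyspark import SparkConf, SparkContext

# Each application script builds its own context against the standalone master
# (the app name and master URL below are illustrative placeholders).
conf = SparkConf().setAppName("app-one").setMaster("spark://VM-1:7077")
sc = SparkContext(conf=conf)

# ... actual job logic (e.g. reading from MongoDB and processing RDDs) goes here ...

sc.stop()
```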
I searched Google and Stack Overflow extensively and found that running multiple SparkContext objects in the same JVM is not recommended, and that it is not supported at all for Python.
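As far as I can tell, the Python-side limitation looks like this: trying to create a second SparkContext inside the same Python process is rejected outright (illustrative snippet; the exact error message depends on the Spark version):

```python
from pyspark import SparkConf, SparkContext

sc1 = SparkContext(conf=SparkConf().setAppName("app-one"))

# Creating a second context in the same Python process fails with something like:
# ValueError: Cannot run multiple SparkContexts at once
sc2 = SparkContext(conf=SparkConf().setAppName("app-two"))
```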
My queries are:
1. In my application, I need to run multiple PySpark applications at the same time on a schedule. Is there any way to run multiple PySpark applications from the Spark driver at the same time, each with its own separate SparkContext object? (See the sketch after this list for what I have in mind.)
2. If the answer to the first question is no, can I run, for example, one application from the driver and another from an executor, so that they still run at the same time?
3. Finally, are there any other suggestions in terms of configuration or best practices for running Spark applications in parallel on the same cluster?
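To make the first question concrete, what I have in mind is something along these lines: launching each application as its own spark-submit process from the driver machine, so that each one gets its own driver and its own SparkContext (purely illustrative; the spark-submit path, master URL, and script names are placeholders):

```python
import subprocess

# Hypothetical launcher: start each application as a separate spark-submit
# process so each gets its own driver JVM and its own SparkContext.
# All paths, the master URL, and the script names below are placeholders.
apps = ["/home/user/jobs/app_one.py", "/home/user/jobs/app_two.py"]

procs = [
    subprocess.Popen([
        "/usr/local/spark/bin/spark-submit",
        "--master", "spark://VM-1:7077",
        app,
    ])
    for app in apps
]

# Wait for both applications to finish.
for p in procs:
    p.wait()
```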
My Setup:
- Hadoop version: hadoop-2.7.1
- Spark: 2.0.0
- Python: 3.4
- MongoDB: 3.2.10
Configuration:
- VM-1: Hadoop primary node, Spark driver & executor, MongoDB
- VM-2: Hadoop data node, Spark executor

The PySpark applications are launched from a normal crontab entry on VM-1 (roughly as sketched below).
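For what it's worth, the crontab entry on VM-1 looks roughly like the following (illustrative only; the schedule, paths, and the run_both_apps.py name are placeholders standing in for whatever launches the two applications):

```
# Single crontab entry on VM-1 that kicks off the script starting both
# PySpark applications (schedule and paths are placeholders).
0 * * * * /usr/bin/python3 /home/user/jobs/run_both_apps.py >> /home/user/logs/run_both_apps.log 2>&1
```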