2 votes

I am trying to launch two separate PySpark applications at the same time from the driver machine, so both applications are running in the same JVM. Although each one creates its own SparkContext object, one of the jobs fails with "Failed to get broadcast_1".

    16/12/06 08:18:40 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
    16/12/06 08:18:55 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
    16/12/06 08:18:59 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (172.26.7.195:44690) with ID 52
    16/12/06 08:18:59 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 172.26.7.195, partition 0, ANY, 7307 bytes)
    16/12/06 08:18:59 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 0 on executor id: 52 hostname: 172.26.7.195.
    16/12/06 08:19:00 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (172.26.7.192:38343) with ID 53
    16/12/06 08:19:02 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.26.7.195): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
            at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1260)
            at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:174)
            at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:65)
            at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:65)
            at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:89)
            at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
            at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:67)
            at org.apache.spark.scheduler.Task.run(Task.scala:85)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:745)

I searched a lot on Google and on Stack Overflow, and found that running multiple SparkContext objects in the same JVM is not recommended, and for Python it is not supported at all.

My queries are:

  1. In my application, I need to run multiple PySpark applications at the same time on a schedule. Is there any way to run multiple PySpark applications at once from the Spark driver machine, with each one creating its own separate SparkContext object?

  2. If the answer to the first question is no, then can I run, for example, one application from the driver machine and another from an executor machine, both at the same time?

  3. Finally, is there any better suggestion in terms of configuration or best practices for running Spark applications in parallel on the same Spark cluster?

My Setup:

  1. Hadoop: 2.7.1
  2. Spark: 2.0.0
  3. Python: 3.4
  4. MongoDB: 3.2.10

Configuration:

  - VM-1: Hadoop primary node, Spark driver & executor, MongoDB
  - VM-2: Hadoop data node, Spark executor

The PySpark applications are launched from normal crontab entries on VM-1.


3 Answers

1 vote

Do you mean two Spark applications, or one Spark application with two Spark contexts? Two Spark applications, each with its own driver and SparkContext, should be achievable, unless your requirement forces them to share something.

When you have two Spark applications, they behave like any other pair of applications: the cluster's resources have to be shared between them.
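
As a rough sketch of that setup (the script name, app name, and master URL below are placeholders, not from the question): each application is its own script submitted as its own OS process, for example with spark-submit or from separate cron entries, so each gets its own driver JVM and its own SparkContext, and the two never share broadcast blocks.

    # app_a.py -- hypothetical example; submit it as its own process, e.g.:
    #   spark-submit --master spark://<master-host>:7077 app_a.py
    # A second, independent script submitted the same way gets a separate
    # driver JVM and a separate SparkContext.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("app_a")
    sc = SparkContext(conf=conf)

    # Trivial job just to confirm this context works on its own.
    print(sc.parallelize(range(1000)).sum())

    sc.stop()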

2 votes

I was also trying to do something similar and got a block manager registration error. I was trying to launch two different PySpark shells from the same node; after a lot of searching I realized that both PySpark shells were probably using the same driver JVM, and since one shell occupied the BlockManager, the other started throwing exceptions.

So I decided to use another approach: launch the driver programs from different nodes and link both programs to the same master using

pyspark --master <spark-master url> --total-executor-cores <num of cores to use>  

Now I am no longer getting the error.
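
For reference, in standalone mode the --total-executor-cores flag corresponds to the spark.cores.max property, so roughly the same cap can also be set from inside the program. A minimal sketch, where the app name, master URL, and core count are placeholders:

    from pyspark import SparkConf, SparkContext

    # Roughly equivalent to "pyspark --total-executor-cores 4" in standalone
    # mode: spark.cores.max caps how many cores this application may take,
    # leaving the remaining cores free for a second application.
    conf = (SparkConf()
            .setAppName("capped_app")          # placeholder app name
            .setMaster("spark://master:7077")  # placeholder master URL
            .set("spark.cores.max", "4"))

    sc = SparkContext(conf=conf)
    # ... run jobs as usual ...
    sc.stop()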

Hope this helps, and do tell if you find a reason or a solution for launching more than one Spark shell from the same driver.

0 votes

"WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources"

The driver is allocated resources in order to run, and the remaining resources are less than those specified for your application executors.

E.g.:

  - Node: 4 cores x 16 GB RAM
  - Driver configuration: Spark driver cores = 1, Spark driver memory = 8 GB
  - Executor configuration: Spark executor cores = 4, Spark executor memory = 10 GB

This will result in the error above.

The driver resources + executor resources cannot exceed the limit of the node (as determined by either the physical hardware or the spark-env settings).

In the above example:
  - Driver configured to use 1 CPU core / 8 GB RAM
  - The executor configuration therefore cannot exceed 3 CPU cores / 8 GB RAM

Note that the total executor resources on a node will be (Spark executor cores / executor memory) x the number of executors running on that node.
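
To see the arithmetic above as configuration, a hedged sketch of a sizing that fits the example (the app name and exact values are illustrative; note that in client mode driver memory must actually be set before the JVM starts, e.g. via spark-submit --driver-memory or spark-defaults.conf):

    from pyspark import SparkConf, SparkContext

    # Node: 4 cores / 16 GB RAM. The driver takes 1 core / 8 GB, so the
    # executor request has to fit in the remaining 3 cores / ~8 GB
    # (kept slightly under 8 GB here to leave room for OS/JVM overhead).
    conf = (SparkConf()
            .setAppName("sized_app")            # placeholder name
            .set("spark.driver.cores", "1")
            .set("spark.driver.memory", "8g")   # in practice set via --driver-memory
            .set("spark.executor.cores", "3")
            .set("spark.executor.memory", "7g"))

    sc = SparkContext(conf=conf)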