Spark on Mesos - running multiple Streaming jobs

Question

I have 2 spark streaming jobs that I want to run, as well as keeping some available resources for batch jobs and other operations.

I evaluated Spark Standalone cluster manager, but I realized that I would have to fix the resources for two jobs, which would leave almost no computing power to batch jobs.

I started evaluating Mesos, because it has "fine grained" execution model, where resources are shifted between Spark applications.

1) Does it mean that a single core can be shifted between 2 streaming applications?

2) Although I have spark & cassandra, in order to exploit data locality, do I need to have dedicated core on each of the slave machines to avoid shuffling?

3) Would you recommend running Streaming jobs in "fine grained" or "course grained" mode. I know that logical answer is course grained (in order to minimize the latency of streaming apps) but what when resource in total cluster are limited (cluster of 3 nodes, 4 cores each - there are 2 streaming applications to run and multiple time to time batch jobs)

4) In Mesos, when I run spark streaming job in cluster mode, will it occupy 1 core permanently (like Standalone cluster manager is doing), or will that core execute driver process and sometimes act as executor?

Thank you

Dean Wampler Dean Wampler · Accepted Answer · 2016-05-05T13:01:29

Fine grained mode is actually deprecated now. Even with it, each core is allocated to task until completion, but in Spark Streaming, each processing interval is a new job, so tasks only last as long the time it takes to process each interval's data. Hopefully that time is less than the interval time or your processing will back up, eventually running out of memory to store all those RDDs waiting for processing.

Note also that you'll need to have one core dedicated to each stream's Reader. Each will be pinned for the life of the stream! You'll need extra cores in case the stream ingestion needs to be restarted; Spark will try to use a different core. Plus you'll have a core tied up by your driver, if it's also running on the cluster (as opposed to on your laptop or something).

Still, Mesos is a good choice, because it will allocate the tasks to nodes that have capacity to run them. Your cluster sounds pretty small for what you're trying to do, unless the data streams are small themselves.

If you use the Datastax connector for Spark, it will try to keep input partitions local to the Spark tasks. However, I believe that connector assumes it will manage Spark itself, using Standalone mode. So, before you adopt Mesos, check to see if that's really all you need.

Spark on Mesos - running multiple Streaming jobs

1 Answers