I am about to make a decision about using the Spark Streaming Kafka integration.
I have a Kafka topic (which I can break into several topics) queuing tens of thousands of messages per minute; my Spark Streaming application ingests the messages, applies transformations, and then updates a UI.
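For context, the ingestion side is essentially the minimal sketch below, assuming the direct stream API from spark-streaming-kafka-0-10; the broker address, group id, topic names, and batch interval are placeholders, not my real configuration:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.KafkaUtils

object IngestApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-ingest")
    val ssc  = new StreamingContext(conf, Seconds(10)) // batch interval: placeholder

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",          // placeholder broker address
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "ingest-group",          // placeholder consumer group
      "auto.offset.reset"  -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // A single direct stream can subscribe to one topic or to several at once
    val topics = Seq("events") // e.g. Seq("events-a", "events-b") for several topics
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

    stream
      .map(record => record.value)     // the transformations are applied here
      .foreachRDD { rdd =>
        // push the transformed batch to the UI layer (details omitted)
      }

    ssc.start()
    ssc.awaitTermination()
  }
}
```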
Given that all failures are handled and the data are replicated in Kafka, which option for implementing the Spark Streaming application gives the best possible performance and robustness:
- One Kafka topic and one Spark cluster.
- Several Kafka topics and several standalone Spark boxes (one machine running a standalone Spark cluster per topic).
- Several Kafka topics and one Spark cluster.
I am tempted to go with the second option, but I couldn't find anyone discussing such a solution.