I am about to make a decision about using the Spark Streaming Kafka integration.
I have a Kafka topic (which I can break into several topics) queuing tens of thousands of messages per minute; my Spark Streaming application ingests the messages, applies transformations, and then updates a UI.
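For context, the ingestion side is essentially the minimal sketch below, assuming the direct stream API from spark-streaming-kafka-0-10; the broker address, group id, topic names, and batch interval are placeholders, not my real configuration:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.KafkaUtils

object IngestApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-ingest")
    val ssc  = new StreamingContext(conf, Seconds(10)) // batch interval: placeholder

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",          // placeholder broker address
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "ingest-group",          // placeholder consumer group
      "auto.offset.reset"  -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // A single direct stream can subscribe to one topic or to several at once
    val topics = Seq("events") // e.g. Seq("events-a", "events-b") for several topics
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

    stream
      .map(record => record.value)     // the transformations are applied here
      .foreachRDD { rdd =>
        // push the transformed batch to the UI layer (details omitted)
      }

    ssc.start()
    ssc.awaitTermination()
  }
}
```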
Given that all failures are handled and the data are replicated in Kafka, which option for implementing the Spark Streaming application gives the best possible performance and robustness:
- One Kafka topic and one Spark cluster.
- Several Kafka topics and several standalone Spark boxes (one machine running a standalone Spark cluster per topic).
- Several Kafka topics and one Spark cluster.
I am tempted to go with the second option, but I couldn't find anyone discussing such a solution.