I have to design a Spark Streaming application for the use case below, and I am looking for the best possible approach.
I have an application that pushes data into 1000+ different Kafka topics, each with a different purpose. Spark Streaming will receive data from each topic and, after processing, write the result back to the corresponding output topic.
Ex.
Input Type 1 Topic --> Spark Streaming --> Output Type 1 Topic
Input Type 2 Topic --> Spark Streaming --> Output Type 2 Topic
Input Type 3 Topic --> Spark Streaming --> Output Type 3 Topic
.
.
.
Input Type N Topic --> Spark Streaming --> Output Type N Topic and so on.
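To make this concrete, here is a rough sketch of the single-application variant I am considering (Scala with spark-streaming-kafka-0-10; the broker address, consumer group, and the input-*/output-* topic naming convention are placeholders I made up for illustration):

```scala
import java.util.regex.Pattern

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.SubscribePattern
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.KafkaUtils

object MultiTopicRouter {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("multi-topic-router"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker:9092",               // placeholder
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "router-group",               // placeholder
      "auto.offset.reset"  -> "latest"
    )

    // One direct stream subscribed to every input topic via a pattern,
    // instead of 1000+ separate streaming applications.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      SubscribePattern[String, String](Pattern.compile("input-.*"), kafkaParams)
    )

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // Producer created per partition task; a pooled producer would be more efficient.
        val props = new java.util.Properties()
        props.put("bootstrap.servers", "broker:9092")
        props.put("key.serializer", classOf[StringSerializer].getName)
        props.put("value.serializer", classOf[StringSerializer].getName)
        val producer = new KafkaProducer[String, String](props)

        records.foreach { record =>
          // record.topic() says which input topic the message came from,
          // so the matching output topic can be derived from it.
          val outputTopic = record.topic().replaceFirst("^input-", "output-")
          val processed   = record.value() // actual processing logic would go here
          producer.send(new ProducerRecord[String, String](outputTopic, record.key(), processed))
        }
        producer.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Each ConsumerRecord carries the topic it came from, which is how I was thinking of deriving the matching output topic; please correct me if there is a better pattern.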
I need to answer the following questions:
- Is it a good idea to launch 1000+ Spark Streaming applications, one per topic? Or should I have one streaming application for all topics, since the processing logic is going to be the same?
- If I use one streaming context, how will I determine which RDD belongs to which Kafka topic, so that after processing I can write it back to its corresponding output topic?
- The client may add/delete topics in Kafka; how do I handle that dynamically in Spark Streaming?
- How do I restart the job automatically on failure?
Do you see any other issues here?
I would highly appreciate your responses.