We are currently ingesting Kafka messages into HDFS using Spark Streaming. So far we have been spawning a separate Spark job for each topic.
Since messages are produced pretty rarely for some topics (on average 1 per day), we're thinking about organising the ingestion into pools.
The idea is to avoid creating a whole container (and the related resources) for these infrequent topics. Spark Streaming accepts a list of topics as input, so we're thinking about using this feature to have a single job consume all of them.
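Roughly, this is what we're imagining, assuming the spark-streaming-kafka-0-10 direct API (topic names, broker address, batch interval and the HDFS output path are just placeholders):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    object LowVolumeTopicsIngestion {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("low-volume-topics-ingestion")
        val ssc  = new StreamingContext(conf, Seconds(30))   // batch interval is just an example

        // All the "infrequent" topics handled by this single pooled job
        val topics = Seq("topic-a", "topic-b", "topic-c")    // placeholder names

        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "kafka:9092",              // placeholder broker list
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "low-volume-pool",
          "auto.offset.reset"  -> "latest",
          "enable.auto.commit" -> (false: java.lang.Boolean)
        )

        // A single direct stream subscribed to all topics at once
        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

        stream.foreachRDD { rdd =>
          // Records from every subscribed topic arrive in the same RDD; route by record.topic()
          rdd.map(r => (r.topic(), r.value()))
             .saveAsTextFile(s"hdfs:///ingest/pooled/${System.currentTimeMillis()}") // placeholder path
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }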
Do you think this is a good strategy? We also considered batch ingestion, but we'd like to keep the real-time behaviour, so we ruled that option out. Do you have any tips or suggestions?
Does Spark Streaming handle multiple topics as a source well in case of failures, e.g. in terms of offset consistency?
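For reference, we'd plan to manage offsets ourselves along these lines, reusing the `stream` from the sketch above (again only a sketch, relying on the kafka-0-10 API's HasOffsetRanges/CanCommitOffsets hooks):

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    stream.foreachRDD { rdd =>
      // Offset ranges cover every (topic, partition) present in this batch
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // ... write the batch to HDFS here ...

      // Commit back to Kafka only after the write succeeds, so a failure before
      // this point simply means the whole batch is re-read on restart
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

Is this kind of per-batch commit enough to keep each topic's offsets consistent when they all share one job?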
Thanks!