
We are currently ingesting Kafka messages into HDFS using Spark Streaming. So far we have spawned a separate Spark job for each topic.

Since messages are produced rather rarely for some topics (on average one per day), we're thinking about organising the ingestion in pools.

The idea is to avoid creating a whole container (and the related resources) for these infrequent topics. Spark Streaming accepts a list of topics as input, so we're thinking about using this feature to have a single job consume all of them, roughly as sketched below.
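To make it concrete, this is roughly what we have in mind, a minimal sketch assuming the spark-streaming-kafka-0-10 integration; the broker address, topic names, group id and batch interval are all placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    object PooledIngest {
      def main(args: Array[String]): Unit = {
        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "broker1:9092",              // placeholder
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "low-volume-ingest-pool",    // placeholder
          "auto.offset.reset"  -> "earliest",
          "enable.auto.commit" -> (false: java.lang.Boolean)
        )

        // The pool: all the low-volume topics handled by this single job.
        val topics = Seq("topicA", "topicB", "topicC")

        val ssc = new StreamingContext(
          new SparkConf().setAppName("pooled-ingest"), Seconds(60))

        // One direct stream subscribed to the whole list of topics.
        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

        // Each record carries its topic name, so records can be routed
        // per topic downstream; print() is just a stand-in for the HDFS sink.
        stream.map(r => (r.topic, r.value)).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }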

Do you think this is a good strategy? We also thought about batch ingestion, but we'd like to keep the near-real-time behaviour, so we excluded that option. Do you have any tips or suggestions?

Does Spark Streaming handle multiple topics as a source well in case of failures, in terms of offset consistency and so on?

Thanks!

Personally, I would use a Kafka Connect cluster rather than tune Spark code - OneCricketeer
Good point. But we tend to exclude Kafka Connect for a couple of reasons: it seems only the Confluent implementation exists out there, and it handles Avro serialization only. Furthermore, our own implementation would give us full flexibility. Also, we would like to handle these integration jobs in our cluster with our own scheduler, without adding another technology to the stack. - user2274307
Suggest you follow 007's advice - thebluephantom
It's not exclusive to Confluent. It's entirely plugin-based. Just a thought, if you want your Spark cluster to have more resources open for other tasks - OneCricketeer
Thanks @cricket_007, but the Confluent connector is the only one I can find out there. I'm worried about the license and such. Do you know of a different connector? Could you link it, please? - user2274307

1 Answer


I think Spark should handle multiple topics fine, as it has supported this for a long time. And no, Kafka Connect is not a Confluent-only API: Confluent provides connectors for its platform, but the Connect framework is part of Apache Kafka itself, and the Apache Kafka documentation covers the Connect API.

It is a little more involved with the plain Apache version of Kafka, but you can use it.

https://kafka.apache.org/documentation/#connectapi
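As for your offset question: with the direct stream, each batch exposes the exact offset ranges it read across all subscribed topics, and you can commit them back to Kafka only after a successful write. A rough sketch, assuming a stream created with KafkaUtils.createDirectStream and enable.auto.commit set to false, as in your snippet:

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    stream.foreachRDD { rdd =>
      // Offset ranges cover every topic-partition seen in this batch.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // Write the batch to HDFS first (the path is a placeholder)...
      if (!rdd.isEmpty()) {
        rdd.map(_.value)
          .saveAsTextFile(s"hdfs:///ingest/batch-${System.currentTimeMillis}")
      }

      // ...then commit, so a failure before this point just replays the batch
      // (at-least-once semantics), whether you subscribe to one topic or many.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }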

Also, if you opt for multiple Kafka topics in a single Spark Streaming job, you may need to think about avoiding lots of small files on HDFS, since your message frequency is very low; see the sketch below.
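For example, skipping empty batches and coalescing each topic's output into a single file per batch already helps (the HDFS paths here are placeholders); with roughly one message a day you will likely still want a periodic compaction job on top:

    stream.foreachRDD { rdd =>
      // Skip batches with no data so no empty files are created.
      if (!rdd.isEmpty()) {
        val byTopic = rdd.map(r => (r.topic, r.value)).cache()
        byTopic.keys.distinct().collect().foreach { topic =>
          byTopic.filter(_._1 == topic)
            .values
            .coalesce(1) // one file per topic per batch, not one per partition
            .saveAsTextFile(s"hdfs:///ingest/$topic/batch-${System.currentTimeMillis}")
        }
        byTopic.unpersist()
      }
    }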