I'm new to Kafka and would be grateful for any advice. We are updating a legacy application and moving it from IBM MQ to something different.
Application currently does the following:
- Reads batched XML messages (up to 5 MB each)
- Parses them into something meaningful
- Processes the data, manually parallelizing parts of each batch; this involves calls to an external legacy API that result in DB changes
- Sends several kinds of email notifications
- Sends a reply to another queue
- Archives input messages to disk
We are considering Kafka with Kafka Streams, because it would let us:
- Scale processing easily
- Have messages persistently stored out of the box
- Get partitioning, replication, and fault tolerance built in
- Move to schema-on-write via the Confluent Schema Registry
- Use the same infrastructure for service-to-service communication with other applications
But I have some concerns.
We are thinking about splitting those huge messages into logical parts and producing them to Kafka that way, since, as I understand it, Kafka is not a huge fan of big messages (the broker's default maximum message size is around 1 MB). Splitting would also let us parallelize processing on a per-partition basis.
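To make the splitting idea concrete, here is a minimal language-agnostic sketch (in Python, for illustration only; `Chunk` and `split_batch` are hypothetical names, not Kafka APIs). Each chunk carries a batch id, its sequence number, and the total chunk count, so a downstream step can later tell when a batch is complete; the Kafka message key can then be chosen per chunk to spread work across partitions, or per batch to keep a batch together:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    batch_id: str   # identifies the original 5 MB batch message
    seq: int        # position of this chunk within the batch
    total: int      # total number of chunks, so a consumer knows when the batch is complete
    payload: str    # one logical part of the original XML batch

def split_batch(batch_id: str, records: list[str]) -> list[Chunk]:
    # One chunk per logical record; the batch metadata travels with each chunk
    # so the pieces can be processed independently and re-aggregated later.
    total = len(records)
    return [Chunk(batch_id, i, total, rec) for i, rec in enumerate(records)]
```

For example, `split_batch("b1", parsed_records)` would yield one small message per record instead of a single 5 MB one.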
After that, we would use Kafka Streams for the actual processing and, further on, for aggregating the batch responses back together using a state store, as well as for pushing messages to other topics (e.g. for sending emails).
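The aggregation step I have in mind would look roughly like this (again a plain-Python sketch of the logic only; in Kafka Streams the `pending` dict would be a state store backing an aggregation, and `BatchAggregator` is a hypothetical name):

```python
class BatchAggregator:
    def __init__(self):
        # batch_id -> {seq: response}; plays the role of the Streams state store
        self.pending = {}

    def on_chunk_response(self, batch_id, seq, total, response):
        parts = self.pending.setdefault(batch_id, {})
        parts[seq] = response
        if len(parts) == total:
            # Every chunk of the batch has been answered: emit the combined
            # reply in original order and clear the state for this batch.
            del self.pending[batch_id]
            return [parts[i] for i in range(total)]
        return None  # still waiting for more chunks of this batch
```

Emitting `None` until the last chunk arrives mirrors how a Streams aggregation would only forward a final batch reply once the chunk count matches the total.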
But I wonder whether it is a good idea to do the actual processing in Kafka Streams at all, since it involves external API calls.
I'm also not sure of the best way to handle the case where this external API is down for any reason, which means temporary failure for the current message and all subsequent ones. Is there any way to stop Kafka Streams processing for some time? I can see there are pause and resume methods on the Consumer API; can they be used somehow in Streams?
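One pattern I've seen suggested for this situation is to retry the external call with backoff inside the processing step itself: while the call blocks, that stream task stops making progress, so later records simply wait, which acts as an effective "pause" without touching the consumer's pause()/resume(). A minimal sketch (pure Python; `call_api` stands in for the legacy API call and is hypothetical):

```python
import time

def call_with_backoff(call_api, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    # Retry the external call with exponential backoff: 0.5s, 1s, 2s, ...
    # Blocking here stalls the current stream task, so subsequent messages
    # on the same partition are not consumed while the API is down.
    for attempt in range(max_attempts):
        try:
            return call_api()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up; the app can then dead-letter or shut down
            sleep(base_delay * (2 ** attempt))
```

The `sleep` parameter is injectable only to make the sketch easy to test; in production you'd use the default. After `max_attempts` the failure is re-raised so the application can decide between a dead-letter topic and stopping entirely.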
Would it be better to use a regular Kafka consumer here, possibly adding Streams as a next step to merge those batch messages back together? That sounds like an overcomplication.
Is Kafka a good tool for these purposes at all?