I'm new to Kafka and would be grateful for any advice. We are updating a legacy application and moving it from IBM MQ to something different.
Application currently does the following:
- Reads batched XML messages (up to 5 MB each)
- Parses them into something meaningful
- Processes the data, manually parallelizing parts of each batch; this involves calls to an external legacy API that result in DB changes
- Sends several kinds of email notifications
- Sends a reply to another queue
- Archives input messages to disk
We are considering Kafka with Kafka Streams, because it would let us:
- Scale processing easily
- Have messages persistently stored out of the box
- Get partitioning, replication, and fault tolerance built in
- Move to schema-on-write via the Confluent Schema Registry
- Use the same infrastructure for service-to-service communication with other applications
But I have some concerns.
We are thinking about splitting those huge messages into logical parts and producing them to Kafka that way, since, as I understand it, Kafka is not a huge fan of big messages (the broker's default maximum message size is around 1 MB). Splitting would also let us parallelize processing on a per-partition basis.
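To make the splitting idea concrete, here is a minimal language-agnostic sketch (in Python, for illustration only; `Chunk` and `split_batch` are hypothetical names, not Kafka APIs). Each chunk carries a batch id, its sequence number, and the total chunk count, so a downstream step can later tell when a batch is complete; the Kafka message key can then be chosen per chunk to spread work across partitions, or per batch to keep a batch together:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    batch_id: str   # identifies the original 5 MB batch message
    seq: int        # position of this chunk within the batch
    total: int      # total number of chunks, so a consumer knows when the batch is complete
    payload: str    # one logical part of the original XML batch

def split_batch(batch_id: str, records: list[str]) -> list[Chunk]:
    # One chunk per logical record; the batch metadata travels with each chunk
    # so the pieces can be processed independently and re-aggregated later.
    total = len(records)
    return [Chunk(batch_id, i, total, rec) for i, rec in enumerate(records)]
```

For example, `split_batch("b1", parsed_records)` would yield one small message per record instead of a single 5 MB one.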
After that, we would use Kafka Streams for the actual processing and, further on, for aggregating the batch responses back together using a state store, as well as for pushing messages to other topics (e.g. for sending emails).
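The aggregation step I have in mind would look roughly like this (again a plain-Python sketch of the logic only; in Kafka Streams the `pending` dict would be a state store backing an aggregation, and `BatchAggregator` is a hypothetical name):

```python
class BatchAggregator:
    def __init__(self):
        # batch_id -> {seq: response}; plays the role of the Streams state store
        self.pending = {}

    def on_chunk_response(self, batch_id, seq, total, response):
        parts = self.pending.setdefault(batch_id, {})
        parts[seq] = response
        if len(parts) == total:
            # Every chunk of the batch has been answered: emit the combined
            # reply in original order and clear the state for this batch.
            del self.pending[batch_id]
            return [parts[i] for i in range(total)]
        return None  # still waiting for more chunks of this batch
```

Emitting `None` until the last chunk arrives mirrors how a Streams aggregation would only forward a final batch reply once the chunk count matches the total.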
But I wonder whether it is a good idea to do the actual processing in Kafka Streams at all, since it involves external API calls.
I'm also not sure of the best way to handle the case where this external API is down for any reason, which means temporary failure for the current message and all subsequent ones. Is there any way to stop Kafka Streams processing for some time? I can see there are pause and resume methods on the Consumer API; can they be used somehow in Streams?
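One pattern I've seen suggested for this situation is to retry the external call with backoff inside the processing step itself: while the call blocks, that stream task stops making progress, so later records simply wait, which acts as an effective "pause" without touching the consumer's pause()/resume(). A minimal sketch (pure Python; `call_api` stands in for the legacy API call and is hypothetical):

```python
import time

def call_with_backoff(call_api, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    # Retry the external call with exponential backoff: 0.5s, 1s, 2s, ...
    # Blocking here stalls the current stream task, so subsequent messages
    # on the same partition are not consumed while the API is down.
    for attempt in range(max_attempts):
        try:
            return call_api()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up; the app can then dead-letter or shut down
            sleep(base_delay * (2 ** attempt))
```

The `sleep` parameter is injectable only to make the sketch easy to test; in production you'd use the default. After `max_attempts` the failure is re-raised so the application can decide between a dead-letter topic and stopping entirely.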
Would it be better to use a regular Kafka consumer here, possibly adding Streams as a next step to merge those batch messages back together? That sounds like an overcomplication.
Is Kafka a good tool for these purposes at all?