3 votes

Recently, in an interview, I was asked a question about Kafka Streams; more specifically, the interviewer wanted to know why/when you would use the Kafka Streams DSL over the plain Kafka Consumer API to read and process streams of messages. I could not provide a convincing answer and am wondering if others who have used these two styles of stream processing can share their thoughts/opinions. Thanks.

2
A simple reply is that Consumers cannot easily join different topics. – OneCricketeer

2 Answers

3 votes

As usual, it depends on the use case whether to use the Kafka Streams API or the plain KafkaProducer/KafkaConsumer. I would not dare to pick one over the other in general terms.

First of all, Kafka Streams is built on top of the Kafka Producer/Consumer clients, so everything that is possible with Kafka Streams is also possible with plain Consumers/Producers.

I would say the Kafka Streams API is less complex but also less flexible compared to the plain Consumers/Producers. Now, we could start long discussions on what "less" means.

When developing with the Kafka Streams API, you can jump directly into your business logic, applying methods like filter, map, join, or aggregate, because all the consuming and producing is abstracted away behind the scenes.
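
For illustration, here is a minimal sketch of what such a topology can look like. The topic names, the bootstrap address, and the city filter are made-up assumptions for the example, not something from the question:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class FilterOrdersApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "filter-orders-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");

        // Only business logic here; consuming, producing, threading and
        // offset management are handled by the library.
        orders.filter((orderId, orderJson) -> orderJson.contains("\"city\":\"Berlin\""))
              .mapValues(String::toUpperCase)
              .to("berlin-orders");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```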

When you are developing applications with plain Consumers/Producers, you need to think about how you build your clients at the level of subscribe, poll, send, flush, etc.
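
For comparison, a minimal sketch of that plumbing with the plain Consumer API (topic name, group id, and the println "processing" are again just placeholders); producing results would add a KafkaProducer with its own send/flush handling on top of this:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PlainConsumerLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-consumers");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));      // subscribe
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));            // poll
                for (ConsumerRecord<String, String> record : records) {
                    // your business logic, plus producing results, retries, etc.
                    System.out.println("processing " + record.key() + " -> " + record.value());
                }
                consumer.commitSync();                                    // offset management is on you
            }
        }
    }
}
```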

If you want even less complexity (but also less flexibility), ksqlDB is another option you can choose to build your Kafka applications.

2 votes

Here are some of the scenarios where you might prefer Kafka Streams over the core Producer / Consumer API:

  1. It allows you to build a complex processing pipeline with ease. So, let's assume (a contrived example) you have a topic containing customer orders, and you want to filter the orders based on delivery city and save them into a DB table for persistence and an Elasticsearch index for a quick search experience. In such a scenario, you'd consume the messages from the source topic, filter out the unnecessary orders by city using the Streams DSL filter function, write the filtered data to a separate Kafka topic (using KStream.to(), or toStream().to() for a KTable), and finally, using Kafka Connect, have the messages stored into the database table and Elasticsearch. You can do the same thing using the core Producer / Consumer API, but it would require much more code.

  2. In a data processing pipeline, you can do the consume-process-produce cycle in a single transaction. So, in the above example, Kafka will ensure exactly-once semantics and transactions from the source topic up to the DB and Elasticsearch. There won't be any duplicate messages introduced due to network glitches and retries. This feature is especially useful when you are computing aggregates such as the count of orders per individual product; in such scenarios, duplicates will always give you a wrong result (a config sketch follows this list).

  3. You can also enrich your incoming data with very low latency. Let's assume, in the above example, you want to enrich the order data with the customer email address from your stored customer data. In the absence of Kafka Streams, what would you do? You'd probably invoke a REST API for each incoming order over the network, which would definitely be an expensive operation impacting your throughput. In such a case, you might want to store the required customer data in a compacted Kafka topic and load it into the streaming application using a KTable or GlobalKTable. Now, all you need to do is a simple local lookup in the KTable for the customer email address (see the lookup sketch after this list). Note that the KTable data here will be stored in the embedded RocksDB that ships with Kafka Streams, and also, because the KTable is backed by a Kafka topic, your data in the streaming application will be continuously updated in real time. In other words, there won't be stale data. This is essentially an example of the materialized view pattern.

  4. Let's say you want to join two different streams of data. So, in the above example, you want to process only the orders that have successful payments, and the payment data comes through another Kafka topic. Now, it may happen that the payment gets delayed, or the payment event arrives before the order event. In such a case, you may want to do a one-hour windowed join, so that if the order and the corresponding payment events arrive within a one-hour window, the order will be allowed to proceed down the pipeline for further processing (see the windowed-join sketch after this list). As you can see, you need to store the state for the one-hour window, and that state will be stored in the RocksDB of Kafka Streams.
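
To make point 2 concrete, here is a minimal sketch of a Streams application that counts orders per product with exactly-once processing switched on. The topic names and the assumption that the record key is the product id are illustrative:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class OrderCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-count-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // One setting turns the whole consume-process-produce cycle into a transaction
        // (use StreamsConfig.EXACTLY_ONCE on releases older than Kafka 3.0):
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");   // key = product id (assumed)
        KTable<String, Long> countsPerProduct = orders.groupByKey().count();

        countsPerProduct.toStream()
                        .to("order-counts-per-product", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```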
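
For the lookup in point 3, a fragment of how the enrichment could look with a GlobalKTable. It assumes a StreamsBuilder named builder with String serdes configured as in the sketches above; the topic names and the extractCustomerId/addEmail helpers are hypothetical:

```java
KStream<String, String> orders = builder.stream("orders");                 // key = orderId (assumed)
GlobalKTable<String, String> customers = builder.globalTable("customers"); // compacted topic, key = customerId

// The join is a local lookup against the GlobalKTable's RocksDB store;
// no network call is made per order.
KStream<String, String> enriched = orders.join(
        customers,
        (orderId, orderJson) -> extractCustomerId(orderJson),              // hypothetical helper
        (orderJson, customerJson) -> addEmail(orderJson, customerJson));   // hypothetical helper

enriched.to("orders-with-email");
```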
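
For the one-hour windowed join in point 4, a fragment under the same assumptions (both streams keyed by order id, String serdes). JoinWindows.ofTimeDifferenceWithNoGrace is the API in recent Kafka versions; older releases use JoinWindows.of:

```java
KStream<String, String> orders   = builder.stream("orders");
KStream<String, String> payments = builder.stream("payments");

KStream<String, String> paidOrders = orders.join(
        payments,
        (orderJson, paymentJson) -> orderJson + "|" + paymentJson,          // combine the two events
        JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofHours(1)),       // 1-hour window, state in RocksDB
        StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()));

paidOrders.to("paid-orders");
```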