
All of the Confluent documentation I was able to find focuses on Kafka Streams applications when it comes to exactly-once/transactions/idempotence.

However, the transaction APIs were introduced at the level of the "regular" Producer/Consumer, and all the explanations and diagrams are built around those clients.

I was wondering whether it's OK to use those APIs directly, without Kafka Streams. I understand Kafka's processing boundaries and the guarantees, and I'm OK with violating them: I don't need a 100% exactly-once guarantee, and an occasional duplicate is acceptable, for example when I read from or write to external systems.

The problem I'm facing is that I need to build an ETL pipeline for a Big Data project, and we get a lot of duplicates when the apps are restarted or relocated to different hosts automatically by Kubernetes. In general, some duplicates are not a problem (it's a pipeline for analytics, where duplicates are acceptable), but if the issue can be mitigated at least on the Kafka side, that would be great. Will using the transactional API guarantee exactly-once for Kafka at least (i.e., make sure that re-processing doesn't happen during reassignments, shut-downs, or scaling activities)?

Switching to Kafka Streams is not an option because we are quite late in the project.


1 Answer


Exactly-once semantics are achievable with regular producers and consumers as well; Kafka Streams itself is built on top of these same clients.

We can use an idempotent producer, and the transactional API on top of it, to achieve this.
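A minimal sketch of the idempotent-producer part (the broker address, topic name, and serializers below are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");          // broker de-duplicates producer retries
        props.put(ProducerConfig.ACKS_CONFIG, "all");                         // required when idempotence is enabled
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A send that is retried internally (e.g. after a transient network error)
            // will not be written twice, thanks to producer IDs and sequence numbers.
            producer.send(new ProducerRecord<>("output-topic", "key", "value"));
            producer.flush();
        }
    }
}
```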

When dealing with external systems, it is important to ensure that we don't produce the same message more than once ourselves via producer.send(). Idempotence applies to the internal retries done by the Kafka client, but it doesn't take care of duplicate calls to send() made by the application.
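For the Kafka-to-Kafka part of the pipeline (the re-processing duplicates on restarts/reassignments mentioned in the question), the transactional API can commit the consumed offsets atomically with the produced records, so a batch that is re-processed after a crash never becomes visible to read_committed consumers downstream. Here is a minimal sketch; the topic names, group id, transactional id and broker address are placeholders, and it assumes Kafka clients/brokers 2.5+ for the groupMetadata() overload:

```java
import java.time.Duration;
import java.util.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalCopyLoop {
    public static void main(String[] args) {
        Properties cp = new Properties();
        cp.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder broker
        cp.put(ConsumerConfig.GROUP_ID_CONFIG, "etl-group");                 // placeholder group id
        cp.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");           // offsets are committed inside the transaction
        cp.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");     // skip aborted/uncommitted data
        cp.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cp.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Properties pp = new Properties();
        pp.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        pp.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "etl-copy-1");        // stable per producer instance; implies idempotence
        pp.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        pp.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pp)) {
            consumer.subscribe(Collections.singletonList("input-topic"));    // placeholder topic
            producer.initTransactions();

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;

                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> rec : records) {
                        producer.send(new ProducerRecord<>("output-topic", rec.key(), rec.value()));
                        offsets.put(new TopicPartition(rec.topic(), rec.partition()),
                                    new OffsetAndMetadata(rec.offset() + 1));
                    }
                    // Consumed offsets and produced records commit (or abort) together.
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                } catch (KafkaException e) {
                    // The batch may be processed again after the abort, but none of its output
                    // becomes visible to read_committed consumers. Fatal errors such as
                    // ProducerFencedException require closing the producer instead.
                    producer.abortTransaction();
                }
            }
        }
    }
}
```

Because the offsets are committed as part of the transaction, a restart or partition reassignment either replays a batch whose output was aborted or skips a batch whose output was committed, which is exactly the re-processing scenario the question describes.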

When we produce messages that arrive from an external source, we also need to ensure that the source doesn't deliver the same message twice. For example, if the source is a database, read its WAL, maintain the last offset read from that WAL, and restart from that point. Debezium, for example, does exactly that; you may check whether it supports your data source.
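A minimal sketch of that last-read-offset idea, assuming a hypothetical local position file (in practice the position is often stored in Kafka itself or in the source system):

```java
import java.nio.file.*;

public class SourcePositionStore {
    // Hypothetical file that records how far into the source WAL we have read.
    private static final Path POSITION_FILE = Paths.get("/var/lib/etl/source.pos");

    // On startup, resume reading the source from this offset instead of from the beginning.
    static long loadLastOffset() throws Exception {
        if (!Files.exists(POSITION_FILE)) return 0L;
        return Long.parseLong(Files.readString(POSITION_FILE).trim());
    }

    // After a batch of source changes has been produced to Kafka, persist the new position.
    static void storeLastOffset(long offset) throws Exception {
        Path tmp = POSITION_FILE.resolveSibling("source.pos.tmp");
        Files.writeString(tmp, Long.toString(offset));
        // Atomic rename so a crash mid-write cannot corrupt the stored position.
        Files.move(tmp, POSITION_FILE, StandardCopyOption.ATOMIC_MOVE,
                   StandardCopyOption.REPLACE_EXISTING);
    }
}
```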