Kafka: topic compaction notification?

Question

I was given the following architecture that I'm trying to improve.

I receive a stream of DB changes which end up in a compacted topic. The stream is basically key/value pairs and the keyspace is large (~4 GB).

The topic is consumed by one kafka stream process that stores the data in RockDB (separate for each consumer/shard). The processor does two different things:

join the data into another stream.
check if a message from the topic is a new key or an update to an existing one. If it is an update it sends the old key/value and the new key/value pair to a different topic (updates are rare).

The construct has a couple of problems:

The two different functionalities of the stream processor belong to different teams and should not be part of the same code base. They are put together to save memory. If we separate it we would have to duplicate RockDB's.
I would prefer to use a normal KTable join instead of the handcrafted join that's currently in the code.
RockDB seems to be a bit of overkill if the data is already persisted in a topic. We currently running into some performance issues and I assume it would be faster if we just keep everything in memory.

Question 1: Is there a way to hook into the compaction process of a compacted topic? I would like a notification (to a different topic) for every key that is actually compacted (including the old and new value). If this is somehow possible I could easily split the code bases apart and simplify the join.

Question 2: Any other idea on how this can be solved more elegantly?

Matthias J. Sax Matthias J. Sax · Accepted Answer · 2018-04-05T18:47:27

You overall design makes sense.

About your join semantics: I guess you need to stick with Processor API as regular KTable cannot provide you want. It's also not possible to hook into the compaction process.

However, Kafka Streams also supports in-memory state stores: https://kafka.apache.org/documentation/streams/developer-guide/processor-api.html#state-stores

RocksDB is used by default, to allow the state to be larger than available main-memory. Spilling to disk with RocksDB to reliability -- however, it also has the advantage, that stores can be recreated quicker if an instance come back online on the same machine, as it's not required to re-read the whole changelog topic.

If you want to split the app into two, is your own decision on how much resources you want to provide.

Kafka: topic compaction notification?

1 Answers