0
votes

I have Avro-encoded messages on a single Kafka topic with a single partition. Each message is meant to be consumed by one specific consumer only. For example, with messages a1, a2, b1 and c1 on this topic and 3 consumers named A, B and C, each consumer would receive all the messages, but ultimately A should consume only a1 and a2, B only b1, and C only c1.

I want to know how this is typically solved when using Avro on Kafka:

  1. leave it to the consumers to deserialize the message, then use some application logic to decide whether to consume or drop the message
  2. use partitioning logic to route each message to a particular partition, then set up each consumer to listen to only that single partition
  3. set up another 3 topics and a tiny Kafka Streams application that does the filtering + routing from the main topic to these 3 specific topics
  4. use Kafka headers to inject an identifier for downstream consumers to filter on

It looks like each of these options has its pros and cons. I want to know if there is a convention that people follow, or if there are other ways of solving this.

2

2 Answers

0
votes

It depends...

If you only have a single-partition topic, the only option is to let each consumer read all the data and filter client-side for the data it is interested in. In this case, each consumer would need to use a different group.id to isolate the consumers from each other.
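The client-side filtering can be sketched like this (the record shape and the "key prefix identifies the target consumer" convention are assumptions for illustration; in a real application the records would come from `poll()` on a Kafka consumer):

```python
# Sketch of client-side filtering (Option 1). The record shape and the
# key-prefix convention are assumptions; real records would come from a
# Kafka consumer's poll() and would still need Avro deserialization.
def is_for_me(record, consumer_id):
    """Keep only records whose key starts with this consumer's id."""
    return record["key"].startswith(consumer_id.lower())

records = [{"key": "a1"}, {"key": "a2"}, {"key": "b1"}, {"key": "c1"}]

# Consumer A reads everything but processes only a1 and a2; B and C
# would run the same loop with their own consumer_id and group.id.
mine = [r for r in records if is_for_me(r, "A")]
```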

Option 2 is certainly possible, if you can control the input topic you are reading from. You might still want different group.ids for each consumer, as it seems the consumers represent different applications that should be isolated from each other. The question is still whether this is a good model, because the idea of partitions is to provide horizontal scale-out and data-parallel processing; if each application reads from only one partition, that does not align with this model. You also need to know, on both the producer side and the consumer side, which data goes into which partition to get the mapping right. Hence, it implies a "coordination" between producer and consumer, which seems undesirable.
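That coordination boils down to a deterministic mapping that the producer and every consumer must share. A minimal sketch (the key-prefix-to-partition table is an assumption; a real producer would pass the computed partition when sending):

```python
# Sketch of producer-side partitioning logic (Option 2). The mapping
# table is an assumption; both the producer and the consumers must
# agree on it, which is exactly the coordination cost described above.
PARTITION_FOR_PREFIX = {"a": 0, "b": 1, "c": 2}

def partition_for(key: str) -> int:
    """Pick the partition for a message based on its key prefix."""
    return PARTITION_FOR_PREFIX[key[0]]

# Consumer A would then be manually assigned partition 0 only,
# B partition 1, and C partition 2.
```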

Option 3 seems to indicate that you cannot control the input topic and thus want to branch the data into multiple topics? This is a good approach in general, as topics are a logical categorization of data. However, it would be even better to have 3 topics for the different data to begin with! If you cannot have 3 input topics from the start, Option 3 still provides a good conceptual setup; however, it won't provide much performance benefit, because the Kafka Streams application is required to read and write each record once. The saving is that each application then consumes from only one topic, so redundant reads are avoided -- if you had, let's say, 100 applications (each interested in only 1/100 of the data), you could cut the load significantly, from a 99x read overhead down to a 1x read and 1x write overhead. For your case you don't really cut down much, as you go from a 2x read overhead to a 1x read + 1x write overhead. Additionally, you need to manage the Kafka Streams application itself.
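The filtering + routing step amounts to one branch per target topic (in Kafka Streams this would be stream branching). A language-agnostic sketch of the routing decision, with topic names that are assumptions:

```python
# Sketch of the filter + route step (Option 3). The routing application
# reads each record from the main topic once and writes it once to a
# per-consumer topic. Topic names and key convention are assumptions.
TOPIC_FOR_PREFIX = {"a": "topic-a", "b": "topic-b", "c": "topic-c"}

def route(record):
    """Return the target topic for one record from the main topic."""
    return TOPIC_FOR_PREFIX[record["key"][0]]

# Consumer A then subscribes only to "topic-a", and so on.
```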

Option 4 seems to be orthogonal, because it answers the question of how the filtering works, and headers can be used with Option 1 and Option 3 to do the actual filtering/branching.
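One nice property of header-based filtering is that the consumer can skip a record without deserializing the Avro payload at all, since headers are plain key/bytes pairs. A sketch (the "target" header name is an assumption):

```python
# Sketch of header-based filtering (Option 4). Kafka headers are
# key/bytes pairs; the "target" header name is an assumption. Only
# records that pass this check need Avro deserialization.
def should_consume(headers, consumer_id):
    """Check whether a record is addressed to this consumer."""
    return dict(headers).get("target") == consumer_id.encode()
```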

-1
votes

The data in the topic is just bytes, Avro shouldn't matter.

Since you only have one partition, only one consumer of a group can be actively reading the data.

If you only want to process certain offsets, you must either seek to them manually, or skip over the unwanted messages in your poll loop and commit those offsets.
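The skip-in-the-poll-loop pattern looks roughly like this (the record list stands in for the batch returned by a Kafka consumer's `poll()`; `wanted_offsets` is an assumption for illustration):

```python
# Sketch of skipping unwanted offsets in a poll loop. The records list
# stands in for one poll() batch; wanted_offsets is an assumption.
def process_wanted(records, wanted_offsets):
    processed = []
    for rec in records:
        if rec["offset"] in wanted_offsets:
            processed.append(rec)
        # else: skip it, but still commit the offset afterwards so the
        # message is not re-delivered on the next restart
    return processed
```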