2
votes

Could you please help me to understand this extract, from Kafka Streams documentation:

Within a stream task that may be processing multiple topic-partitions, if users configure the application to not wait for all partitions to contain some buffered data and pick from the partition with the smallest timestamp to process the next record, then later on when some records are fetched for other topic-partitions, their timestamps may be smaller than those processed records fetched from another topic-partition.

I don't get the logic: if you pick the partition with the smallest timestamp, why would you get smaller timestamps later?

1

1 Answers

3
votes

I think the important part is

if users configure the application to not wait for all partitions to contain some buffered data

Since we don't wait for all partitions to have some data the following scenario can occur with streams S1 and S2 (T1 < T2):

  1. I2 with timestamp T2 arrives on S2
  2. the stream reads I2 since there's no other data available
  3. I1 with timestamp T1 arrives on S1
  4. the stream reads I1, which has a smaller timestamp than I2, which was already processed

If we waited until all streams had data we would select the smallest item from among all available items. Since we don't there is the possibility of items with smaller timestamps being delayed.