
I have an Apache Beam 2.2.0 streaming (real-time) job running on the Google Dataflow platform. The job reads JSON events from Kafka, transforms them, and writes the results to BigQuery. The problem is that the Google Dataflow runner constantly closes the existing Kafka consumer and then creates a new one, logging the following message:

logger: "com.google.cloud.dataflow.worker.ReaderCache"
message: "Closing idle reader for S4-0000000000000014"
stage: "S4"
step: "Read from Kafka/KafkaIO.Read/KafkaIO.Read/Read(UnboundedKafkaSource)/DataflowRunner.StreamingUnboundedRead.ReadWithIds"

However, according to the monitoring of Kafka consumer offset lag, there are plenty of messages in Kafka left to consume.

The question is: why does Google Dataflow do this? How does it determine that a reader is idle? And how can this behaviour be prevented?


1 Answer


A Dataflow worker closes a reader such as the Kafka reader if it is idle (i.e. not used by the worker) for 1 minute. Here, 'idle' is not with respect to the Kafka servers or to whether messages are left to read from them; it means the Dataflow worker did not try to consume any messages through the reader for 1 minute. The most common reason is that the pipeline is busy doing other work in different stages. E.g. you might have quite a bit of processing after an aggregation, in which case Dataflow is busy working on that stage rather than reading from Kafka.
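For example, a purely hypothetical pipeline shape like the sketch below would show this pattern: the slow ParDo after Count.perElement() keeps the worker occupied, so the upstream Kafka reader goes unused for over a minute and gets closed as idle:

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class BusyDownstreamShape {
  // Hypothetical: the expensive stage after the aggregation keeps the worker
  // busy, so the Kafka reader stage sits unused and is closed after 1 minute.
  static PCollection<KV<String, Long>> busyShape(PCollection<String> events) {
    return events
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply(Count.<String>perElement())
        .apply("Expensive step", ParDo.of(new DoFn<KV<String, Long>, KV<String, Long>>() {
          @ProcessElement
          public void processElement(ProcessContext c) throws InterruptedException {
            Thread.sleep(100);  // stand-in for heavy per-record work
            c.output(c.element());
          }
        }));
  }
}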

This one-minute timeout is currently not configurable. Do you notice any issue as a result of the reader being closed? It will be reopened when the reader is used again.