We use Azure Event Hubs with the Kafka integration option. Our services are written in Java with Spring Boot and Spring Cloud Stream, and are deployed on Azure Kubernetes Service (AKS). We have enabled service endpoints for Azure Event Hubs on the cluster's virtual network.

Most of the time, everything works fine.

From time to time, however, the producers cannot publish to Kafka. We lose messages, and these are usually critical for overall data consistency.

When that happens, we see errors like the following in the logs (I've split them across multiple lines for readability):

First example from the logs:

2019-02-21 22:11:04.681 WARN 1 --- [ad | producer-2]
o.a.k.clients.producer.internals.Sender : [Producer clientId=producer-2]
Got error produce response with correlation id 6 on topic-partition _topic-name_-1,
retrying (4 attempts left). Error: NETWORK_EXCEPTION

Second example:

org.apache.kafka.common.errors.TimeoutException:
Expiring 1 record(s) for _topic-name_-1:
30096 ms has passed since batch creation plus linger time

The consumers also experience occasional connectivity issues:

2019-02-22 03:03:59.733 INFO 1 --- [container-0-C-1]
o.a.k.c.c.internals.AbstractCoordinator :
[Consumer clientId=consumer-6, groupId=my-super-service]
Group coordinator my-super-hub.servicebus.windows.net:9093
(id: 2147483647 rack: null) is unavailable or invalid, will attempt rediscovery

Has anyone had similar issues with Azure Event Hubs, and perhaps some ideas on what the problem might be?

Hi Nikolaos, were you able to figure this out? I'm currently observing similar issues. - Muton

Hi Muton, I have opened a support ticket with Azure and I've been told to increase the request.timeout.ms property. This didn't really help, but I'm experimenting with other properties like batch.size. I'm still monitoring the situation, but it seems that setting batch.size to zero is helping. Still not sure if it's really fixed. - Nikolaos Georgiou

It did not help. We have experimented with tweaking various settings but we still get these errors. I got a reply from Microsoft support that this is actually expected when the connection has been idle for a long time (that's our case): "I discussed this with my Subject Matter Experts team, and they said what you are experiencing Is expected when the connection is idle for a certain period, which is expected behavior in Event Hub." - Nikolaos Georgiou

1 Answer

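You will need to set the maximum connection idle time on the Kafka clients, so that they close idle connections themselves before Event Hubs drops them: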

connections.max.idle.ms
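
A minimal sketch of how that could look with plain Kafka client properties. The class and method names are hypothetical, and the 180000 ms value is an assumption on my part, picked to stay below the roughly four minutes after which Azure is reported to close idle TCP connections. The SASL_SSL settings that Event Hubs needs are left out here.

```java
import java.util.Properties;

// Hypothetical helper; the class and method names are illustrative only.
class IdleTimeoutConfig {

    // Builds client properties that make the Kafka client close idle
    // connections itself, before the service is expected to drop them.
    static Properties clientProps(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Assumed value (3 minutes); tune it for your own environment.
        props.setProperty("connections.max.idle.ms", "180000");
        return props;
    }

    public static void main(String[] args) {
        Properties props = clientProps("my-super-hub.servicebus.windows.net:9093");
        // Print the resulting settings so they can be checked by eye.
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

These properties would be merged with your existing producer and consumer settings. With Spring Cloud Stream, I believe the same key can be passed through the Kafka binder's spring.cloud.stream.kafka.binder.configuration map, but check that against the binder version you are running.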

Good luck.