How to interpret Kafka broker-reported latency metrics

Question

I am looking at the various latency metrics reported by the Kafka broker so that I can include them on a Grafana dashboard, but I have difficulty understanding them. I have exported the metrics to Prometheus through the JMX exporter. For example, let's take the Produce request total time metric (kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce).

When I query Prometheus with "kafka_network_requestmetrics_totaltimems_count{request="Produce"}", I get a large number, e.g. 56459366. What does this big number mean?

When I query Prometheus with "kafka_network_requestmetrics_totaltimems{request="Produce"}", I get 6 rows, e.g. the following:

kafka_network_requestmetrics_totaltimems{instance="10.130.12.24:8020",job="kubernetes-pods",pod="kafka-0",quantile="0.50",request="Produce"}    2
kafka_network_requestmetrics_totaltimems{instance="10.130.12.24:8020",job="kubernetes-pods",pod="kafka-0",quantile="0.75",request="Produce"}    2
kafka_network_requestmetrics_totaltimems{instance="10.130.12.24:8020",job="kubernetes-pods",pod="kafka-0",quantile="0.95",request="Produce"}    3
kafka_network_requestmetrics_totaltimems{instance="10.130.12.24:8020",job="kubernetes-pods",pod="kafka-0",quantile="0.98",request="Produce"}    12.42
kafka_network_requestmetrics_totaltimems{instance="10.130.12.24:8020",job="kubernetes-pods",pod="kafka-0",quantile="0.99",request="Produce"}    21
kafka_network_requestmetrics_totaltimems{instance="10.130.12.24:8020",job="kubernetes-pods",pod="kafka-0",quantile="0.999",request="Produce"} 54

What do these various quantile metrics mean, and how can I calculate the average value from them?

How frequently are these metrics updated by the broker?

Lior Chaga · Accepted Answer · 2020-11-03T21:50:19

The count is simply the number of Produce requests that have been measured since the broker started. For every Produce request the broker receives, it measures the time taken to handle it, so the count is a monotonically increasing counter.
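Because the raw count only ever grows, the useful number is usually how fast it grows. Here is a minimal Python sketch of that idea, assuming you have two scraped samples of the counter with their timestamps. The function name `per_second_rate` and the sample values are made up for illustration. In Prometheus itself, `rate(kafka_network_requestmetrics_totaltimems_count{request="Produce"}[5m])` does roughly the same job.

```python
def per_second_rate(prev_count, prev_ts, curr_count, curr_ts):
    """Requests per second between two samples of a monotonically increasing counter.

    Timestamps are in seconds. If the counter went down, we assume the broker
    restarted and the counter started again from zero.
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("current sample must be newer than the previous one")
    if curr_count < prev_count:
        # Counter reset (e.g. broker restart): only the new count is known.
        delta = curr_count
    else:
        delta = curr_count - prev_count
    return delta / elapsed


# Hypothetical samples taken 60 seconds apart.
produce_rate = per_second_rate(56459366, 0.0, 56462366, 60.0)
print(f"Produce requests per second: {produce_rate}")
```

A graph of this rate on the dashboard is normally more readable than the ever-growing total.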

The 6 rows are percentiles. In your case, 50% of the Produce requests (the median) were handled in up to 2 ms. The same goes for 75% of the requests. However, 99% of your requests were handled in up to 21 ms, so you can deduce that for the 24% of requests between the 75th and 99th percentiles, handling took between 2 ms and 21 ms. You can't calculate the average from these percentiles, and you shouldn't rely on it anyway, as it's quite misleading, especially when measuring SLA (as the famous joke says: if a statistician's head is in the stove and his legs are in the freezer, then on average he feels fine...). There are many posts explaining the difference; here's one example: https://www.dynatrace.com/news/blog/why-averages-suck-and-percentiles-are-great/
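To make the difference concrete, here is a small Python sketch with invented latencies (not taken from your broker). It uses the simple nearest-rank definition of a percentile, which is an assumption for illustration; the broker's histogram may interpolate differently.

```python
import math
from statistics import mean


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at least
    a fraction q of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical handling times in ms: mostly fast, a few slow, one very slow.
latencies_ms = [2] * 95 + [20] * 4 + [500]

average_ms = mean(latencies_ms)
for q in (0.50, 0.95, 0.99, 0.999):
    print(f"p{q * 100:g}: {percentile(latencies_ms, q)} ms")
print(f"average: {average_ms} ms")
```

With this data the average lands between the typical 2 ms request and the slow outliers, so it describes none of the requests well, while the percentiles show both the typical case and the tail.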

As for how frequently these metrics are updated: they are updated constantly as requests come in. Histograms use reservoirs to give more weight to recent samples (there's no point in taking into account samples from a week ago when you look at the current request time percentiles). There are different types of reservoirs, and I don't know which one is used here, but to understand the concept you can read this post: https://medium.com/expedia-group-tech/your-latency-metrics-could-be-misleading-you-how-hdrhistogram-can-help-9d545b598374
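To illustrate the reservoir idea only, here is a Python sketch of the simplest kind, a sliding window that keeps the last N samples. This is not necessarily what the broker uses; the class name `SlidingWindowReservoir` and the sample values are invented for the example.

```python
import math
from collections import deque


class SlidingWindowReservoir:
    """Keeps only the most recent `size` samples; older ones are dropped."""

    def __init__(self, size):
        self._samples = deque(maxlen=size)

    def update(self, value):
        # Called once per measured request.
        self._samples.append(value)

    def quantile(self, q):
        if not self._samples:
            raise ValueError("no samples recorded yet")
        ordered = sorted(self._samples)
        rank = max(1, math.ceil(q * len(ordered)))  # nearest-rank quantile
        return ordered[rank - 1]


reservoir = SlidingWindowReservoir(size=100)

# An old slow period: 100 requests at 500 ms each.
for _ in range(100):
    reservoir.update(500)
p99_before = reservoir.quantile(0.99)

# Recent traffic is fast again: 100 requests at 2 ms each.
for _ in range(100):
    reservoir.update(2)
p99_after = reservoir.quantile(0.99)

print(p99_before, p99_after)
```

Once enough new requests arrive, the old slow samples no longer influence the reported percentiles. That is why the quantile values reflect recent traffic, while the count keeps growing for as long as the broker is up.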
