
A Spring Boot + Spring Integration app is monitored by Prometheus through the built-in micrometer.io support. The Spring Boot app exposes localhost:8080/actuator/prometheus, the monitoring data arrives in Prometheus and can be displayed as a graph. This is working fine.

My problem is that I get gaps in the Prometheus data. These gaps happen when the app is under heavy load. It is normal that the response times for localhost:8080/actuator/prometheus get longer when the app is very busy: in my case it is less than 1 second without load, but around 1 minute under load. The target is then shown as offline under Prometheus Status -> Targets. One possibility would be to set scrape_interval = 2m, but it is important to see more detailed information.

My question: is there a solution for this scenario? (E.g. giving priority to the monitoring URL, or storing the data temporarily in the Spring Boot app and sending it later.)

Update: I am trying to monitor the Spring Integration metrics, but for this question it does not matter which metric; it could be anything, e.g. JVM heap.


1 Answer


Under normal circumstances, querying the metrics endpoint is quite fast.

Three scenarios come to mind that could explain why it is getting slower:

a) Your app is under such heavy load that it takes too long to even accept the HTTP request. This means your app receives more requests than it can handle. In that case, give it more resources, threads, or whatever the bottleneck is. (see here)

b) You have custom gauges registered that need a lot of time to compute or fetch their value. E.g. a DB query in a gauge's getter function is a killer: every time the metrics endpoint is queried, your app has to query the database before it can render the metrics. It is even worse if you have multiple such gauges (which are evaluated sequentially) and their performance depends on your application's load (e.g. if the DB server slows down when your app is under heavy load, this makes it worse).
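One common fix for such a gauge is to take the expensive work off the scrape path: a background task refreshes a cached value on its own schedule, and the gauge getter only reads the cache. A minimal sketch using only the JDK (the names `pendingOrders` and `countPendingOrdersFromDb` are hypothetical; in Micrometer you would register the cached value with something like `Gauge.builder("orders.pending", pendingOrders, AtomicLong::get).register(registry)`):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class CachedGauge {
    // Cached value the gauge reads; refreshed in the background,
    // so scraping never triggers the expensive query.
    static final AtomicLong pendingOrders = new AtomicLong(0);

    // Stand-in for an expensive DB query (hypothetical).
    static long countPendingOrdersFromDb() {
        return 42L;
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Refresh every 15 seconds; the first run happens immediately (initial delay 0).
        scheduler.scheduleAtFixedRate(
                () -> pendingOrders.set(countPendingOrdersFromDb()),
                0, 15, TimeUnit.SECONDS);

        Thread.sleep(200); // let the first refresh run

        // The scrape-side getter is now a cheap atomic read.
        System.out.println(pendingOrders.get());
        scheduler.shutdownNow();
    }
}
```

The trade-off is that the metric can be up to one refresh interval stale, which is usually acceptable for monitoring.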

c) The cardinality of your metric labels depends on your application usage (which is a bad practice). E.g. having a label per user or per session increases the number of metrics when your application is under heavy usage. This stresses not only your application (each metric needs some memory) but also your Prometheus server, which creates files for each unique label-value combination.
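To see why this matters, each unique label-value combination becomes its own time series. A small simulation (all names hypothetical) counting the series produced by an unbounded per-user label versus a bounded label such as HTTP status:

```java
import java.util.HashSet;
import java.util.Set;

public class CardinalityDemo {
    // Counts the distinct time series created for a stream of requests.
    // Each distinct label value is one series, with its own memory and
    // storage overhead in both the app and Prometheus.
    static int seriesCount(int requests, boolean labelPerUser) {
        Set<String> series = new HashSet<>();
        for (int i = 0; i < requests; i++) {
            String labelValue = labelPerUser
                    ? "user=" + i                          // bad: unbounded cardinality
                    : "status=" + (i % 2 == 0 ? "200" : "500"); // good: bounded cardinality
            series.add(labelValue);
        }
        return series.size();
    }

    public static void main(String[] args) {
        System.out.println(seriesCount(10_000, true));  // one series per user: 10000
        System.out.println(seriesCount(10_000, false)); // bounded label set: 2
    }
}
```

The per-user variant grows linearly with usage, which is exactly what makes the metrics endpoint (and Prometheus itself) slower under load.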

What you could do, although it will not address the root cause of your issue, is to increase the value of scrape_timeout (see here).
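A sketch of what that could look like in prometheus.yml (the job name is hypothetical; note that Prometheus requires scrape_timeout to be less than or equal to scrape_interval, so a 90s timeout also forces a longer interval):

```yaml
scrape_configs:
  - job_name: "spring-boot-app"        # hypothetical job name
    metrics_path: /actuator/prometheus
    scrape_interval: 2m                # must be >= scrape_timeout
    scrape_timeout: 90s                # default is 10s
```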