0
votes

I have set up a monitoring system using Prometheus, with AWS EC2 service discovery and node exporter. I am using the following query to get CPU utilization:

100 - (avg by (instance) (irate(node_cpu_seconds_total{instance="instancexyz", mode="idle"}[5m])) * 100)

However, in one particular ASG, I am getting CPU percentage in large negative values. I opened the instance:9100/metrics link and found the idle values to be large exponential values. Here is one value I got:

node_cpu_seconds_total{cpu="0",mode="idle"} 4.25766215e+06

The metrics are working fine on all my instances except for a few. Any idea what is happening?


1 Answer

5
votes

Those "large exponential values" you are seeing are cumulative counters, printed in scientific notation. The sample you posted means that CPU core 0 has been idle for about 1,180 hours (4.25766215e6 / 3600), or roughly 49 days, since the VM was started, so the values look very reasonable.

The reason you are getting negative values is sampling. In theory, all samples are exactly scrape_interval seconds apart to the millisecond, and the network latency and exporter processing time are exactly the same for every scrape. In practice, scrapes may be delayed or even skipped, network latency varies, and your target VM can have its CPU pegged every now and then (or hang for any reason).

This means that, for example, it's perfectly possible for one sample of node_cpu_seconds_total to have value V at (nominally) time T and the next to have value V + 11 at (nominally) time T + 10s, resulting in an apparent idle ratio of 110% and therefore a CPU utilization of -10%. Or whatever values you care to come up with. irate exacerbates this problem because it always looks at only the two most recent samples, which increases the relative measurement error (the error relative to the time between the samples).
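To make the arithmetic concrete, here is a minimal Python sketch of that scenario. It is a simplified model of the description above, not Prometheus's actual implementation; the function names and the numbers (a nominal 10 s interval during which an idle core accumulates 11 s of idle time) are assumptions chosen for illustration.

```python
# Simplified model (assumption): the idle ratio is the counter increase
# divided by the *nominal* time between two samples.

def idle_ratio(v_prev, v_curr, nominal_interval_s):
    """Per-second increase of the idle counter over the nominal interval."""
    return (v_curr - v_prev) / nominal_interval_s


def cpu_utilization_percent(idle):
    """Same arithmetic as the query: 100 - idle * 100."""
    return 100 - idle * 100


# Hypothetical samples: the second scrape is effectively 1 s late, so an
# idle core has accumulated 11 s of idle time over a nominal 10 s interval.
v = 4_257_662.0
idle = idle_ratio(v, v + 11.0, 10.0)
util = cpu_utilization_percent(idle)

print(f"idle ratio: {idle:.0%}, CPU utilization: {util:.0f}%")
```

With these made-up numbers the idle ratio exceeds 100%, so the computed utilization drops below zero even though nothing is wrong with the counter itself.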

There's nothing you can do about this except accept it's not a perfect measurement and slap a clamp_min(<your_expression>, 0) on top of it. Using rate instead of irate might also reduce the error and is generally a good idea unless you're looking at your data at full resolution.
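As a rough illustration of why a longer window helps, the following Python sketch compares the relative error for the same hypothetical 1 s timing error when it is spread over a 10 s gap between two successive samples (the irate case) versus a full 5m window (roughly the rate case). The helper names are made up for this example, and clamp_min here is a plain Python stand-in that mirrors the behaviour of the PromQL function, not a Prometheus API.

```python
def relative_error(timing_error_s, span_s):
    """Timing error expressed as a fraction of the span being measured."""
    return timing_error_s / span_s


def clamp_min(value, minimum):
    """Python stand-in for PromQL's clamp_min: never return less than minimum."""
    return max(value, minimum)


timing_error_s = 1.0  # hypothetical scrape delay

irate_error = relative_error(timing_error_s, 10.0)   # two successive samples
rate_error = relative_error(timing_error_s, 300.0)   # samples across a 5m window

print(f"irate-style relative error: {irate_error:.1%}")
print(f"rate-style relative error:  {rate_error:.1%}")

# A negative utilization caused by such an error is clamped to zero.
print(clamp_min(-10.0, 0))
```

Under these assumptions the same delay distorts the two-sample calculation about thirty times more than the windowed one, and the clamp only hides whatever negative residue remains.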