I configured alerting in Grafana yesterday and now get alerts from two servers. It's always the same two servers that supposedly have high IO, high CPU, or something else.

The thing is, their load isn't actually that high. In fact, they're almost idle. All servers are configured exactly the same via Ansible, so the Telegraf config is identical on all of them.

Also, if I filter the stats in Grafana to the corresponding server, the data displayed in the graph is correct, as you can see in the screenshot below. Still, 'Test Rule' results in a false positive.

Screenshot of Grafana Graph of server with correct data and 'Test Rule' with wrong result

I checked vmstat, which also displays correct information:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0  47100 151152  20948 454556    2    2    16    38    2    1  2  1 96  0  1
 0  0  47100 151136  20948 454592    0    0     0     0  125  135  0  1 96  0  2
 0  0  47100 150408  20956 454584    0    0     0    84  222  282  1  3 93  0  4
 0  0  47100 150424  20956 454592    0    0     0     0  151  225  0  0 97  0  2
 0  0  47100 150424  20956 454592    0    0     0     0  115  140  0  0 96  0  4
 0  0  47100 150424  20956 454592    0    0     0     0  109  125  0  0 97  0  2
 0  0  47100 150424  20956 454592    0    0     0     0  121  131  0  0 98  0  2
 0  0  47100 150412  20972 454576    0    0     0    92  139  208  0  1 96  0  3
 0  0  47100 150456  20972 454592    0    0     0     0   65  117  0  0 99  0  1
 0  0  47100 150876  20972 454592    0    0     0    16  692  705  2  4 88  0  5

And here is the telegraf.log, in case something's wrong there:

2017-07-07T09:22:04Z I! Starting Telegraf (version 1.3.3)
2017-07-07T09:22:04Z I! Loaded outputs: influxdb
2017-07-07T09:22:04Z I! Loaded inputs: inputs.diskio inputs.processes inputs.swap inputs.system inputs.redis inputs.disk inputs.kernel inputs.mem inputs.net inputs.nginx inputs.postgresql inputs.cpu
2017-07-07T09:22:04Z I! Tags enabled: environment=production host=om-1-prod rails_env=production role=telegraf
2017-07-07T09:22:04Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"om-1-prod", Flush Interval:10s

Any ideas what's wrong here?

1 Answer

I kept monitoring the servers manually and did find high peaks that last only a short period of time.

So the issue here is that these peaks aren't visible in the selected time range within Grafana. They get aggregated into a smaller average, which makes it look like there have only been 40 ips. If I zoom into the corresponding time range, I see the peaks.
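To illustrate the effect, here is a minimal Python sketch of how averaging over wide time buckets can hide a short burst. The numbers are made up (they are not taken from my servers), and it assumes the graph panel averages each bucket, as a `mean()` query grouped by time would. `downsample` is just a hypothetical helper for this example.

```python
def downsample(samples, bucket_size, agg):
    """Split samples into consecutive fixed-size buckets and aggregate each one."""
    return [
        agg(samples[i:i + bucket_size])
        for i in range(0, len(samples), bucket_size)
    ]


def average(bucket):
    """Arithmetic mean of one bucket."""
    return sum(bucket) / len(bucket)


# Hypothetical data: one sample every 10 s for 10 minutes (60 samples).
# The server idles at 20, except for a 30 s burst at 220.
samples = [20] * 60
samples[30:33] = [220, 220, 220]

# Zoomed out: each point on the graph covers 5 minutes (30 samples).
per_5min_mean = downsample(samples, 30, average)
per_5min_max = downsample(samples, 30, max)

print(per_5min_mean)  # [20.0, 40.0]
print(per_5min_max)   # [20, 220]
```

The averaged series never goes above 40, while the raw samples (and the per-bucket maximum) reach 220. An alert rule that evaluates a shorter window, or a different aggregation, can therefore fire even though the zoomed-out graph looks idle.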

Long story short: There's no issue with Grafana, Telegraf or InfluxDB. The problem existed between keyboard and chair.