1
votes

I have run into a bit of trouble with what seems to be an easy question.

My scenario: I have a k8s job which can be run at any time (not a cronJob) which in turn creates a pod to perform some tasks. Once the pod performs its task it completes, thus completing the job that spawned it.

What I want: I want to alert via Prometheus if the pod is in a running state for more than 1h, signalling that the task is taking too much time. I'm interested in alerting ONLY when the duration symbolised by the arrow in the attached image exceeds 1h. No alerts should be triggered once the pod is no longer running.

What I tried: The following Prometheus metric, an instant vector whose value is either 0 (pod not running) or 1 (pod is running):

kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}

I tried to use this metric in the following formula, which is meant to compute how long the metric was 1 during a day:

(1 - avg_over_time(kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}[1d])) * 86400 > 3600

Because these pods come and go and are not always present I'm encountering the following problems:

  • The expression above starts at 86400 and only drops once the container is running, so it would trigger an alert right away.
  • The pod eventually goes away, and I do not want to send out false alerts for pods that are no longer running (even though they took over 1h to run).
1
Can you try this: sum_over_time(kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}[1h:1s]) == 3600 ? - Matt
@HelloWorld thanks. This looks to be the best solution so far. I will post a complete answer on this. - Paul Chibulcuteanu

1 Answer

2
votes

Thanks to the suggestion of @HelloWorld, I think this is the best solution to achieve what I wanted:

(sum_over_time(kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}[1d:1s]) > 3600) and (kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}==1)
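  • Sum the samples of the pod's ready state over the past day (or 6h/3h). With a 1s subquery step each sample stands for one second, so the sum is the number of seconds the pod was ready; verify that it exceeds 1h (3600s) AND
  • Check that the pod is still running, so that old pods and pods that have terminated are not taken into consideration.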
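
For completeness, here is a minimal sketch of how the expression above might be wrapped in a Prometheus alerting rule. The group name, alert name, severity label, and annotation text are hypothetical placeholders and not part of the original setup; the expression itself and the pod_name label are taken unchanged from the query above.

```yaml
# Sketch of a Prometheus rule file. The group name, alert name, severity
# label and summary text are hypothetical placeholders.
groups:
  - name: long-running-job-pods
    rules:
      - alert: PodRunningTooLong
        # Same expression as in the answer: more than 3600 ready seconds
        # within the last day, and the pod is still ready right now.
        expr: |
          (sum_over_time(kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}[1d:1s]) > 3600)
          and
          (kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"} == 1)
        # No "for:" clause: the 1h threshold is already encoded in the
        # expression, so the alert fires as soon as the expression matches.
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod_name }} has been running for more than 1h"
```

Because the right-hand side of the `and` only matches while the pod still reports ready, the alert should resolve on its own once the pod completes. The 1s subquery step is precise but makes Prometheus evaluate 86400 points per series on every rule evaluation; if minute-level precision is acceptable (an assumption about your requirements), a coarser step such as [1d:1m] with a threshold of > 60 counts ready minutes instead of seconds.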