I have run into a bit of trouble with what seems to be an easy question.
My scenario: I have a k8s Job which can be run at any time (not a CronJob) and which in turn creates a pod to perform some tasks. Once the pod finishes its task it completes, thus completing the Job that spawned it.
What I want: I want to alert via Prometheus if the pod is in a running state for more than 1h, signalling that the task is taking too long. I only want to alert when the duration symbolised by the arrow in the attached image exceeds 1h, and I want no alerts to fire once the pod is no longer running.
What I tried: the following Prometheus metric, which is an instant vector that can be either 0 (pod not running) or 1 (pod is running):
kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}
I figured I could use this metric with the following expression to compute the duration for which the metric was 1 during a day:
(1 - avg_over_time(kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}[1d])) * 86400 > 3600
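For reference, this is roughly how I would wire that expression into an alerting rule; the group name, alert name, for: duration, labels and annotations below are just placeholders:

groups:
- name: pod-duration            # placeholder group name
  rules:
  - alert: PodARunningTooLong   # placeholder alert name
    expr: (1 - avg_over_time(kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}[1d])) * 86400 > 3600
    for: 5m                     # wait a few evaluations before firing
    labels:
      severity: warning
    annotations:
      summary: "POD-A has been in a running state for more than 1h"

The for: clause only delays firing; it does not address the two problems described below.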
Because these pods come and go and are not always present, I'm encountering the following problems:
- The expression above starts at 86400 and only drops once the container is running, so it would trigger an alert straight away.
- The pod eventually goes away, and I don't want to send out false alerts for pods which are no longer running (even though they took over 1h to run).
Would an expression like the following work instead?
sum_over_time(kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}[1h:1s]) == 3600
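If it clarifies the intent: I assume that, given a regular scrape interval, the same "ready for the entire last hour" condition could also be written without the 1s subquery, which I would expect to be cheaper to evaluate:

# returns 1 only if every raw sample in the last hour was 1, i.e. the pod was ready the whole hour
min_over_time(kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}[1h]) == 1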