0
votes

No matter how much I read the Prometheus docs, I can't seem to get what I expect to see out of the queries I write. I feel like I must be approaching the task completely orthogonally to how I'm supposed to. So maybe someone can help me by working through an example.

Let's say I have a service that reads in a file and POSTs to an API endpoint for each line in the file, and let's say the service runs once a minute and takes about 10 seconds to complete.

I would like to build a graph that shows the # of successful POSTs over time. My intuition is to use a Counter or Gauge metric type because they are the simplest, but we don't really need to record an integer value ourselves since the counting of # of POSTs would be done using the Prometheus functions. So I create a Gauge JOB_SUCCESS that increments by 1 on each POST.
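For reference, a Counter is the conventional metric type for a monotonically increasing event count like this (with the Python client, roughly `prometheus_client.Counter('job_success_total', 'Successful POSTs').inc()`). Here is a minimal stand-in sketch of those semantics, assuming the `CounterSketch` class and the three-line file are made up for illustration:

```python
# Sketch of what a Prometheus Counter does: it only ever goes up,
# and Prometheus derives per-interval counts from it at query time
# with rate()/increase(). The real class lives in prometheus_client;
# this stand-in just mirrors the behavior.
class CounterSketch:
    def __init__(self, name):
        self.name = name
        self.value = 0.0

    def inc(self, amount=1.0):
        # Counters must never decrease; decreasing values belong in a Gauge.
        assert amount >= 0, "counters never decrease"
        self.value += amount

job_success = CounterSketch("job_success_total")
for line in ["line1", "line2", "line3"]:  # one POST per line in the file
    # ... POST the line to the API endpoint here ...
    job_success.inc()

print(job_success.value)  # prints 3.0
```

The scrape then exposes the running total; the per-hour or per-minute breakdown is recovered later in PromQL, not in the instrumentation.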

count(JOB_SUCCESS) shows a flat horizontal line == the # of POSTs that have occurred so far. As more POSTs occur, the line is raised for all time, so I can't tell how many POSTs occurred at hour X vs hour Y.

count_over_time(JOB_SUCCESS[N]) shows something completely different depending on the value of N, but I don't understand what it represents because it never decreases. If the job stops, the value just stays the same... even though, presumably, there's nothing to count over the last N. If the job runs once a minute for only 10 seconds, it rises for 10 seconds and plateaus for 50, then rises again. I would expect it to return to 0. How do I simply show the number of requests over time?


1 Answer

0
votes

I may have figured it out, but I'll leave this up in case someone has a better suggestion: count(rate(JOB_SUCCESS[1m]))
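For comparison, the more idiomatic queries for "requests over time" on a counter are usually rate() or increase() rather than count(rate(...)), which counts the number of *series* that have a rate, not the requests themselves. A sketch, assuming JOB_SUCCESS is a counter:

```
# Approximate number of successful POSTs in each 5-minute window:
increase(JOB_SUCCESS[5m])

# Per-second rate of POSTs, summed across all instances/label sets:
sum(rate(JOB_SUCCESS[5m]))
```

Both of these do drop toward 0 once the counter stops increasing, which matches the behavior asked for in the question.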

The only issue that I don't understand is that there seems to be an arbitrary minimum granularity, for me 1 minute. Any shorter than that and the stats disappear. Any longer and the values in the graph grow, so I'm assuming a single POST is counting towards multiple stats within the window of time.
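A likely explanation for that minimum (an assumption, since it depends on your scrape config): rate() needs at least two samples inside the range to compute a slope, so any range shorter than roughly twice your scrape_interval returns nothing. Longer ranges don't double-count POSTs, but each graph point averages over a bigger window, so brief bursts get smoothed out across more evaluation points:

```
# With scrape_interval: 30s, the smallest usable range is about 1m:
rate(JOB_SUCCESS[1m])    # works: at least 2 samples in range
rate(JOB_SUCCESS[30s])   # likely empty: at most 1 sample in range
```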