I have a PromQL query that looks at the max latency per quantile and displays the data in Grafana, but it shows data from a pod that was redeployed and no longer exists. The pod is younger than the staleness period of 15 days.

Here's the query: max(latency{quantile="..."})

The max latency it finds is from the time the old pod was throttling; shortly afterwards the pod was redeployed and latency went back to normal. Now I want to look only at the max latency of what is currently live.

All the info I have found so far about staleness says that old series should be filtered out behind the scenes, but that does not seem to be happening in my setup, and I cannot figure out what I should change.

When I manually add the specific instance ID to the query it works well, but the ID will change once the pod is redeployed again: max(latency{quantile="...", exported_instance="ID"})

I found a long list of similar questions; some are unanswered and some are not asking the same thing. The ideas that are somewhat relevant but don't solve the problem in a sustainable way are listed below.

Suggestions from the links below that were not helpful

  • Change the staleness period: won't work because it affects the whole system.
  • Restart Prometheus: won't work because it can't be done every time a pod is redeployed.
  • List each graph per machine: won't work with a max query.

Links to similar questions

The end goal

Display the max latency across all sources that are live now, dropping data from sources that no longer exist.

This question seems to be confusing retention and staleness. Can you give example time series, and what output you want? – brian-brazil

1 Answer


You can use the auto-generated metric named up to isolate your required metrics from the others. The up metric makes it easy to determine which metric sources are offline.

up{job="<job-name>", instance="<instance-id>"}: 1 if the instance is healthy, i.e. reachable, or 0 if the scrape failed.
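
For example, up == 0 lists the targets whose last scrape failed. To restrict your original query to targets that are live right now, one option is to join the latency metric with up using the and operator. This is only a sketch: it assumes latency and up share the same instance label, while your metric carries exported_instance, so the matching label (or a label_replace) may need adjusting:

max(latency{quantile="..."} and on(instance) (up == 1))

With the and join, latency series that have no matching up == 1 series are dropped, so pods that are no longer being scraped stop contributing to the max.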