1
votes

We are using Prometheus for metrics collection. Prometheus is deployed as a container; it collects metrics from various sources and stores the data on the local disk of the node where the container runs. If the node holding the container fails, we lose the metrics along with it, since Prometheus stored everything on that node's local disk. Kubernetes will detect the container failure and respawn the container on a healthy node, but the data on the old node is gone.

To solve this issue we have come up with two ideas:

  1. Decouple the whole of Prometheus from Kubernetes.

    • We would need to ensure high availability for the Prometheus server and for its data, and we would also need to handle authentication for Prometheus. There is a security concern here: Prometheus does not ship with auth by default (see prometheus-basic-auth), so we would have to put a reverse proxy in front of it to handle authentication; a sketch follows this list. Prometheus also needs to talk to internal Kubernetes components, so that channel must be secured as well.
  2. Decouple the storage alone, e.g. via an NFS-like protocol (a PV in Kubernetes terms).

    • We would need to ensure high availability for the Prometheus data, and the NFS share would need to be secured.
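For the authentication concern in option 1, here is a minimal sketch of one common pattern, assuming an nginx reverse proxy runs as a sidecar in the Prometheus pod. All names are illustrative, and the htpasswd file would be mounted from a Secret:

```yaml
# Sketch only: an nginx sidecar terminating basic auth in front of Prometheus.
# Assumes Prometheus is started with --web.listen-address=127.0.0.1:9091
# so it is reachable only through the proxy on port 9090.
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-auth-proxy            # hypothetical name
data:
  nginx.conf: |
    events {}
    http {
      server {
        listen 9090;
        location / {
          auth_basic           "Prometheus";
          auth_basic_user_file /etc/nginx/.htpasswd;   # mounted from a Secret
          proxy_pass           http://127.0.0.1:9091;  # Prometheus on localhost
        }
      }
    }
```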

Which one should we use?

If any other industry solution exists, please share that too. If either of the above has side effects I haven't mentioned, kindly let me know.

3
Using a PV is perfectly fine. – Oleg Butuzov
@Butuzov For a PV, which type of NAS would you suggest? – user11779620

3 Answers

1
votes

Short-term answer: use a PV, but probably not NFS, since you don't need multiple writers. A simple network block device (EBS, GCP Persistent Disk, etc.) is fine. Long term, HA Prometheus is a complex topic; check out the Thanos project for some ideas and tech. The Grafana Labs folks have also been experimenting with some new HA layouts for it. Expect fully HA Prometheus to be a very substantial project that requires you to dive deep into the internals.
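To make that concrete, here is a minimal PersistentVolumeClaim sketch; the storage class name is an assumption (an EBS-backed class here), so substitute whatever block-storage class your cluster provides:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data       # illustrative name
spec:
  accessModes:
    - ReadWriteOnce           # a single writer is enough; no NFS/ReadWriteMany needed
  storageClassName: gp2       # assumption: an EBS-backed class; use your cluster's own
  resources:
    requests:
      storage: 100Gi          # size according to your retention settings
```

Mount the claim at the Prometheus data directory in the pod spec; with ReadWriteOnce block storage, Kubernetes reattaches the same disk when the pod is rescheduled onto another node (within the same availability zone for cloud disks).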

1
votes

There is another option in addition to storing the Prometheus data in a Persistent Volume (PV). You can use Prometheus's remote storage adapters, as mentioned here. These adapters receive the scraped data and store it in an external database such as Elasticsearch or MySQL, and another Prometheus instance can read it back if the previous instance crashes.
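As a sketch of what that wiring looks like in prometheus.yml (the adapter service name and port are assumptions modeled on Prometheus's example remote storage adapter; your adapter's endpoints may differ):

```yaml
# prometheus.yml (fragment): ship samples out through a remote storage adapter
remote_write:
  - url: "http://remote-storage-adapter:9201/write"   # hypothetical adapter service
remote_read:
  - url: "http://remote-storage-adapter:9201/read"    # a fresh instance can query old data
```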

0
votes

Prometheus can replicate data to remote storage via the remote_write API. This means the data isn't lost when the Prometheus pod restarts in the k8s cluster, since it has already been replicated to remote storage. See how to set up remote storage in Prometheus. This example uses VictoriaMetrics, a cost-efficient open-source remote storage for Prometheus with the following features (a config sketch follows the list):

  • PromQL support
  • Low resource usage (RAM, CPU, disk, network)
  • Scales vertically and horizontally
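For example, a minimal prometheus.yml fragment (the service name is an assumption; 8428 is the default port of single-node VictoriaMetrics):

```yaml
remote_write:
  - url: "http://victoriametrics:8428/api/v1/write"   # VictoriaMetrics write endpoint
```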

I wouldn't recommend Thanos, since it doesn't prevent the loss of the most recent 2 hours of data when the Prometheus pod restarts. See this article for details.