We have a situation where we have an abundance of Spring boot applications running in containers (on OpenShift) that access centralized infrastructure (external to the pod) such as databases, queues, etc.
If a piece of central infrastructure is down, the health check returns "unhealthy" (rightfully so). The problem is that the liveliness check sees this, and restarts the pod (the readiness check then sees it's down too, so won't start the app). This is fine when only a few are available, but if many (potentially hundreds) of applications are using this, it forces restarts on all of them (crash loop).
I understand that central infrastructure being down is a bad thing. It "should" never happen. But... if it does (Murphy's law), it throws containers into a frenzy. Just seems like we're either doing something wrong, or we should reconfigure something.
A couple questions:
- If you are forced to use centralized infrastructure from a Spring boot app running in a container on OpenShift/Kubernetes, should all actuator checks still be enabled for backend? (bouncing the container really won't fix the backend being down anyway)
- Should the /actuator/health endpoint be set for both the liveliness probe and the readiness probe?
- What common settings do folk use for the readiness/liveliness probe in a spring boot app? (timeouts/interval/etc).