I've been hitting a frustrating issue where Kubernetes services randomly stop being reachable on their cluster IPs a few hours after deployment. They almost seem to age out.
My pods use hostNetwork: true and dnsPolicy: ClusterFirstWithHostNet. Here is where things get interesting: the afflicted cluster has two namespaces (staging and production). On another, otherwise identical cluster with just one namespace, the issue hasn't appeared yet.
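For reference, here is a quick way to confirm those settings on one of the affected pods (the pod name and namespace below are placeholders, not my real resource names):

# Confirm hostNetwork and dnsPolicy on one of the pods in question
kubectl -n staging get pod foo-0 \
  -o jsonpath='{.spec.hostNetwork} {.spec.dnsPolicy}{"\n"}'
# prints: true ClusterFirstWithHostNet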
Looking at the kube-proxy logs, here is what I see:
admin@gke ~ $ tail /var/log/kube-proxy.log
E0115 12:13:01.669222 5 proxier.go:1372] can't open "nodePort for staging/foo:foo-sip-1" (:31765/tcp), skipping this nodePort: listen tcp :31765: bind: address already in use
E0115 12:13:01.671353 5 proxier.go:1372] can't open "nodePort for staging/foo:http-api" (:30932/tcp), skipping this nodePort: listen tcp :30932: bind: address already in use
E0115 12:13:01.671548 5 proxier.go:1372] can't open "nodePort for staging/our-lb:our-lb-http" (:32477/tcp), skipping this nodePort: listen tcp :32477: bind: address already in use
E0115 12:13:01.671641 5 proxier.go:1372] can't open "nodePort for staging/foo:foo-sip-0" (:30130/tcp), skipping this nodePort: listen tcp :30130: bind: address already in use
E0115 12:13:01.671710 5 proxier.go:1372] can't open "nodePort for default/foo:foo-sip-0" (:30132/tcp), skipping this nodePort: listen tcp :30132: bind: address already in use
E0115 12:13:02.510177 5 proxier.go:1372] can't open "nodePort for default/our-lb:our-lb-http" (:31613/tcp), skipping this nodePort: listen tcp :31613: bind: address already in use
E0115 12:13:06.577412 5 server.go:661] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E0115 12:13:11.578446 5 server.go:661] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E0115 12:13:16.580441 5 server.go:661] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E0115 12:13:21.583691 5 server.go:661] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
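In case it helps, this is the kind of check I would run on the node to see what is actually holding those ports (the port numbers are taken from the log above; I've since torn down the broken state, so I have no output to show):

# Show which process owns one of the conflicting NodePorts and the
# kube-proxy metrics port from the log above
sudo ss -ltnp '( sport = :31765 or sport = :10249 )'

# More than one kube-proxy PID here would explain the bind errors
pgrep -a kube-proxy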
I've now deleted one namespace from the afflicted cluster and the remaining one seems to have fixed itself, but I'm curious why Kubernetes didn't warn me at resource-creation time and, if the namespaces weren't competing for resources then, why it reassigns ports later in a way that causes this. It can't be a DNS cache issue, because getent hosts shows me the right cluster IP for the service; that IP just isn't reachable. It really looks to me like a bug in the Kubernetes networking setup.
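Concretely, this is what I mean (the service name and cluster IP below are placeholders, not values from my cluster):

# DNS resolves the service to its cluster IP just fine...
getent hosts foo.staging.svc.cluster.local

# ...but connecting to that cluster IP simply times out
curl -sv --connect-timeout 5 http://10.3.240.10:80/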
Should I be creating an issue, or is there something obvious that I'm doing incorrectly?