3
votes

I've been having a really frustrating issue where Kubernetes services randomly stop being available on their cluster IPs a few hours after deployment. They almost seem to be ageing out.

My pods have hostNetwork: true and dnsPolicy: ClusterFirstWithHostNet. Here is where things get interesting: I have two namespaces (staging and production) on the afflicted cluster. On another, identical cluster with just one namespace, the issue doesn't seem to have appeared yet!
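For context, a trimmed-down pod spec with those two settings looks roughly like this (placeholder names, image and ports, not my actual manifest):

apiVersion: v1
kind: Pod
metadata:
  name: foo-sip-0
  namespace: staging
spec:
  hostNetwork: true                     # the pod shares the node's network namespace
  dnsPolicy: ClusterFirstWithHostNet    # so cluster-internal names still resolve despite hostNetwork
  containers:
  - name: foo
    image: example/foo:latest           # placeholder image
    ports:
    - containerPort: 5060               # bound directly on the node because of hostNetwork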

Looking at the kube-proxy logs, here is what I see:

admin@gke ~ $ tail /var/log/kube-proxy.log
E0115 12:13:01.669222       5 proxier.go:1372] can't open "nodePort for staging/foo:foo-sip-1" (:31765/tcp), skipping this nodePort: listen tcp :31765: bind: address already in use
E0115 12:13:01.671353       5 proxier.go:1372] can't open "nodePort for staging/foo:http-api" (:30932/tcp), skipping this nodePort: listen tcp :30932: bind: address already in use
E0115 12:13:01.671548       5 proxier.go:1372] can't open "nodePort for staging/our-lb:our-lb-http" (:32477/tcp), skipping this nodePort: listen tcp :32477: bind: address already in use
E0115 12:13:01.671641       5 proxier.go:1372] can't open "nodePort for staging/foo:foo-sip-0" (:30130/tcp), skipping this nodePort: listen tcp :30130: bind: address already in use
E0115 12:13:01.671710       5 proxier.go:1372] can't open "nodePort for default/foo:foo-sip-0" (:30132/tcp), skipping this nodePort: listen tcp :30132: bind: address already in use
E0115 12:13:02.510177       5 proxier.go:1372] can't open "nodePort for default/our-lb:our-lb-http" (:31613/tcp), skipping this nodePort: listen tcp :31613: bind: address already in use
E0115 12:13:06.577412       5 server.go:661] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E0115 12:13:11.578446       5 server.go:661] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E0115 12:13:16.580441       5 server.go:661] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E0115 12:13:21.583691       5 server.go:661] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use

I've now deleted one namespace from the afflicted cluster and the remaining one seems to have fixed itself, but I'm curious why Kubernetes didn't warn me at resource creation time, and, if the resources weren't in conflict then, why it reassigns them later in a way that causes this issue. This can't be a DNS cache issue, because getent hosts shows me the right cluster IP for the service - that IP just isn't reachable! It really looks to me like a bug in the Kubernetes networking setup.

Should I be creating an issue, or is there something obvious that I'm doing incorrectly?


1 Answer

1
votes

It sounds like you have pods with hostNetwork: true and are using Services of type: NodePort with a fixed nodePort set to the same port your pod is already using on the host.
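For illustration, such a Service might look like the sketch below (the names and the nodePort come from your logs, the other ports are guesses; this is a hypothetical example, not your actual manifest). kube-proxy has to bind the nodePort on every node, so if a hostNetwork pod already holds that port you get exactly the "bind: address already in use" errors in your log:

apiVersion: v1
kind: Service
metadata:
  name: foo
  namespace: staging
spec:
  type: NodePort
  selector:
    app: foo                 # assumed label; whatever selects your hostNetwork pods
  ports:
  - name: foo-sip-1
    port: 5060               # guessed port exposed on the cluster IP
    targetPort: 5060         # guessed port the pod listens on
    nodePort: 31765          # kube-proxy tries to open :31765 on every node and fails if something already holds it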

Generally, unless you have a very compelling use case, you should avoid hostNetwork: true. It's mostly meant for legacy applications or daemons that need access to the host's network. If you do need a Service in front of pods that are on the host network, use a Service of type: ClusterIP.
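A minimal sketch of the ClusterIP variant (same placeholder names and guessed ports as above) would be something like:

apiVersion: v1
kind: Service
metadata:
  name: foo
  namespace: staging
spec:
  type: ClusterIP            # the default type; no nodePort is allocated, so nothing has to be bound on the host
  selector:
    app: foo
  ports:
  - name: foo-sip-1
    port: 5060               # port exposed on the cluster IP
    targetPort: 5060         # port the hostNetwork pod listens on

With ClusterIP, kube-proxy only programs forwarding rules for the virtual IP and never needs to listen on a real host port, so it can't collide with your hostNetwork pods.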