Rook Ceph Operator hangs when checking for cluster status

Question

I've set up a Kubernetes cluster on DigitalOcean Ubuntu 18.04 LTS droplets, using Calico on top of a WireGuard VPN, and was able to set up nginx-ingress with Traefik as the external load balancer. I'm now setting up distributed storage with Rook Ceph by following the quickstart at https://rook.io/docs/rook/master/ceph-quickstart.html, but the monitors never seem to reach quorum (even when there is just one). Actually, monitor a does reach quorum by itself, but neither the operator nor any other monitor seems to know that, and the operator hangs when trying to check the status.

I've tried troubleshooting network issues at every layer: WireGuard, Calico and ufw. I even set ufw to temporarily allow all traffic by default, just to make sure I wasn't allowing a port on one interface while the traffic was on another (I have wg0, eth1, tunl0 and the Calico interfaces).

Then I followed the Ceph troubleshooting guide, without success: http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/#recovering-a-monitor-s-broken-monmap

I've been at this for 4 days and I'm out of ideas.

Here's how I set up the storage cluster:

cd cluster/examples/kubernetes/ceph
kubectl apply -f common.yaml
kubectl apply -f operator.yaml
kubectl apply -f cluster-test.yaml

Running kubectl get pods returns

NAME                                      READY   STATUS    RESTARTS   AGE
pod/rook-ceph-agent-9ws2p                 1/1     Running   0          24s
pod/rook-ceph-agent-v6v9n                 1/1     Running   0          24s
pod/rook-ceph-agent-x2jv4                 1/1     Running   0          24s
pod/rook-ceph-mon-a-74cc6db5c8-8s5l5      1/1     Running   0          9s
pod/rook-ceph-operator-7cd5d8bd4c-pclxp   1/1     Running   0          25s
pod/rook-discover-24cfj                   1/1     Running   0          24s
pod/rook-discover-6xsnp                   1/1     Running   0          24s
pod/rook-discover-hj4tc                   1/1     Running   0          24s

However, when I try to check the status of the monitors, from the operator pod I get:

# This hangs forever
kubectl exec -it rook-ceph-operator-7cd5d8bd4c-pclxp -- ceph status

# This also hangs forever
kubectl exec -it rook-ceph-operator-7cd5d8bd4c-pclxp -- ceph ping mon.a

# This returns "[errno 2] error calling ping_monitor",
# which I guess it should, because mon.b does not (and should not) exist,
# but I expected a response such as "mon.b does not exist"
kubectl exec -it rook-ceph-operator-7cd5d8bd4c-pclxp -- ceph ping mon.b
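When diagnosing a hang like this, it can help to make the ceph CLI give up after a deadline rather than block indefinitely. The following is a minimal sketch, not part of the original setup: the helper name ceph_bounded and the 15-second value are illustrative assumptions, while --connect-timeout is the ceph CLI's own option for bounding the connection attempt to the cluster.

```shell
# Sketch: run a ceph command inside a pod, but fail after a deadline
# instead of hanging. The helper name and the 15 s value are illustrative.
# Add "-n <namespace>" to kubectl if your context does not already
# default to the namespace the Rook pods run in.
ceph_bounded() {
  local pod="$1"
  shift
  # --connect-timeout tells the ceph CLI to stop trying to reach a
  # monitor after the given number of seconds and exit non-zero.
  kubectl exec "$pod" -- ceph --connect-timeout 15 "$@"
}

# Usage (pod name taken from the pod listing in the question):
#   ceph_bounded rook-ceph-operator-7cd5d8bd4c-pclxp status
#   ceph_bounded rook-ceph-operator-7cd5d8bd4c-pclxp ping mon.a
```

A non-zero exit after the deadline at least separates "the monitor cannot be reached" from "the command is merely slow".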

Pinging the monitor pod from the operator works just fine, by the way.
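A successful ICMP ping does not by itself show that the monitor's TCP port is reachable, since a firewall or overlay network can pass one and drop the other. Below is a minimal sketch of a TCP-level check. It assumes bash and the coreutils timeout command are available wherever it runs, and that the monitor listens on Ceph's standard monitor ports (6789 for the v1 messenger, 3300 for v2). The helper name tcp_open and the MON_IP placeholder are illustrative.

```shell
# Sketch: test whether a TCP port accepts connections, with a deadline.
# Exit status: 0 = connected, 124 = timed out (traffic likely dropped),
# any other non-zero value = connection refused or similar error.
tcp_open() {
  local host="$1" port="$2" secs="${3:-5}"
  # bash's /dev/tcp pseudo-device opens a TCP connection on redirect.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Usage, run from a shell inside the operator pod (MON_IP is a placeholder
# for the monitor's pod or service IP):
#   tcp_open "$MON_IP" 6789 && echo "mon port reachable"
```

A timeout (status 124) would point at traffic being dropped somewhere along the path, whereas a refusal suggests the packets arrive but nothing is listening on that address.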

Operator logs https://gist.github.com/figassis/0a3f499f5e3f79a430c9bd58718fd29f#file-operator-log

Monitor a logs https://gist.github.com/figassis/0a3f499f5e3f79a430c9bd58718fd29f#file-mon-a-log

Monitor a status, obtained directly from the monitor pod via its admin socket https://gist.github.com/figassis/0a3f499f5e3f79a430c9bd58718fd29f#file-mon-a-status

Also, I don't know if it's related, but the monitor logs keep showing "No Filesystems configured", which I assume should not affect connectivity. If that were an error, it should be returned in the status response rather than causing a hang, correct? — Assis Ngolo
@Crou Yes, I have. It has basically the same tools as the Ceph operator and monitors, so I can call ceph status from the toolbox or from the operator. I did, and the result is the same: all of the commands hang. — Assis Ngolo

Sarvesha Dudhgaonkar · Accepted Answer · 2019-08-20T09:18:31

You can run the ceph status command inside the Ceph toolbox pod.

https://github.com/rook/rook/blob/master/Documentation/ceph-toolbox.md
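As a rough sketch of what that looks like in practice, the helpers below locate the toolbox pod and run a ceph command in it. They assume the toolbox has already been deployed as described in the linked document, and that it uses the namespace rook-ceph and the pod label app=rook-ceph-tools from the Rook example manifests. The function names and the ROOK_NS variable are illustrative.

```shell
# Assumption: the namespace and label are the defaults from the Rook
# example manifests. Override ROOK_NS if your cluster differs.
NS="${ROOK_NS:-rook-ceph}"

# Print the name of the first pod carrying the toolbox label.
toolbox_pod() {
  kubectl -n "$NS" get pod -l app=rook-ceph-tools \
    -o jsonpath='{.items[0].metadata.name}'
}

# Run an arbitrary ceph subcommand inside the toolbox pod.
toolbox_ceph() {
  kubectl -n "$NS" exec "$(toolbox_pod)" -- ceph "$@"
}

# Usage:
#   toolbox_ceph status
```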
