First off, I'm new to Ceph. I want to use Ceph for home use, migrating off ZFS. To learn, I got on GCP and set up some compute engines (e2-standard-2), all with Ubuntu 20.04 Minimal, 20 GB of OS disk, and a bunch of 10 GB disks to simulate data disks.
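For anyone who wants to reproduce the test environment, this is roughly how the VMs were created. It's a sketch, not my exact setup: the instance names, zone, and number of data disks are placeholders, and the extra 10 GB disks only ended up attached to the first node.

```
# Hypothetical sketch of one test VM; names/zone/disk counts are placeholders.
gcloud compute instances create ceph-node-1 \
  --zone=us-central1-a \
  --machine-type=e2-standard-2 \
  --image-family=ubuntu-minimal-2004-lts \
  --image-project=ubuntu-os-cloud \
  --boot-disk-size=20GB \
  --create-disk=size=10GB,auto-delete=yes \
  --create-disk=size=10GB,auto-delete=yes \
  --create-disk=size=10GB,auto-delete=yes
```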
Following the cephadm guide for a new cluster, I was able to create a cluster with 3 nodes, each running a mon, mgr, and mds. However, to mirror my home setup, all the OSDs were on the first host; I know that's not recommended, but I'm limited by the hardware I have at home. I was able to get CephFS working and mounted, use it as a PV for a Kubernetes cluster, etc.
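Roughly what I ran, from memory rather than an exact transcript; hostnames, IPs, and device paths are placeholders:

```
# On the first node (the bootstrap host):
cephadm bootstrap --mon-ip 10.128.0.2

# Distribute the cluster SSH key, then add the other two nodes:
ssh-copy-id -f -i /etc/ceph/ceph.pub root@ceph-node-2
ssh-copy-id -f -i /etc/ceph/ceph.pub root@ceph-node-3
ceph orch host add ceph-node-2
ceph orch host add ceph-node-3

# All OSDs on the first host only (its attached 10 GB disks):
ceph orch daemon add osd ceph-node-1:/dev/sdb
ceph orch daemon add osd ceph-node-1:/dev/sdc
ceph orch daemon add osd ceph-node-1:/dev/sdd

# CephFS (orchestrator deploys the MDS daemons):
ceph fs volume create cephfs
```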
Next I wanted to test DR, so I shut down the host with the OSDs (simulating OS disk loss); that was also the machine I ran 'cephadm bootstrap --mon-ip *<mon-ip>*' from. The remaining two nodes still sort of worked, but they were much less responsive to status and other queries. The dashboard sometimes worked and sometimes timed out.
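For reference, these are the kinds of checks I was running from a surviving node to see what state the cluster was in; nothing exotic, just the standard status commands, and all of them were slow or occasionally hung:

```
# Run from one of the two surviving nodes:
ceph -s                                   # overall health; very slow to return
ceph mon stat                             # which mons are in/out of quorum
ceph quorum_status --format json-pretty   # quorum details
ceph orch status                          # is the orchestrator/mgr responding?
```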
I stood up a new compute engine and attached the OSD HDDs to the new machine, then tried 'ceph orch host add *NEWHOST*' from one of the working hosts, and it just hangs (it has a copy of the client.admin keyring). There are tons of errors in the logs because it can't talk to the original node. I also tried following the manual steps to create a mon and OSDs on that NEWHOST, but adding the OSDs gave me errors.
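Roughly what I tried; NEWHOST and the IP are placeholders, and the exact commands may have differed slightly:

```
# From a surviving host that has the admin keyring:
ceph cephadm get-pub-key > ceph.pub
ssh-copy-id -f -i ceph.pub root@NEWHOST
ceph orch host add NEWHOST                  # <-- this just hangs

# Manual attempt per the docs on adding mons/OSDs; the OSD activation
# step is where I got errors:
ceph orch daemon add mon NEWHOST:10.128.0.5
cephadm ceph-volume -- lvm activate --all   # bring up the reattached OSDs
```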
So, two main questions (and a bunch of follow-ups):
- What is so special about the bootstrap host? Isn't the point of having distributed nodes that if I did lose one, everything still works? Is it because of my small cluster size that I'm noticing these issues? Would this be resolved by running an 'admin/bootstrap node' on a Pi and backing up its SD card? What am I doing wrong that I can't even add a new host after losing the 'original' host? If I shut down the other hosts instead, I can still add new hosts.
- DR documentation. I know my setup isn't standard, but people are using Ceph for home use and small deployments, and I can't imagine nobody has tested this or had it happen to them. The closest thing I found was here; it doesn't work for me, most likely because of my lack of Ceph familiarity. If someone helps me figure out the DR recovery steps for this, I'll write up the documentation.
ceph status? - eblock