cephadmn treats first host differently?? [DR Scenario]

Question

First off, I'm new to ceph. I want to use ceph for home use; migrating off zfs. So to learn, I got on GCP and setup some compute engines (e2-standard-2) all with Ubuntu 20.04 Minimal with 20GB of disk space and bunch of 10 gb disks to simulate data disks.

Following the guide for cephadm new cluster, I was able to create a cluster with 3 nodes each having a mon,mgr and mds. However to mirror my home setup all the OSDs were on the first host; I know not recommended but limited by available hardware. Able to get cephfs working and mounted and PV for Kubernetes cluster, etc..

Next I wanted to test DR, so shutdown the host with the OSDs (simulating OS disk loss); also the machine I did 'cephadm bootstrap --mon-ip *<mon-ip>*' from. The remaining two nodes still sort of worked, but they were much less responsive to queries on status and other information. Going to the dashboard sometimes worked, sometimes timed out.

Stood up a new compute engine and attached the OSDs hdd to the new machine, then tried to 'ceph orch host add *NEWHOST*' on of the working hosts and it just hangs (has a copy of the client.admin.keyring). (tons of errors in the logs because it can't talk to original node) I tried following the manual steps and creating a mon and osds on that NEWHOST, but adding the OSDs gave me errors.

So two main (and a bunch of follow-up) questions

What is so special about the bootstrap host? Isn't the point to have distributed nodes so that if did lose one everything still works? Is it because of my small cluster size that I am noticing these issues? Would this issue be resolved by running an 'admin/bootstrap node' on a PI and backing up the SD card? What am I doing wrong that I can't even add a new host after losing the 'original' host; I can shutdown the other hosts and still add new hosts.
DR documentation. I know my setup isn't standard; but people are using this for home use/small deployments and I can't imagine someone hasn't tested this or have this happen to them. The closest thing I found was here, it doesn't work for me most likely because of my lack of ceph familiarity. If someone helps me figure out the DR recovery steps for this, I'll write up the documentation.

There are a bunch of things we don't know yet to be able to give you an answer. But there isn't anything special on the bootstrap node, it's really just there to make it easier to get started. Please clarify what "The remaining two nodes still sort of worked, but they were much less responsive to queries on status and other information." means exactly. The bootstrap node is your first MON node so it should be a designated one. What is the current ceph status? — eblock

Norbert_Cs Norbert_Cs · Accepted Answer · 2020-12-18T09:43:11

the point is not on monitoring node , rather on OSDs , you have at least 3 object replica on each OSD, so you should have more than 3 OSD. ( The object’s contents within a placement group are stored in a set of OSDs, and placement groups do not own the OSD, they share it with other placement groups from the same pool or even other pools. ) is case of DR :

1 .The OSD fails and all copies of the object it contains are lost. For all objects within the placement group the number of replica suddenly drops from three to two.

Ceph starts recovery for this placement group by choosing a new OSD to re-create the third copy of all objects.
If another OSD, within the same placement group, fails before the new OSD is fully populated with the third copy. Some objects will then only have one surviving copies.
If a third OSD, within the same placement group, fails before recovery is completed then this OSD contained the only remaining copy of an object, it is permanently lost.

So the main thing is to chose a right pg number during creating the pools :

Total PGs = ( OSDs×100)/poolsize

pool size is a number of replica (in this case 3 )

cephadmn treats first host differently?? [DR Scenario]

1 Answers