
I know you shouldn't create a Ceph cluster on a single node. But this is just a small private project, so I don't have the resources or need for a real cluster.

Still, I want to get a cluster up, and I'm having some issues. Currently my cluster is down and I get the following health warnings.

[root@rook-ceph-tools-6bdcd78654-vq7kn /]# ceph status
  cluster:
    id:     12d9fbb9-73f3-4229-9ef4-6b7670324629
    health: HEALTH_WARN
            Reduced data availability: 33 pgs inactive
            68 slow ops, oldest one blocked for 26686 sec, osd.0 has slow ops
 
  services:
    mon: 1 daemons, quorum g (age 15m)
    mgr: a(active, since 44m)
    osd: 1 osds: 1 up (since 8m), 1 in (since 9m)
 
  data:
    pools:   2 pools, 33 pgs
    objects: 0 objects, 0 B
    usage:   1.0 GiB used, 465 GiB / 466 GiB avail
    pgs:     100.000% pgs unknown
             33 unknown

and

[root@rook-ceph-tools-6bdcd78654-vq7kn /]# ceph health detail
HEALTH_WARN Reduced data availability: 33 pgs inactive; 68 slow ops, oldest one blocked for 26691 sec, osd.0 has slow ops
[WRN] PG_AVAILABILITY: Reduced data availability: 33 pgs inactive
    pg 2.0 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.0 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.1 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.2 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.3 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.4 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.5 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.6 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.7 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.8 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.9 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.a is stuck inactive for 44m, current state unknown, last acting []
    pg 3.b is stuck inactive for 44m, current state unknown, last acting []
    pg 3.c is stuck inactive for 44m, current state unknown, last acting []
    pg 3.d is stuck inactive for 44m, current state unknown, last acting []
    pg 3.e is stuck inactive for 44m, current state unknown, last acting []
    pg 3.f is stuck inactive for 44m, current state unknown, last acting []
    pg 3.10 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.11 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.12 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.13 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.14 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.15 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.16 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.17 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.18 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.19 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.1a is stuck inactive for 44m, current state unknown, last acting []
    pg 3.1b is stuck inactive for 44m, current state unknown, last acting []
    pg 3.1c is stuck inactive for 44m, current state unknown, last acting []
    pg 3.1d is stuck inactive for 44m, current state unknown, last acting []
    pg 3.1e is stuck inactive for 44m, current state unknown, last acting []
    pg 3.1f is stuck inactive for 44m, current state unknown, last acting []
[WRN] SLOW_OPS: 68 slow ops, oldest one blocked for 26691 sec, osd.0 has slow ops

ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus (stable)

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:58:53Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:51:04Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

kubeadm version: &version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:56:34Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

If anyone knows where to start or how to fix my issue, please help!

There are some default settings, like a replication size of 3 for new pools (Ceph is designed as a failure-resistant storage system, so you need redundancy). That means you need three OSDs to get all PGs active. Add two more disks and your cluster will most likely get to a healthy state. If you can't add more disks, you can try to reduce min_size and size of your pool to 1 (which is dangerous), and for that you'll also need this setting: osd_crush_chooseleaf_type = 0. In general it's questionable why you would use Ceph if you can't have redundancy; why not use the disk with a regular file system? – eblock
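
A minimal sketch of what eblock describes, run from the rook-ceph-tools pod. The pool and rule names below are placeholders (check yours with ceph osd pool ls detail), and remember that size 1 removes all redundancy:

    ceph osd pool ls detail              # list pools with their current size/min_size
    ceph osd pool set <pool> size 1      # newer releases may also require --yes-i-really-mean-it and mon_allow_pool_size_one=true
    ceph osd pool set <pool> min_size 1
    # osd_crush_chooseleaf_type = 0 in ceph.conf only affects CRUSH rules created afterwards;
    # for an already existing pool, one option is a rule whose failure domain is "osd":
    ceph osd crush rule create-replicated single-osd default osd
    ceph osd pool set <pool> crush_rule single-osd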

1 Answer


Yes, I agree with what eblock mentioned above. You should have at least 3 OSDs (a minimum of 3 disks, 3 volumes, or whatever) if you want 3 replicas of each object. An object's contents within a placement group are stored in a set of OSDs, and placement groups do not own an OSD; they share it with other placement groups from the same pool or even other pools (you can inspect this yourself with the commands after the list below).

  • If one OSD fails, all the copies of objects it contains are lost. For all objects within the placement group, the number of replicas suddenly drops from three to two. Ceph starts recovery for this placement group by choosing a new OSD on which to re-create the third copy of all objects.

  • If another OSD within the same placement group fails before the new OSD is fully populated with the third copy, some objects will then have only one surviving copy.

  • If a third OSD within the same placement group fails before recovery is complete, and it contained the only remaining copy of an object, that object is permanently lost.
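
To see this on your own cluster, here is a short sketch of read-only checks, using one of the PG IDs from your ceph health detail output (the pool name is a placeholder):

    ceph osd pool ls detail           # replica count (size) and min_size per pool
    ceph osd pool get <pool> size     # replica count of a single pool
    ceph pg map 3.0                   # which OSDs are in the acting set of PG 3.0
    ceph pg dump_stuck inactive       # list the PGs that are stuck inactive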

So it's very important to choose the right PG count when creating the pools:

Total PGs = (OSDs × 100) / pool size

where pool size is the number of replicas (in this case 3).
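
As a rough worked example (the "round up to the next power of two" step is the usual rule of thumb, not something stated above): with 3 OSDs and a pool size of 3 you get (3 × 100) / 3 = 100, which you would round up to 128 PGs in total and then split across your pools. The per-pool value can be set at creation time or changed later, e.g. (pool name is a placeholder):

    ceph osd pool create mypool 32 32    # create a pool with pg_num = pgp_num = 32
    ceph osd pool set mypool pg_num 32   # adjust an existing pool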