Ceph Journal Device Ownership

Some devices do not keep their ownership across reboots because of their driver. I have some OCZ SSDs used as Ceph journals that are affected by this.

So I’ve created a udev rule to assign them to the proper user during boot.

Add this to /etc/udev/rules.d/89-ceph-journal.rules

KERNEL=="oczpcie*?", SUBSYSTEM=="block", OWNER="ceph", GROUP="disk", MODE="0660"

Then retrigger udev to test:

udevadm trigger --action=add
ls -lh /dev/oczpcie_3_0_ssd*
brw-rw---- 1 ceph disk 251, 0 Jul 14 13:01 /dev/oczpcie_3_0_ssd
brw-rw---- 1 ceph disk 251, 1 Jul 14 13:01 /dev/oczpcie_3_0_ssd1
brw-rw---- 1 ceph disk 251, 4 Jul 14 13:01 /dev/oczpcie_3_0_ssd4
brw-rw---- 1 ceph disk 251, 5 Jul 14 13:01 /dev/oczpcie_3_0_ssd5
brw-rw---- 1 ceph disk 251, 6 Jul 14 13:01 /dev/oczpcie_3_0_ssd6
brw-rw---- 1 ceph disk 251, 7 Jul 14 13:01 /dev/oczpcie_3_0_ssd7
brw-rw---- 1 ceph disk 251, 8 Jul 14 13:01 /dev/oczpcie_3_0_ssd8
brw-rw---- 1 ceph disk 251, 9 Jul 14 13:01 /dev/oczpcie_3_0_ssd9
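
Checking the listing by eye gets tedious with many partitions. The check can be scripted; this is a minimal sketch, assuming GNU stat is available. The function name check_owner is hypothetical and not part of Ceph or udev.

```shell
#!/bin/sh
# Hypothetical helper: report every path whose owner is not the expected user.
# Usage: check_owner EXPECTED_USER PATH...
check_owner() {
    expected="$1"
    shift
    rc=0
    for dev in "$@"; do
        # stat -c %U prints the owning user name (GNU coreutils).
        owner=$(stat -c %U "$dev") || { rc=1; continue; }
        if [ "$owner" != "$expected" ]; then
            echo "$dev: owned by $owner, expected $expected"
            rc=1
        fi
    done
    # Non-zero if any path had the wrong owner or could not be read.
    return "$rc"
}
```

For the devices above this would be called as check_owner ceph /dev/oczpcie_3_0_ssd* and it prints nothing when every node is owned by ceph.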

Ceph Recovery Limits and Tips

Production Maintenance

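These are the essential settings for keeping a cluster responsive to clients during backfill / recovery. They throttle recovery and backfill to a minimum and give client operations the highest priority.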

ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph tell osd.* injectargs '--osd-max-recovery-threads 1'
ceph tell osd.* injectargs '--osd-client-op-priority 63'
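
The settings above can be applied in one go with a small wrapper. This is only a sketch: the function name apply_recovery_limits and the CEPH_CMD variable are made up here for illustration (CEPH_CMD allows a dry run), and the option names and values are the ones from the commands above.

```shell
#!/bin/sh
# Hypothetical wrapper: inject each recovery/backfill limit into all OSDs.
# CEPH_CMD is an assumed override for dry runs; it defaults to the real ceph CLI.
apply_recovery_limits() {
    while read -r opt val; do
        # 'osd.*' is quoted so the shell never expands it against local files.
        # stdin is redirected so the command cannot consume the option list.
        "${CEPH_CMD:-ceph}" tell 'osd.*' injectargs "--${opt} ${val}" < /dev/null
    done <<'EOF'
osd-recovery-op-priority 1
osd-max-backfills 1
osd-recovery-max-active 1
osd-max-recovery-threads 1
osd-client-op-priority 63
EOF
}
```

Running CEPH_CMD=echo apply_recovery_limits prints the arguments that would be passed to ceph without touching the cluster.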

Taking Out an OSD on Small Clusters

Sometimes, typically in a “small” cluster with few hosts (for instance, a small testing cluster), taking out an OSD can trigger a CRUSH corner case where some PGs remain stuck in the active+remapped state. If this happens, mark the OSD back in with:

ceph osd in {osd-num}

to return to the initial state and then, instead of marking the OSD out, set its CRUSH weight to 0 with:

ceph osd crush reweight osd.{osd-num} 0

After that, you can watch the data migration, which should run to completion. The difference between marking the OSD out and reweighting it to 0 is that marking it out leaves the weight of the bucket containing the OSD unchanged, whereas reweighting updates the bucket weight (reducing it by the OSD’s weight). The reweight command can therefore be preferable in a “small” cluster.
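
The two steps above can be combined into one helper. This is a sketch under assumptions: drain_osd and the CEPH_CMD dry-run override are hypothetical names, and the only Ceph commands used are the two shown above.

```shell
#!/bin/sh
# Hypothetical helper: mark an OSD back in, then drain it via CRUSH weight 0.
# Usage: drain_osd OSD_NUM
# CEPH_CMD is an assumed override for dry runs; it defaults to the real ceph CLI.
drain_osd() {
    osd="$1"
    # Accept only a plain numeric OSD id.
    case "$osd" in
        ''|*[!0-9]*) echo "usage: drain_osd OSD_NUM" >&2; return 2 ;;
    esac
    # Reweight only if marking the OSD in succeeded.
    "${CEPH_CMD:-ceph}" osd in "$osd" &&
    "${CEPH_CMD:-ceph}" osd crush reweight "osd.${osd}" 0
}
```

For example, CEPH_CMD=echo drain_osd 3 prints the two commands for osd.3 without running them. The resulting migration can then be followed with ceph -w.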