Ceph Recovery Limits and Tips

Production Maintenance

These are the essentials for keeping a cluster working during backfill / recovery. You can tune the recovery sleep interval (in seconds) depending on how fast your devices are. Some SSDs can live with 0.01 and others prefer values like 0.25. I’ve worked with a lot of old, worn-out SSDs that do heavy garbage collection, and they work best at 0.25 or more. Keep in mind that the higher the value, the slower recovery proceeds: you are trading a longer recovery for better client performance.

ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
ceph tell osd.* injectargs '--osd_max_backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph tell osd.* injectargs '--osd-max-recovery-threads 1'
ceph tell osd.* injectargs '--osd-client-op-priority 63'
# This is the magic nut butter for keeping things running smoothly. The default is 0s; we use 0.50s so things stay stable while we recover.
ceph tell osd.* injectargs '--osd_recovery_sleep 0.50'
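If you apply these throttles often, it can help to wrap them in a small function so the sleep value is the only thing you change. The following is only a sketch: the function name apply_recovery_throttle and the DRY_RUN switch are made up for illustration, and it covers the priority, backfill, max-active and sleep settings from the list above rather than every option. With DRY_RUN=1 it prints the commands instead of running them, which is a safe way to check what would be sent.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ceph): apply the recovery throttles in one call.
# Usage: apply_recovery_throttle [sleep_seconds]   (defaults to 0.50)
# Set DRY_RUN=1 to print the commands instead of executing them.
apply_recovery_throttle() {
    sleep_s="${1:-0.50}"
    for opt in \
        "osd_recovery_op_priority 1" \
        "osd_max_backfills 1" \
        "osd_recovery_max_active 1" \
        "osd_client_op_priority 63" \
        "osd_recovery_sleep $sleep_s"
    do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Print only; nothing is sent to the cluster.
            echo "ceph tell osd.* injectargs '--$opt'"
        else
            # Quote osd.* so the shell never expands it against local files.
            ceph tell 'osd.*' injectargs "--$opt"
        fi
    done
}
```

For example, DRY_RUN=1 apply_recovery_throttle 0.25 shows the five commands with a 0.25s sleep, and running it without DRY_RUN sends them to every OSD.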

Taking Out OSDs on Small Clusters

Sometimes, typically in a “small” cluster with few hosts (for instance, a small testing cluster), taking out the OSD can trigger a CRUSH corner case where some PGs remain stuck in the active+remapped state. If this happens, mark the OSD back in with:

ceph osd in {osd-num}

to return to the initial state and then, instead of marking the OSD out, set its CRUSH weight to 0 with:

ceph osd crush reweight osd.{osd-num} 0

After that, you can watch the data migration, which should run to completion. The difference between marking the OSD out and reweighting it to 0 is that in the first case the weight of the bucket that contains the OSD is not changed, whereas in the second case the bucket's weight is updated (decreased by the OSD's weight). The reweight command can therefore be preferable on a “small” cluster.
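The sequence above can be sketched as one small function. The names drain_osd and run, and the DRY_RUN switch, are hypothetical and only here for illustration. The function marks the OSD in, sets its CRUSH weight to 0, and prints one cluster status snapshot. It does not wait for the migration to finish, so you still need to keep watching the cluster afterwards.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Ceph).
# run: execute a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# drain_osd: drain an OSD via its CRUSH weight instead of marking it out.
# Usage: drain_osd <osd-num>
drain_osd() {
    osd_num="$1"
    # Accept only a plain numeric OSD id.
    case "$osd_num" in
        ''|*[!0-9]*) echo "usage: drain_osd <osd-num>" >&2; return 1 ;;
    esac
    run ceph osd in "$osd_num"                    # back to the initial state
    run ceph osd crush reweight "osd.$osd_num" 0  # drain via CRUSH weight
    run ceph -s                                   # one status snapshot
}
```

Running DRY_RUN=1 drain_osd 3 prints the three commands for osd.3 without touching the cluster.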