It’s amazing how crappy hard disks are! No really! We operate a 100-disk Ceph pool for our object-based backups, and it’s almost a weekly task to replace a failing drive. I’ve only seen one go entirely unresponsive; normally we get read errors and failures that stop the OSD service and show up in dmesg as faults.
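When a drive starts going, the usual tells are the OSD being marked down and I/O errors in the kernel log. Something like the following points at the culprit (the grep patterns are just illustrations, adjust to whatever your kernel actually logs):

ceph osd tree | grep down
dmesg | grep -iE 'i/o error|medium error'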
To change the CRUSH weight of a drive:
ceph osd crush reweight osd.90 1.82
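By convention the CRUSH weight is the drive’s capacity in TiB, so 1.82 here is roughly a 2 TB disk. To see the current weights before touching one:

ceph osd tree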
To replace a drive:
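The failing OSD’s daemon has usually stopped itself by the time we get to it, but if it’s still running, stop it first. The exact command depends on the init system; these are common forms rather than anything specific to our hosts:

sudo systemctl stop ceph-osd@31     # systemd hosts
sudo service ceph stop osd.31       # older sysvinit hosts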
#Remove old disk
ceph osd out osd.31        # mark it out so data rebalances off it
ceph osd crush rm osd.31   # remove it from the CRUSH map
ceph osd rm osd.31         # remove the OSD from the cluster
ceph auth del osd.31       # delete its cephx key
#Provision new disk (--overwrite-conf lets ceph-deploy push the local ceph.conf even if the remote copy differs)
ceph-deploy osd prepare --overwrite-conf hostname01:/dev/diskname
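Depending on the ceph-deploy release, prepare on its own may be enough (udev activates the OSD for you) or you may need a separate ceph-deploy osd activate step; check the docs for your version. Either way, once the new OSD comes up you can watch it join and backfill:

ceph osd tree     # the new osd should show up under hostname01, up and in
ceph -w           # follow recovery/backfill progress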
To move a host into a different root bucket:
ceph osd crush move hostname01 root=BUCKETNAME
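Moving a host under a different root remaps everything on it, so expect a wave of backfill. A quick check that it landed where you meant and that recovery is progressing:

ceph osd tree     # hostname01 should now sit under BUCKETNAME
ceph -s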