The Ceph defaults for recovery are a little too aggressive for most devices. The settings below give you a more reasonable recovery speed that does not tank the system as hard but still yields a quick, stable recovery.
ceph config set osd osd_recovery_sleep_hdd 0.25
ceph config set osd osd_recovery_sleep_ssd 0.05
ceph config set osd osd_recovery_sleep_hybrid 0.10
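To confirm the values landed in the cluster configuration database, they can be read back with `ceph config get`. The helper below is only a sketch: the function name `show_recovery_sleep` and the `CEPH` override variable are my own conventions, not part of Ceph.

```shell
# Hypothetical helper: read back the three recovery sleep settings.
# CEPH is an assumed override so the loop can be dry-run without a cluster,
# e.g. CEPH="echo ceph". It defaults to the real ceph CLI.
show_recovery_sleep() {
  for dev in hdd ssd hybrid; do
    printf 'osd_recovery_sleep_%s = ' "$dev"
    # Left unquoted on purpose so "echo ceph" splits into two words.
    ${CEPH:-ceph} config get osd "osd_recovery_sleep_${dev}"
  done
}
```

Run `show_recovery_sleep` on a node with admin credentials. With `CEPH="echo ceph"` set first, it only prints the commands it would run.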
Sometimes you have failures that cannot be fixed, e.g. an EC 2+1 pool with 2 drives failing. (By the way, 2+1 was the recommended default EC profile in 14.x.) You should use 8+3 at minimum to prevent this!
Warning: everything below guarantees data loss on the affected PG.
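The arithmetic behind that advice is short: an EC k+m pool keeps k data shards plus m coding shards per object, survives the loss of at most m shards, and costs (k+m)/k in raw space. A small sketch, where the function name `ec_summary` is made up for illustration:

```shell
# Hypothetical helper: print failure tolerance and raw/usable ratio for EC k+m.
ec_summary() {
  awk -v k="$1" -v m="$2" 'BEGIN {
    printf "EC %d+%d: max lost shards %d, raw/usable %.3f\n", k, m, m, (k + m) / k
  }'
}

ec_summary 2 1   # EC 2+1: max lost shards 1, raw/usable 1.500
ec_summary 8 3   # EC 8+3: max lost shards 3, raw/usable 1.375
```

So two failed drives in the same PG are already one too many for 2+1, while 8+3 tolerates three and uses less raw space per usable byte.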
ceph pg PGID query | jq .acting
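For EC pools, my understanding is that the position in the acting set corresponds to the shard number, and that a slot with no OSD shows up as 2147483647. Under those assumptions, this sketch turns the `acting` array into a shard-to-OSD list. The function name `acting_shards` and the sample JSON are invented for illustration. Real `ceph pg PGID query` output has many more keys.

```shell
# Hypothetical helper: map acting-set positions to shard ids.
# Reads `ceph pg PGID query` style JSON on stdin; only the "acting" key is used.
acting_shards() {
  jq -r '.acting | to_entries[]
         | "s\(.key) -> " + (if .value == 2147483647
                             then "no OSD"
                             else "osd.\(.value)" end)'
}

# Made-up sample input, not real cluster output.
printf '%s\n' '{"acting":[4,2147483647,7]}' | acting_shards
```

Against a live cluster this would be `ceph pg PGID query | acting_shards`.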
# Stop the OSD that holds the PG and figure out the shard ID of the PG; generally it's .s0, .s1, .s2, depending on your EC config.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --pgid PGID.s1/2 --force --op remove
# Restart the OSD, wait for it to attempt to peer, then stop it again and mark the PG complete.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --pgid PGID.s1/2 --op mark-complete
# Tell the customer that your mistake is acceptable.
ceph pg PGID mark_unfound_lost delete
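Because the order matters and every step is destructive, it can help to print the full sequence for review before running anything. The following is a dry-run sketch only: the function name `plan_pg_removal` is invented, the `ceph-osd@ID` systemd unit name assumes a non-containerized deployment, and the final OSD start is my assumption rather than something stated above.

```shell
# Hypothetical dry-run helper: prints the command sequence, executes nothing.
# Arguments: OSD id, PG shard id as passed to --pgid, plain PG id.
plan_pg_removal() {
  osd_id="$1"; pg_shard="$2"; pgid="$3"
  data_path="/var/lib/ceph/osd/ceph-${osd_id}/"
  # Unit name ceph-osd@ID is an assumption; cephadm deployments name units differently.
  echo "systemctl stop ceph-osd@${osd_id}"
  echo "ceph-objectstore-tool --data-path ${data_path} --pgid ${pg_shard} --force --op remove"
  echo "systemctl start ceph-osd@${osd_id}"
  echo "systemctl stop ceph-osd@${osd_id}"
  echo "ceph-objectstore-tool --data-path ${data_path} --pgid ${pg_shard} --op mark-complete"
  # Bringing the OSD back up at the end is assumed, not taken from the steps above.
  echo "systemctl start ceph-osd@${osd_id}"
  echo "ceph pg ${pgid} mark_unfound_lost delete"
}

plan_pg_removal 0 PGID.s1 PGID
```

Between the first start and the second stop, wait for the OSD to attempt peering, as described above.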