Ceph Recovery Limits and Tips

Production Maintenance

These are the essential settings for keeping a cluster responsive to clients while backfill or recovery is in progress.

ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
ceph tell osd.* injectargs '--osd_max_backfills 1'
ceph tell osd.* injectargs '--osd_recovery_max_active 1'
ceph tell osd.* injectargs '--osd_recovery_threads 1'
ceph tell osd.* injectargs '--osd_client_op_priority 63'
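Note that injectargs only changes the running daemons; the values revert on restart. To keep the throttles persistent, the same options can also go in ceph.conf (a sketch; assumes a release where ceph.conf is the primary configuration mechanism):

```ini
# /etc/ceph/ceph.conf: persist the recovery throttles across OSD restarts
[osd]
osd_recovery_op_priority = 1
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_client_op_priority = 63
```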

Taking an OSD Out on Small Clusters

Sometimes, typically on a “small” cluster with few hosts (for instance a small testing cluster), taking an OSD out can trigger a CRUSH corner case where some PGs remain stuck in the active+remapped state. If that happens, mark the OSD in again with:

ceph osd in {osd-num}

to return to the initial state, and then, instead of marking the OSD out, set its weight to 0 with:

ceph osd crush reweight osd.{osd-num} 0

After that, you can watch the data migration, which should run to completion. The difference between marking the OSD out and reweighting it to 0 is that in the first case the weight of the bucket containing the OSD is not changed, whereas in the second case the bucket weight is updated (decreased by the OSD's weight). The reweight approach is therefore sometimes preferable on a “small” cluster.
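As a sketch, the drain sequence might look like the script below. The ceph() stub makes it a safe dry run that only echoes each command (remove the stub to execute for real), and the OSD id 12 is a placeholder:

```shell
ceph() { echo "ceph $*"; }           # dry-run stub: echoes instead of executing

osd=12                               # placeholder OSD id

ceph osd crush reweight osd.$osd 0   # drain: bucket weight shrinks with the OSD
# ...wait here until all PGs are active+clean before continuing...
ceph osd out $osd                    # then mark the (now empty) OSD out
```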

Neutron DB upgrade failure.

"Cannot change column 'ipsec_site_conn_id': used in a foreign key constraint 'cisco_csr_identifier_map_ibfk_1'") [SQL: u'ALTER TABLE cisco_csr_identifier_map MODIFY ipsec_site_conn_id VARCHAR(36) NULL']

The upgrade SQL cannot run because of the foreign key constraint on the column.


Drop the foreign key constraint on the ipsec_site_conn_id column, reduce the column from VARCHAR(64) to the VARCHAR(36) that the migration expects, then re-create the constraint.
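Expressed as SQL, the repair might look like the sketch below; it only prints the statements so they can be reviewed before piping them into mysql. The referenced table and column for the re-created constraint (ipsec_site_connections.id) are an assumption taken from the usual neutron-vpnaas schema, so check them with SHOW CREATE TABLE first:

```shell
# Build the repair SQL; review it, then pipe into: mysql neutron
sql=$(cat <<'SQL'
ALTER TABLE cisco_csr_identifier_map
  DROP FOREIGN KEY cisco_csr_identifier_map_ibfk_1;
ALTER TABLE cisco_csr_identifier_map
  MODIFY ipsec_site_conn_id VARCHAR(36) NULL;
ALTER TABLE cisco_csr_identifier_map
  ADD CONSTRAINT cisco_csr_identifier_map_ibfk_1
  FOREIGN KEY (ipsec_site_conn_id)
  REFERENCES ipsec_site_connections (id);  -- referenced table/column assumed
SQL
)
printf '%s\n' "$sql"
```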


Increase Ceph cluster pg_num and pgp_num without downtime

From the mailing list:

In fact, when you increase your PG count, the new PGs have to peer first, and during that time a lot of PGs will be unreachable. The best way to increase the number of PGs in a cluster (you'll need to adjust the number of PGPs too) is:


  • Don’t forget to apply Goncalo’s advice (the recovery-throttling settings above) to keep your cluster responsive for client operations. Otherwise, all the I/O and CPU will be consumed by recovery and the cluster will be unreachable. Make sure all of these parameters are in place before changing the PG counts.
  • stop scrub and deep-scrub operations and wait for any in flight to finish:

ceph osd set noscrub
ceph osd set nodeep-scrub

  • set your cluster in maintenance mode with:
ceph osd set norecover
ceph osd set nobackfill
ceph osd set nodown
ceph osd set noout

Wait until the cluster has no scrub or deep-scrub operations in flight anymore.
  • increase pg_num with a small increment, such as 256
  • wait for the cluster to create and peer the new PGs (about 30 seconds)
  • increase pgp_num with the same increment
  • wait for the cluster to create and peer again (about 30 seconds)

(Repeat the last four steps until you reach the pg_num and pgp_num you want.)

At this point, your cluster is still functional.
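The increment loop above can be sketched as a dry run. The ceph() stub echoes each command instead of executing it (remove it to run for real), and the pool name (rbd), the counts, and the step are placeholders:

```shell
ceph() { echo "ceph $*"; }        # dry-run stub; remove to execute for real

pool=rbd                          # placeholder pool name
current=512 target=2048 step=256  # placeholder PG counts

while [ "$current" -lt "$target" ]; do
  current=$((current + step))
  ceph osd pool set "$pool" pg_num "$current"
  # sleep 30                      # wait for the new PGs to be created and peer
  ceph osd pool set "$pool" pgp_num "$current"
  # sleep 30
done
```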

  • Now unset the maintenance flags:
ceph osd unset noout
ceph osd unset nodown
ceph osd unset nobackfill
ceph osd unset norecover

It will take some time for all the PGs to settle, but at the end you will have a cluster with all PGs active+clean. Throughout the operation your cluster remains functional, provided you have kept the throttling parameters (Goncalo's advice) in place.

  • When all the PGs are active+clean, re-enable the scrub and deep-scrub operations:

ceph osd unset noscrub
ceph osd unset nodeep-scrub

These are handy tips: http://cephnotes.ksperis.com/blog/2017/03/03/dealing-with-some-osd-timeouts

Recovering CentOS 7 with software RAID (mdadm)

Boot into CentOS rescue mode from a live disk.

Edit /etc/mdadm.conf so it lists the RAID member devices:

DEVICE /dev/sda1

DEVICE /dev/sdb1

mdadm --examine --scan

mdadm --examine --scan >> /etc/mdadm.conf

mdadm --assemble --scan /dev/mdX

mount /dev/mdX2 /mnt/sysroot

mount /dev/mdX1 /mnt/sysroot/boot

mount --bind /sys /mnt/sysroot/sys

mount --bind /proc /mnt/sysroot/proc

mount --bind /dev /mnt/sysroot/dev

chroot /mnt/sysroot/

grub2-mkconfig -o /boot/grub2/grub.cfg

exit

umount /mnt/sysroot/sys

umount /mnt/sysroot/proc

umount /mnt/sysroot/dev

umount /mnt/sysroot/boot

umount /mnt/sysroot/

sync
reboot

Free up disk space from deleted files under running processes.

Often a large log file grows and needs to be removed, but most of the time the space cannot actually be reclaimed by “deleting” or “clearing” the file until the service holding it open releases its file descriptor.


Identify the file:

List recently deleted files whose descriptors have not yet been released:


root@osc-1015 #> lsof -a +L1
COMMAND      PID USER   FD   TYPE DEVICE   SIZE/OFF NLINK   NODE NAME
systemd-j   1059 root  txt    REG  253,4     278808     0  11628 /usr/lib/systemd/systemd-journald;570ba957 (deleted)
systemd-l   1451 root  txt    REG  253,4     584560     0  33117 /usr/lib/systemd/systemd-logind;570ba957 (deleted)
monitor     1617 root    5w   REG  253,4        500     0 261586 /var/log/openvswitch/ovsdb-server.log-20160404 (deleted)
monitor     1617 root    7u   REG  253,4        141     0     17 /tmp/tmpfsSX0WX (deleted)
ovsdb-ser   1619 root    7u   REG  253,4        141     0     17 /tmp/tmpfsSX0WX (deleted)
monitor     1722 root    3w   REG  253,4     474455     0 261589 /var/log/openvswitch/ovs-vswitchd.log-20160404 (deleted)
ceph-osd   20462 root  txt    REG  253,4   11589728     0  33573 /usr/bin/ceph-osd;570b9a69 (deleted)
ceph-osd   20686 root  txt    REG  253,4   11589728     0  33573 /usr/bin/ceph-osd;570b9a69 (deleted)
qemu-kvm  107850 qemu    8w   REG  253,4 2207794598     0 261623 /var/lib/nova/instances/8921a9ef-81c4-4a06-be00-7cad86bd6a1c/console.log (deleted)

In this instance we need to clear the console.log file.


Release the kernel lock:

Now we will release the space held by the kernel. The key fields are the PID and FD: strip the mode letter (here w) from the FD column and use just the number.

 root@osc-1015#> : > "/proc/107850/fd/8"


Once run, the space is freed immediately; the process keeps its file descriptor and can continue writing to the now-truncated file.
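The mechanism is easy to reproduce end to end on a Linux box (assumes /proc is available): open a file, unlink it, then truncate it through /proc/&lt;pid&gt;/fd exactly as above.

```shell
tmp=$(mktemp)
exec 9>"$tmp"                    # hold a write descriptor open, like the service does
echo "some log data" >&9
rm "$tmp"                        # unlink: the blocks stay allocated while fd 9 is open
: > "/proc/$$/fd/9"              # truncate through procfs, freeing the space
size=$(wc -c < "/proc/$$/fd/9")  # 0: the data is gone, but fd 9 is still writable
exec 9>&-
```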