Fixing a Ceph Mon map after disaster!

Ceph's weakest link is configuration. Once a cluster is deployed it is incredibly durable and will survive most mistakes without punishment. However, adding a monitor that is unreachable from all machines can yield a very broken cluster that cannot be managed.

For example, if you add a new monitor and the automatically detected IP (Ansible or Kolla) isn't correct, possibly a loopback or other assigned IP, you will lose the ability to use the Ceph tools on the cluster because of a broken monitor map config.

So here's what you need to know, in a nutshell, to fix it:

  1. Stop your monitors
  2. Export a monitor map from the last known good monitor
  3. Edit the monitor map to fix the broken entry
  4. Repeat this for all the monitors that were “working”.
  5. Inject the monitor maps on those monitors
  6. Start the monitors and check that they form a quorum.
# Extract the monmap from the last known good monitor (step 2)
ceph-mon -c /etc/ceph/cluster-name-ceph.conf -i MONITOR_NAME --extract-monmap /tmp/monmap
# Inspect it, remove the broken entry, and verify (step 3)
monmaptool --print /tmp/monmap
monmaptool --rm bad-host-entry /tmp/monmap
monmaptool --print /tmp/monmap
# Inject the fixed monmap back into the monitor (step 5)
ceph-mon -c /etc/ceph/cluster-name-ceph.conf -i MONITOR_NAME --inject-monmap /tmp/monmap
# Make sure permissions are still correct, then start the mon (step 6)
chown ceph:ceph -R /var/lib/ceph/mon/cluster-monitor-name/
systemctl start ceph-mon.target
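
Once the monitors are back up, a quick sanity check with the normal Ceph tools confirms they have formed a quorum:

# Both of these should show all of the remaining monitors in quorum
ceph -s
ceph quorum_status --format json-pretty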



Network Switch PC

Several years ago I was given a small network switch from my high school.  The switch was defective and dropped packets constantly, so I wanted to give it new life.

Render

Removed the audio port riser from the motherboard so it would fit in the unit all the way.

Got a 1U server heatsink from Dynatron.  Clears the lid perfectly with the backplate!

Fired up the system to do a load test and see whether the blower provided adequate cooling.

Got our Intel 1G NIC mounted with a flexible PCIe x4 adapter cable.


Soldered the internal terminals of the uplink port to a Cat5 cable and plugged it into our NIC.


Made some lexan disk mounts and airflow containment.

Stacked SATA connectors are a challenge; figuring out the layout and finding two compatible cables took some doing.


Soldered USB headers onto four of the Ethernet ports and made Ethernet-to-USB adapter cables!

Installing Linux with software RAID 1.

All done, you would never know!

Build the NAS from hell from an old Nimble CS460

About 5 years ago we bought some Nimble storage arrays for customer services… well, those things are out of production, and since they have the street value of three pennies I figured it was time to reverse engineer them and use them for other purposes.

The enclosure is made by Supermicro; it's a bridge-bay server which has two E5600-based systems attached to one side of the SAS backplane and two internal 10G interfaces. It appears they boot an OS image from a USB drive and then store configuration on a shared LVM or some sort of cluster filesystem on the drives themselves. Each controller has what looks like a 1GB NVRAM-to-flash PCIe card that is used to ack writes as they come in, and the writes get mirrored internally over the 10G interfaces.

I plan to use one controller (server) as my Plex media box and the other one for virtual machines. The plan right now is to use Btrfs for the drives and bcache for SSD acceleration of the block devices. I can run iSCSI over the internal interface to provide storage to the second controller as the VM host.
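
For the bcache part, the rough shape of it (just a sketch, the device names are placeholders and not this box's actual layout) is to register the spindle array as the backing device and an SSD as the cache device:

# register backing device (the spindle array) and cache device (an SSD)
make-bcache -B /dev/md0
make-bcache -C /dev/sdb
# attach the cache set to the backing device (UUID comes from the make-bcache -C output)
echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach
# then the filesystem goes on the cached device
mkfs.btrfs /dev/bcache0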

To be continued.

— Update

Found out both of my controllers had bad motherboards: one was fine with a single CPU but would randomly restart, the other wouldn't POST. I feel bad for anyone still running a Nimble, it's a ticking time bomb. So I grabbed two controllers off eBay for $100 shipped; they got here today and both were good. I went ahead and flashed the firmware to the vanilla Supermicro image so I could get access to the BIOS. I had to use the internal USB port, as Nimble's firmware disables the rest of the USB boot devices, and the BIOS password is set even with defaults so you can't log in. I tried the passwords available on the ole interwebs but nothing seemed to work; it only accepts 6 characters but the online passwords are 8-12.


Looks like bcachefs is gonna be the next badass filesystem now that Btrfs has been dropped by Red Hat. It will have full write offloading and cache support like ZFS, so we can use the NVRAM card. Speaking of write cache, I have an email in to Netlist to try and get the kernel module for their 1G NVRAM write cache card. Worst case scenario, I have to pull it out of the kernel Nimble was using…

As of writing this I have both controllers running CentOS 7, installed to their own partitions on the first drive in the array, and I have /boot and the bootloader installed to the 4G USB drives that Nimble had their bootloader on.


sda 8:0 0 558.9G 0 disk
sdb 8:16 0 558.9G 0 disk
sdc 8:32 0 558.9G 0 disk
sdd 8:48 0 558.9G 0 disk
sde 8:64 0 1.8T 0 disk
sdf 8:80 0 1.8T 0 disk
sdg 8:96 0 1.8T 0 disk
sdh 8:112 0 1.8T 0 disk
sdi 8:128 0 1.8T 0 disk
sdj 8:144 0 1.8T 0 disk
sdk 8:160 0 1.8T 0 disk
sdl 8:176 0 1.8T 0 disk
sdm 8:192 0 1.8T 0 disk
sdn 8:208 0 1.8T 0 disk
sdo 8:224 0 1.8T 0 disk
sdp 8:240 0 1.8T 0 disk
sdq 65:0 0 3.8G 0 disk

And I went ahead and created an MD RAID array on six of the spindle disks, with LVM on top, to get started messing with it. I still need to get bcachefs compiled into the kernel and give that a go; that will come with time!

Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdj[6] sdi[4] sdh[3] sdg[2] sdf[1] sde[0]
      9766912000 blocks super 1.2 level 5, 512k chunk, algorithm 2 [6/5] [UUUUU_]
      [=>...................]  recovery =  7.7% (151757824/1953382400) finish=596.9min speed=50304K/sec
      bitmap: 5/15 pages [20KB], 65536KB chunk
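
For reference, building an array like the one above with LVM on top is roughly the following (a sketch; the device names match the lsblk output and the volume group name is just an example):

mdadm --create /dev/md0 --level=5 --raid-devices=6 /dev/sd[e-j]
# put LVM on top of the md device
pvcreate /dev/md0
vgcreate nas_vg /dev/md0
lvcreate -l 100%FREE -n data nas_vg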

Maybe I’ll dabble with iSCSI tomorrow.

— Update

Installed Plex tonight, and spent some time getting Sonarr and other misc tools for acquiring metadata and video from the interballs. Also started investigating bcache and bcachefs deployment on CentOS. http://10sa.com/sql_stories/?p=1052

Also started investigating some water blocks to potentially use water cooling on my NAS… it's too loud, and buying different heatsinks doesn't seem very practical when a water block is $15 on eBay.


— Update

I am definitely going to use water cooling; the 40mm fans are really annoying and this system has rather powerful E5645 CPUs which have decent thermal output.   I found some 120mm aluminum radiators on eBay for almost nothing, so two blocks + fittings + hose is going to be around $80 per system.  I need to find a cheap pump option, but I think I know what I'm doing there.

Here's a picture of one of the controller modules with the fans and a CPU removed.


An 80mm fan fits perfectly, and two of the three bolt holes even line up to mount it in the rear of the chassis.  I will most likely order some better fans from Delta with PWM/speed capability so that the Supermicro smart BIOS can properly speed them up and down.   You can see that Supermicro/Nimble put zero effort into airflow management in these systems.  They are using 1U heatsinks with no ducting at all, so airflow is "best effort". I would guess the front CPU probably runs 40-50C most of its life simply because its airflow is only created by a fixed 40mm fan in front of it.


— Update

Welp, I got the news I figured I would about the NV1 card from Netlist: it is EOL and they stopped driver development for it.  They were nice enough to send me ALL of the documentation and the kernel module though. It supports up to kernel 2.6.38, so you could run the latest CentOS 6 and have it supported.. maybe I'll mess with that?  I attached it here in case anyone wants the firmware or Linux kernel module driver for the Netlist NV1.  Netlist-1.4-6 Release

UCS Server Troubleshooting

If you're having inventory or hardware issues during discovery in UCS, there are a few commands that can help clear things up and get the system to inventory properly.

The latest method I've learned about from TAC is to do the following (6/5 being the chassis/slot of the problem blade):

decommission slot 6/5
reset slot 6/5
acknowledge slot 6/5

Ceph Recovery Limits and Tips

Production Maintenance

These are the essentials for keeping a cluster working while doing backfill / recovery.  You can play with the recovery sleep interval depending on how fast your devices are.  Some SSDs can live with 0.01 and others prefer values like 0.25.   I've worked with a lot of OLD, worn-out SSDs that do a lot of garbage collection, and they work best at 0.25 or more.  Keep in mind that the higher the value, the less the cluster will queue for recovery. It's a trade-off: longer recovery versus better client performance.

ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
ceph tell osd.* injectargs '--osd_max_backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph tell osd.* injectargs '--osd-client-op-priority 63'
#This is the magic nut butter for keeping things running smooth, default is 0s, we use 0.50s so things are stable while we recover.
ceph tell osd.* injectargs '--osd_recovery_sleep 0.50'
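
While the recovery runs you can watch backfill progress and client IO with the usual status commands, and once the cluster is healthy again the sleep can go back to its default of 0:

watch -n 5 ceph -s
ceph osd pool stats
# back to the default once recovery is done
ceph tell osd.* injectargs '--osd_recovery_sleep 0'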

Taking out OSDs on Small Clusters

Sometimes, typically in a "small" cluster with few hosts (for instance a small testing cluster), taking out an OSD can spawn a CRUSH corner case where some PGs remain stuck in the active+remapped state. If you hit this case, you should mark the OSD back in with:

ceph osd in {osd-num}

to get back to the initial state, and then, instead of marking the OSD out, set its weight to 0 with:

ceph osd crush reweight osd.{osd-num} 0

After that, you can watch the data migration, which should run to completion. The difference between marking the OSD out and reweighting it to 0 is that in the first case the weight of the bucket which contains the OSD is not changed, whereas in the second case the weight of the bucket is updated (decreased by the OSD's weight). Reweighting can therefore be the better option on a "small" cluster.
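
Putting that together, draining an OSD out of a small cluster (osd.7 is just an example id) looks something like this:

# drain by zeroing the crush weight instead of marking the OSD out
ceph osd crush reweight osd.7 0
# wait for the data migration to finish (ceph -s), then remove it
ceph osd out 7
systemctl stop ceph-osd@7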


High end consumer SSD benchmarks

Running consumer SSDs in a server has been deemed hazardous and silly… but that's only the case when you're utilizing a hardware RAID solution.

Provided you have UPS systems and software storage that can talk to the disks directly, it's perfectly safe.  We use Ceph!

These were tested with fio on a Dell M620 with an H310 controller in JBOD mode.
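
The exact fio job files aren't included here, but a roughly equivalent mixed random read/write run looks like this (a sketch; the block size, queue depth, and runtime are assumptions, and the test will destroy whatever is on the target device):

fio --name=randrw --filename=/dev/sdX --direct=1 --ioengine=libaio \
    --rw=randrw --rwmixread=50 --bs=4k --iodepth=32 \
    --runtime=60 --time_based --group_reporting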

Micron/Crucial M500

running IO "sequential read" test... 
	result is 491.86MB per second

 running IO "sequential write" test... 
	result is 421.42MB per second

 running IO "seq read/seq write" test... 
	result is 228.74MB/184.88MB per second

 running IO "random read" test... 
	result is 240.35MB per second
	equals 61530.2 IOs per second

 running IO "random write" test... 
	result is 230.34MB per second
	equals 58968.2 IOs per second

 running IO "rand read/rand write" test... 
	result is 93.90MB/94.01MB per second
	equals 24038.8/24067.5 IOs per second

Micron/Crucial M550

running IO "sequential read" test... 
	result is 523.79MB per second

 running IO "sequential write" test... 
	result is 476.59MB per second

 running IO "seq read/seq write" test... 
	result is 211.70MB/173.50MB per second

 running IO "random read" test... 
	result is 253.36MB per second
	equals 64861.0 IOs per second

 running IO "random write" test... 
	result is 233.42MB per second
	equals 59754.2 IOs per second

 running IO "rand read/rand write" test... 
	result is 102.42MB/102.28MB per second
	equals 26219.5/26184.0 IOs per second

Micron M600

running IO "sequential read" test... 
 result is 507.47MB per second

running IO "sequential write" test... 
 result is 477.18MB per second

running IO "seq read/seq write" test... 
 result is 198.38MB/166.73MB per second

running IO "random read" test... 
 result is 244.66MB per second
 equals 62633.2 IOs per second

running IO "random write" test... 
 result is 238.35MB per second
 equals 61017.5 IOs per second

running IO "rand read/rand write" test... 
 result is 103.10MB/102.95MB per second
 equals 26393.8/26354.0 IOs per second

Sandisk 960GB SSD Extreme Pro

running IO "sequential read" test... 
 result is 394.66MB per second

running IO "sequential write" test... 
 result is 451.28MB per second

running IO "seq read/seq write" test... 
 result is 181.48MB/158.89MB per second

running IO "random read" test... 
 result is 255.99MB per second
 equals 65533.5 IOs per second

running IO "random write" test... 
 result is 223.86MB per second
 equals 57309.2 IOs per second

running IO "rand read/rand write" test... 
 result is 71.47MB/71.46MB per second
 equals 18296.0/18294.2 IOs per second

Crucial MX300 1TB

running IO "sequential read" test... 
	result is 504.80MB per second

 running IO "sequential write" test... 
	result is 501.97MB per second

 running IO "seq read/seq write" test... 
	result is 239.47MB/210.71MB per second

 running IO "random read" test... 
	result is 175.78MB per second
	equals 45000.0 IOs per second

 running IO "random write" test... 
	result is 291.85MB per second
	equals 74713.5 IOs per second

 running IO "rand read/rand write" test... 
	result is 137.10MB/137.09MB per second
	equals 35096.8/35095.0 IOs per second

Ceph — Basic Management of OSD location and weight in the crushmap

It's amazing how crappy hard disks are!   No really!   We operate a 100-disk Ceph pool for our object-based backups and it's almost a weekly task to replace a failing drive.   I've only seen one go entirely unresponsive, but normally we get read errors and other media faults that stop the OSD service and show up in dmesg.


To change the weight of a drive:

ceph osd crush reweight osd.90 1.82

To replace a drive:

#Remove old disk
ceph osd out osd.31
ceph osd crush rm osd.31
ceph osd rm osd.31
ceph auth del osd.31
#Provision new disk
ceph-deploy osd prepare --overwrite-conf hostname01:/dev/diskname

Move a host into a different root bucket.

ceph osd crush move hostname01 root=BUCKETNAME
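
After a reweight or a move, you can sanity check the resulting layout and weights with:

ceph osd tree
ceph osd df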

Linux FCOE + Dell Force10 MXL + Brocade VDX Switches + EMC VNX

Huge write-up coming.


Unless you're a Cisco (Nexus) shop end-to-end, there are a few design considerations you need to take into account when it comes to delivering FCoE to your Dell blade servers.

Things to Consider:

-How is the FC from your storage array being encapsulated into Ethernet?

Some storage arrays allow for the direct export of FCoE.  Some storage arrays have only FC connectivity options.  In that case you will need a device to encapsulate FC into FCoE, an FCF (Fibre Channel Forwarder).  Some example FCF devices would be the Brocade VDX 6740, Brocade VDX 6730, and Cisco Nexus 5000.

-Are you running some vendor proprietary fabric that allows for multi-hop FCoE like FabricPath or VCS?

If so, great!  If not, you're gonna have a fun time attempting to forward FCoE beyond the first switch. (Here is a blog explaining those options.)

-Are your servers connected to a true Fibre Channel Forwarder (FCF) access switch, or are they connected to an FCoE Initialization Protocol (FIP) snooping access bridge (switch)?

FIP Snooping Bridges (FSB) vs Fibre Channel Forwarders (FCF): An FSB must connect to an FCF in order for FCoE to function.  An FCF is an FSB that also provides FC services, like the name server, as well as FC/FCoE encapsulation.

-If you are using FIP Snooping accesses switches, how are these switches multi-homed?

-FIP Snooping Bridges carrying FCoE cannot be multi-homed to more than one FCF by any means.  No vLAG, mLAG or any other type of split chassis LACP, no spanning-tree, no dual-homing, period.

-FIP Snooping Bridges can, in some cases, connect to a single FCF using multiple links bundled in a standard LACP LAG.

-How are your servers multi-homed?

Servers cannot be connected to a pair of FCFs using vLAG or mLAG.  Servers also cannot be connected to a stack or pair of FSBs using vLAG or mLAG.

What we Have Done:

We have 3 different designs we have implemented.  All of them have their benefits and drawbacks.  This is our attempt to explain them and show you how to configure them.

Build 1.  Brocade VDX switches configured in a logical chassis cluster (VCS) providing FC to FCoE encapsulation as well as access to a multi-homed server using round-robin load balancing, not LACP.

Pros: In a perfect world this is how everything would work.  Redundancy without any extra links and minimal configuration.  Completely converged.

Cons: Dell and Brocade have not come together to build a VDX switch for the M1000e chassis yet.

Notes: You could use 10G pass-through modules in the back of the chassis to connect directly to VDX switches, but that's at least 96 fibers for a 3-chassis rack and 128 for a 4-chassis rack.

FC-VCS-Server

Build 2. Brocade VDX switches configured in a logical chassis cluster (VCS) providing FC to FCoE encapsulation.  VDX switches connected to Dell MXL switches using vLAGs as well as a dedicated FCoE link per switch.  Each server is then multi-homed to a pair of MXL switches using round-robin load balancing.

Pros: Redundant.  Converged-ish.

Cons: Complicated.  More vendors.  There are 4 places in this network where a failure could result in exactly half of your storage paths being lost.  May* require use of Uplink Failure Detection on the FSBs to properly fail FCoE after the failure of an FCF.

Notes: FCoE links between the VDX and MXL cannot be multi-homed like the data path.  FCoE links can be bundled into a LACP LAG to provide additional bandwidth, but there are specific rules regarding which port groups on the switches you can and cannot use.

FC-VCS-FCOE

Build 3. EMC VNX directly injecting FCoE into Brocade VDX switches configured in a logical chassis cluster (VCS).  VDX switches connected to Dell MXL switches using a single link for data and FCoE.  Each server is then multi-homed to a pair of MXL switches using round-robin load balancing.  This same idea could be applied if the storage was FC only as the VDXs will do the encapsulation.

Pros: Converged.

Cons: More vendors.  There are 4 places in this network where a failure could result in exactly half of your storage paths being lost.  May* require use of Uplink Failure Detection on the FSBs to properly fail FCoE after the failure of an FCF.  Data path redundancy is lost.

Notes: This is an older design using VDX 6730s, which are now end-of-life.  The 6730s do not allow FCoE to traverse the TRILL fabric, so each path from the storage array to the server is completely isolated to one side of the network.

FCOE-VDX-MXL-Server


Configuration:

All of these configurations assume the Brocade VCS fabric is already built and using all default FCoE settings, maps, VLANs, etc.

Build 1

  Brocade VDX interfaces connecting to storage array exporting FCoE.

interface TenGigabitEthernet 1/0/1
mtu 9216
no fabric isl enable
no fabric trunk enable
switchport
switchport mode trunk
switchport trunk allowed vlan all
switchport trunk tag native-vlan
spanning-tree shutdown
fcoeport default
no shutdown

Brocade VDX interfaces connecting to storage array exporting FC.

interface FibreChannel 1/0/1
no isl-r_rdy
trunk-enable
fec-enable
no shutdown

Brocade VDX interfaces connecting to server.

interface TenGigabitEthernet 1/0/2
mtu 9216
no fabric isl enable
no fabric trunk enable
switchport
switchport mode trunk
switchport trunk allowed vlan all
switchport trunk tag native-vlan
spanning-tree shutdown
fcoeport default
no shutdown

Build 2.

  Brocade VDX interfaces connecting to storage array exporting FCoE.

interface TenGigabitEthernet 1/0/1
mtu 9216
no fabric isl enable
no fabric trunk enable
switchport
switchport mode trunk
switchport trunk allowed vlan all
switchport trunk tag native-vlan
spanning-tree shutdown
fcoeport default
no shutdown

Brocade VDX interfaces connecting to storage array exporting FC.

interface FibreChannel 1/0/1
no isl-r_rdy
trunk-enable
fec-enable
no shutdown

Brocade VDX vLAG interface connecting to Dell MXL LAG to provide data-path.

interface Port-channel 1
vlag ignore-split
mtu 9216
switchport
switchport mode trunk
switchport trunk allowed vlan all
switchport trunk tag native-vlan
spanning-tree shutdown
no shutdown

Dell MXL LAG connecting to Brocade VDX vLAG to provide data-path.

no ip address
mtu 12000
portmode hybrid
switchport
no shutdown

Brocade VDX interface connecting to Dell MXL interface to provide FCOE

interface TenGigabitEthernet 1/0/1
mtu 9216
no fabric isl enable
no fabric trunk enable
switchport
switchport mode trunk
switchport trunk allowed vlan none
switchport trunk tag native-vlan
spanning-tree shutdown
fcoeport default
no shutdown

Dell MXL interface connecting to Brocade VDX interface to provide FCoE

interface TenGigabitEthernet 0/52
no ip address
mtu 12000
portmode hybrid
switchport
fip-snooping port-mode fcf
!
protocol lldp
no advertise dcbx-tlv ets-reco
dcbx port-role auto-upstream
no shutdown

Dell MXL VLAN configuration.

interface Vlan 1002
no ip address
mtu 2500
tagged TenGigabitEthernet 0/1-32,41-52
fip-snooping enable
no shutdown

Dell MXL feature configuration.

dcb-map FLEXIO_DCB_MAP_PFC_OFF
no pfc mode on
!
feature fip-snooping
fip-snooping enable
!
protocol lldp

Dell MXL interface connecting to Server.

interface TenGigabitEthernet 0/1
no ip address
mtu 12000
portmode hybrid
switchport
spanning-tree pvst edge-port bpduguard
!
protocol lldp
dcbx port-role auto-downstream
no shutdown

 Build 3.

Brocade VDX to Dell MXL

Dell MXL to Brocade VDX

Dell MXL to server


Brocade FCOE to FCF Deployment Guide – http://community.brocade.com/dtscp75322/attachments/dtscp75322/ethernet/1203/1/FCoE%20Multipathing%20and%20LAG_Oct2013.pdf

Brocade Storage connectivity

– http://www.brocade.com/downloads/documents/html_product_manuals/brocade-vcs-storage-dp/GUID-F0C36164-140C-452C-80D9-983A37101E07.html




Brocade VDX (6730)
fcoe - default settings
fcoe
 fabric-map default
 vlan 1002
 priority 3
 virtual-fabric 128
 fcmap 0E:FC:00
 max-enodes 64
 enodes-config local
 advertisement interval 8000
 keep-alive timeout
 !
 map default
 fabric-map default
 cee-map default
lldp
protocol lldp
 advertise dcbx-fcoe-app-tlv
 advertise dcbx-fcoe-logical-link-tlv
 advertise dcbx-tlv

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/fcoe-config.html
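
The Linux host side (what the Red Hat guide above walks through) boils down to roughly this on RHEL/CentOS, assuming a CNA that handles DCB in firmware; treat it as a sketch and check the guide for your particular NIC:

yum install -y fcoe-utils lldpad
# one config file per FCoE interface; eth4 is just an example name
cp /etc/fcoe/cfg-ethx /etc/fcoe/cfg-eth4
# with a hardware DCB CNA, set DCB_REQUIRED="no" in cfg-eth4
systemctl enable --now lldpad fcoe
# verify the interface logged in to the fabric and can see targets
fcoeadm -i
fcoeadm -t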