Linux FCoE + Dell Force10 MXL + Brocade VDX Switches + EMC VNX

Huge write-up coming.

 

Unless you’re a Cisco (Nexus) shop end-to-end, there are a few design considerations you need to take into account when it comes to delivering FCoE to your Dell blade servers.

Things to Consider:

-How is the FC from your storage array being encapsulated into Ethernet?

Some storage arrays allow for the direct export of FCoE.  Some storage arrays have only FC connectivity options.  In that case you will need a device to encapsulate FC into FCoE, an FCF (Fibre Channel Forwarder).  Some example FCF devices are the Brocade VDX 6740, Brocade VDX 6730, and Cisco Nexus 5000.

-Are you running some vendor proprietary fabric that allows for multi-hop FCoE like FabricPath or VCS?

If so, great!  If not, you’re going to have a fun time attempting to forward FCoE beyond the first switch. (Here is a blog explaining those options.)

-Are your servers connected to a true Fibre Channel Forwarder (FCF) access switch, or are they connected to a FIP (FCoE Initialization Protocol) Snooping access bridge (switch)?

FIP Snooping Bridges (FSB) vs Fibre Channel Forwarders (FCF): An FSB must connect to an FCF in order for FCoE to function.  An FCF is an FSB that also provides FC services, such as the name server, as well as FC/FCoE encapsulation.

-If you are using FIP Snooping accesses switches, how are these switches multi-homed?

-FIP Snooping Bridges carrying FCoE cannot be multi-homed to more than one FCF by any means.  No vLAG, mLAG or any other type of split chassis LACP, no spanning-tree, no dual-homing, period.

-FIP Snooping Bridges can, in some cases, connect to a single FCF using multiple links bundled in a standard LACP LAG.

-How are your servers multi-homed?

Servers cannot be connected to a pair of FCFs using vLAG or mLAG.  Servers also cannot be connected to a stack or pair of FSBs using vLAG or mLAG.

What We Have Done:

We have implemented three different designs, and all of them have their benefits and drawbacks.  This is our attempt to explain them and show you how to configure them.

Build 1.  Brocade VDX switches configured in a logical chassis cluster (VCS) providing FC to FCoE encapsulation as well as access to a multi-homed server using round-robin load balancing, not LACP.

Pros: In a perfect world this is how everything would work.  Redundancy without any extra links and minimal configuration.  Completely converged.

Cons: Dell and Brocade have not come together to build a VDX switch for the M1000e chassis yet.

Notes: You could use 10G pass-through modules in the back of the chassis to connect directly to VDX switches, but that’s at least 96 fibers for a 3-chassis rack and 128 for a 4-chassis rack.

FC-VCS-Server

Build 2. Brocade VDX switches configured in a logical chassis cluster (VCS) providing FC to FCoE encapsulation.  VDX switches connected to Dell MXL switches using vLAGs as well as a dedicated FCoE link per switch.  Each server is then multi-homed to a pair of MXL switches using round-robin load balancing.

Pros: Redundant.  Converged-ish.

Cons: Complicated.  More vendors.  There are 4 places in this network where a failure could result in exactly half of your storage paths being lost.  May* require use of Uplink Failure Detection on the FSBs to properly fail FCoE after the failure of an FCF.

Notes: FCoE links between the VDX and MXL cannot be multi-homed like the data path.  FCoE links can be bundled into a LACP LAG to provide additional bandwidth, but there are specific rules regarding which port groups on the switches you can and cannot use.

FC-VCS-FCOE

Build 3. EMC VNX directly injecting FCoE into Brocade VDX switches configured in a logical chassis cluster (VCS).  VDX switches connected to Dell MXL switches using a single link for data and FCoE.  Each server is then multi-homed to a pair of MXL switches using round-robin load balancing.  This same idea could be applied if the storage were FC only, as the VDXs will do the encapsulation.

Pros: Converged.

Cons: More vendors.  There are 4 places in this network where a failure could result in exactly half of your storage paths being lost.  May* require use of Uplink Failure Detection on the FSBs to properly fail FCoE after the failure of an FCF.  Data path redundancy is lost.

Notes: This is an older design using VDX 6730s, which are now end-of-life.  The 6730s do not allow FCoE to traverse the TRILL fabric, thus each path from the storage array to the server is completely isolated to either side of the network.

FCOE-VDX-MXL-Server

 

Configuration:

All of these configurations assume the Brocade VCS fabric is already built and using all default FCoE settings, maps, VLANs, etc.

Build 1

  Brocade VDX interfaces connecting to storage array exporting FCoE.

interface TenGigabitEthernet 1/0/1
mtu 9216
no fabric isl enable
no fabric trunk enable
switchport
switchport mode trunk
switchport trunk allowed vlan all
switchport trunk tag native-vlan
spanning-tree shutdown
fcoeport default
no shutdown

Brocade VDX interfaces connecting to storage array exporting FC.

interface FibreChannel 1/0/1
no isl-r_rdy
trunk-enable
fec-enable
no shutdown

Brocade VDX interfaces connecting to server.

interface TenGigabitEthernet 1/0/2
mtu 9216
no fabric isl enable
no fabric trunk enable
switchport
switchport mode trunk
switchport trunk allowed vlan all
switchport trunk tag native-vlan
spanning-tree shutdown
fcoeport default
no shutdown
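
The round-robin load balancing on the server side is plain dm-multipath rather than LACP.  A minimal /etc/multipath.conf sketch (illustrative defaults only; the VNX-specific device section is omitted):

defaults {
    user_friendly_names yes
    path_grouping_policy multibus
    path_selector        "round-robin 0"
}

multipath -ll should then show every FCoE path to a LUN grouped into a single round-robin path group.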

Build 2.

  Brocade VDX interfaces connecting to storage array exporting FCoE.

interface TenGigabitEthernet 1/0/1
mtu 9216
no fabric isl enable
no fabric trunk enable
switchport
switchport mode trunk
switchport trunk allowed vlan all
switchport trunk tag native-vlan
spanning-tree shutdown
fcoeport default
no shutdown

Brocade VDX interfaces connecting to storage array exporting FC.

interface FibreChannel 1/0/1
no isl-r_rdy
trunk-enable
fec-enable
no shutdown

Brocade VDX vLAG interface connecting to Dell MXL LAG to provide the data path.

interface Port-channel 1
vlag ignore-split
mtu 9216
switchport
switchport mode trunk
switchport trunk allowed vlan all
switchport trunk tag native-vlan
spanning-tree shutdown
no shutdown

Dell MXL LAG connecting to Brocade VDX vLAG to provide the data path (the port-channel number is an example).

interface Port-channel 1
no ip address
mtu 12000
portmode hybrid
switchport
no shutdown

Brocade VDX interface connecting to Dell MXL interface to provide FCoE (this is the dedicated FCoE link, a separate physical port from both the storage-facing interface and the data-path vLAG).

interface TenGigabitEthernet 1/0/1
mtu 9216
no fabric isl enable
no fabric trunk enable
switchport
switchport mode trunk
switchport trunk allowed vlan none
switchport trunk tag native-vlan
spanning-tree shutdown
fcoeport default
no shutdown

Dell MXL interface connecting to Brocade VDX interface to provide FCoE

interface TenGigabitEthernet 0/52
no ip address
mtu 12000
portmode hybrid
switchport
fip-snooping port-mode fcf
!
protocol lldp
no advertise dcbx-tlv ets-reco
dcbx port-role auto-upstream
no shutdown

Dell MXL VLAN configuration.

interface Vlan 1002
no ip address
mtu 2500
tagged TenGigabitEthernet 0/1-32,41-52
fip-snooping enable
no shutdown

Dell MXL feature configuration.

dcb-map FLEXIO_DCB_MAP_PFC_OFF
no pfc mode on
!
feature fip-snooping
fip-snooping enable
!
protocol lldp

Dell MXL interface connecting to Server.

interface TenGigabitEthernet 0/1
no ip address
mtu 12000
portmode hybrid
switchport
spanning-tree pvst edge-port bpduguard
!
protocol lldp
dcbx port-role auto-downstream
no shutdown
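
Once a server logs in, the MXL (acting as the FSB) should see the upstream FCF and the ENode sessions.  Useful FTOS checks (exact output varies by firmware):

show fip-snooping fcf
show fip-snooping enode
show fip-snooping sessions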

 Build 3.

Brocade VDX to Dell MXL
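
These mirror the Build 2 configs, collapsed onto a single converged link.  A sketch, assuming 1/0/1 is the VDX port facing the MXL (note the allowed-vlan all so the data VLANs ride the same link as FCoE):

interface TenGigabitEthernet 1/0/1
mtu 9216
no fabric isl enable
no fabric trunk enable
switchport
switchport mode trunk
switchport trunk allowed vlan all
switchport trunk tag native-vlan
spanning-tree shutdown
fcoeport default
no shutdown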

Dell MXL to Brocade VDX
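
Same idea on the MXL side: the uplink is both the data trunk and the FCF-facing FIP snooping port.  A sketch assuming 0/52 faces the VDX; the VLAN 1002 and global fip-snooping configuration from Build 2 apply unchanged:

interface TenGigabitEthernet 0/52
no ip address
mtu 12000
portmode hybrid
switchport
fip-snooping port-mode fcf
!
protocol lldp
no advertise dcbx-tlv ets-reco
dcbx port-role auto-upstream
no shutdown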

Dell MXL to server
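
The server-facing ports are the same as in Build 2; repeated here for completeness with 0/1 as an example port:

interface TenGigabitEthernet 0/1
no ip address
mtu 12000
portmode hybrid
switchport
spanning-tree pvst edge-port bpduguard
!
protocol lldp
dcbx port-role auto-downstream
no shutdown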

 

Brocade FCoE to FCF Deployment Guide – http://community.brocade.com/dtscp75322/attachments/dtscp75322/ethernet/1203/1/FCoE%20Multipathing%20and%20LAG_Oct2013.pdf

Brocade Storage Connectivity – http://www.brocade.com/downloads/documents/html_product_manuals/brocade-vcs-storage-dp/GUID-F0C36164-140C-452C-80D9-983A37101E07.html




Brocade VDX (6730) FCoE default settings:

fcoe
 fabric-map default
  vlan 1002
  priority 3
  virtual-fabric 128
  fcmap 0E:FC:00
  max-enodes 64
  enodes-config local
  advertisement interval 8000
  keep-alive timeout
 !
 map default
  fabric-map default
  cee-map default

LLDP default settings:

protocol lldp
 advertise dcbx-fcoe-app-tlv
 advertise dcbx-fcoe-logical-link-tlv
 advertise dcbx-tlv

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/fcoe-config.html
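
For the Linux host side that the title refers to, the software FCoE initiator setup from that Red Hat guide boils down to roughly the following.  This is a sketch assuming the CNA shows up as eth2 and that lldpad handles DCB; set DCB_REQUIRED="no" in the cfg file if the adapter does DCB in firmware.

yum install fcoe-utils lldpad
cp /etc/fcoe/cfg-ethx /etc/fcoe/cfg-eth2    # FCOE_ENABLE="yes", DCB_REQUIRED="yes"
service lldpad start
dcbtool sc eth2 dcb on                      # turn DCB on for the interface
dcbtool sc eth2 app:fcoe e:1                # enable the FCoE application priority
service fcoe start
fcoeadm -i                                  # verify the FCoE interface and fabric login
chkconfig lldpad on
chkconfig fcoe on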

Ceph Scrubbing Impact on Client IO and Performance

Ceph’s default IO priority and class for behind-the-scenes disk operations effectively treat them as required work rather than best-effort. Those of us who actually utilize our storage for services that require performance will quickly find that deep scrub grinds even the most powerful systems to a halt.

Below are the settings to run the scrub at the lowest possible priority. This REQUIRES CFQ as the scheduler for the spinning disks; without CFQ you cannot prioritize IO. Since only one service utilizes these disks, CFQ performance will be comparable to deadline and noop.
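
To check which scheduler an OSD data disk is using, and to flip it to CFQ if needed (sdb is just an example device; use a udev rule or the elevator=cfq kernel parameter to make the change persistent across reboots):

cat /sys/block/sdb/queue/scheduler
echo cfq > /sys/block/sdb/queue/scheduler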

Inject the new settings for the existing OSD:
ceph tell osd.* injectargs '--osd_disk_thread_ioprio_priority 7'
ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle'
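
To confirm the new values took effect, query an OSD's admin socket on its storage node (osd.0 as an example):
sudo ceph daemon osd.0 config show | grep ioprio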

Edit ceph.conf on your storage nodes so the priority is set automatically whenever the daemons start (shown under the [osd] section):
[osd]
#Reduce impact of scrub.
osd_disk_thread_ioprio_class = "idle"
osd_disk_thread_ioprio_priority = 7

You can go a step further and set up Red Hat's optimizations for these system characteristics.
tuned-adm profile latency-performance
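To confirm which profile is active:
tuned-adm active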
This information was compiled from multiple sources.

Reference documentation.
http://dachary.org/?p=3268

Disable scrubbing in realtime to determine its impact on your running cluster.
http://dachary.org/?p=3157

A detailed analysis of the scrubbing io impact.
http://blog.simon.leinen.ch/2015/02/ceph-deep-scrubbing-impact.html

OSD Configuration Reference
http://ceph.com/docs/master/rados/configuration/osd-config-ref/

Redhat system tuning.
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Performance_Tuning_Guide/sect-Red_Hat_Enterprise_Linux-Performance_Tuning_Guide-Tool_Reference-tuned_adm.html

Quick and Dirty Ceph Deployment

Replace the disk names and SSD device name.  This will build a Ceph cluster with two copies of each object in about 5 minutes.

ceph-deploy purge ceph0-mon0 ceph0-mon1 ceph0-mon2 ceph0-node0 ceph0-node1
ceph-deploy purgedata ceph0-mon0 ceph0-mon1 ceph0-mon2 ceph0-node0 ceph0-node1
ceph-deploy forgetkeys


ceph-deploy new ceph0-mon0 ceph0-mon1 ceph0-mon2

echo "osd pool default size = 2" >> ~/ceph.conf
echo "public network = 10.1.8.0/22" >> ~/ceph.conf
echo "cluster network = 10.1.12.0/22" >> ~/ceph.conf
echo "osd journal size = 12000" >> ~/ceph.conf

ceph-deploy install ceph0-mon0 ceph0-mon1 ceph0-mon2 ceph0-node0 ceph0-node1
ceph-deploy mon create-initial

ceph-deploy admin ceph0-mon0 ceph0-mon1 ceph0-mon2 ceph0-node0 ceph0-node1

sudo chmod +r /etc/ceph/ceph.client.admin.keyring

ceph-deploy disk zap ceph0-node0:/dev/oczpcie_4_0_ssd
ceph-deploy disk zap ceph0-node0:/dev/sdb
ceph-deploy disk zap ceph0-node0:/dev/sdc
ceph-deploy disk zap ceph0-node0:/dev/sdd
ceph-deploy disk zap ceph0-node0:/dev/sde
ceph-deploy disk zap ceph0-node0:/dev/sdf
ceph-deploy disk zap ceph0-node0:/dev/sdg
ceph-deploy disk zap ceph0-node0:/dev/sdh
ceph-deploy disk zap ceph0-node0:/dev/sdi
ceph-deploy disk zap ceph0-node0:/dev/sdj
ceph-deploy disk zap ceph0-node0:/dev/sdk
ceph-deploy disk zap ceph0-node0:/dev/sdl
ceph-deploy disk zap ceph0-node0:/dev/sdm

ceph-deploy disk zap ceph0-node1:/dev/oczpcie_4_0_ssd
ceph-deploy disk zap ceph0-node1:/dev/sdb
ceph-deploy disk zap ceph0-node1:/dev/sdc
ceph-deploy disk zap ceph0-node1:/dev/sdd
ceph-deploy disk zap ceph0-node1:/dev/sde
ceph-deploy disk zap ceph0-node1:/dev/sdf
ceph-deploy disk zap ceph0-node1:/dev/sdg
ceph-deploy disk zap ceph0-node1:/dev/sdh
ceph-deploy disk zap ceph0-node1:/dev/sdi
ceph-deploy disk zap ceph0-node1:/dev/sdj
ceph-deploy disk zap ceph0-node1:/dev/sdk
ceph-deploy disk zap ceph0-node1:/dev/sdl
ceph-deploy disk zap ceph0-node1:/dev/sdm

ceph-deploy osd prepare ceph0-node0:/dev/sdb:/dev/oczpcie_4_0_ssd
ceph-deploy osd prepare ceph0-node1:/dev/sdb:/dev/oczpcie_4_0_ssd

ceph-deploy osd prepare ceph0-node0:/dev/sdc:/dev/oczpcie_4_0_ssd
ceph-deploy osd prepare ceph0-node1:/dev/sdc:/dev/oczpcie_4_0_ssd

ceph-deploy osd prepare ceph0-node0:/dev/sdd:/dev/oczpcie_4_0_ssd
ceph-deploy osd prepare ceph0-node1:/dev/sdd:/dev/oczpcie_4_0_ssd

ceph-deploy osd prepare ceph0-node0:/dev/sde:/dev/oczpcie_4_0_ssd
ceph-deploy osd prepare ceph0-node1:/dev/sde:/dev/oczpcie_4_0_ssd

ceph-deploy osd prepare ceph0-node0:/dev/sdf:/dev/oczpcie_4_0_ssd
ceph-deploy osd prepare ceph0-node1:/dev/sdf:/dev/oczpcie_4_0_ssd

ceph-deploy osd prepare ceph0-node0:/dev/sdg:/dev/oczpcie_4_0_ssd
ceph-deploy osd prepare ceph0-node1:/dev/sdg:/dev/oczpcie_4_0_ssd

ceph-deploy osd prepare ceph0-node0:/dev/sdh:/dev/oczpcie_4_0_ssd
ceph-deploy osd prepare ceph0-node1:/dev/sdh:/dev/oczpcie_4_0_ssd

ceph-deploy osd prepare ceph0-node0:/dev/sdi:/dev/oczpcie_4_0_ssd
ceph-deploy osd prepare ceph0-node1:/dev/sdi:/dev/oczpcie_4_0_ssd

ceph-deploy osd prepare ceph0-node0:/dev/sdj:/dev/oczpcie_4_0_ssd
ceph-deploy osd prepare ceph0-node1:/dev/sdj:/dev/oczpcie_4_0_ssd

ceph-deploy osd prepare ceph0-node0:/dev/sdk:/dev/oczpcie_4_0_ssd
ceph-deploy osd prepare ceph0-node1:/dev/sdk:/dev/oczpcie_4_0_ssd

ceph-deploy osd prepare ceph0-node0:/dev/sdl:/dev/oczpcie_4_0_ssd
ceph-deploy osd prepare ceph0-node1:/dev/sdl:/dev/oczpcie_4_0_ssd

ceph-deploy osd prepare ceph0-node0:/dev/sdm:/dev/oczpcie_4_0_ssd
ceph-deploy osd prepare ceph0-node1:/dev/sdm:/dev/oczpcie_4_0_ssd
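
Once the last OSD is prepared and comes up, a quick sanity check from any node with the admin keyring:

ceph -s
ceph osd tree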