Dell / Force10 MXL Firmware Bugs

We recently overhauled our whole network because our 1Gbps network, running on unsupported eBay gear, wasn’t cutting it anymore.  I’ll go into more detail about the upgrade later, but for now I am going to focus on the access switches we chose: the Dell / Force10 MXL (DF10MXL).  We chose the Force10 MXL because it offers both 1 and 10Gbps server-side connectivity, 10 and 40Gbps uplinks, and FCoE, and it’s well priced.  With the exception of a couple of issues outlined below, they have been pretty decent switches.

Issue Number 1 – MAC addresses and memory leaks.

We started on firmware version 9.5.0.1.  It did not take us long to realize our VMware environment was a little too much for these switches.  With VMs, and consequently MAC addresses, being moved all over our network by vMotion, we started to have random IP address reachability issues, and occasionally switches would reboot.  We quickly learned that issuing the command “clear mac-address-table dynamic all” on the switches servicing the IP address in question resolved the issue and made the IP address reachable again.  After a little time on Google and browsing through Force10 documentation we found the following in the release notes for firmware version 9.6.0.0, the next release after 9.5.0.1.

Microcode (Resolved) (Resolved in version 9.6.0.0)
PR# 140496
Severity: Sev 2
Synopsis: System may experience memory leak when it learns new MAC addresses continuously.
Release Notes: When MAC addresses are learned continuously, the system may fail to release allocated memory if internal software processes are busy processing newly learned MAC addresses and may experience a reboot due to memory exhaustion.
Workaround: None

We found our issue!  Or so we thought.  At the time we did not have access to firmware version 9.6.0.0, so we looked in the archive for the latest release without this issue.  This led us to 9.5(0.0P2).  After a whole day of downgrading switches, 40 in total, our environment calmed down and our issues disappeared.  Yay!

Issue Number 2 – Running hot.

Five weeks later we started to notice some of our switches running extremely hot: 60-100 degrees Celsius, or 140-212 degrees Fahrenheit.  We were seeing a lot of syslog messages from these switches with reboot warnings but no actual reboots.  It didn’t take long for the reboots to start.  The four or five switches that were running in excess of 70 degrees Celsius started to reboot at random intervals.  After beating our way around Dell support we were able to get some answers.  Firmware version 9.5(0.0P2) contains a bug: it does not correctly report temperature / requested fan speed to the M1000e chassis.  The chassis were running at only 30% fan speed regardless of how hot the switches were getting.  As a temporary solution Dell pointed us to the RACADM Command Line Reference Guide found here.  Using this guide we were able to manually set the fan speed on our chassis to cool the switches.  Here is a post explaining exactly how to do that.  We settled on 65% fan speed, which kept the switches cool and the noise level down.
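The Celsius/Fahrenheit figures above are easy to sanity-check, and the same arithmetic can flag switches that are past the point where ours began rebooting.  This is only a sketch in Python: the switch names and readings are made up for illustration, and the 70 degree threshold is simply the level at which we saw reboots, not a documented Dell limit.

```python
# Threshold in degrees Celsius above which our switches started rebooting.
# This is an observation from our environment, not a vendor specification.
REBOOT_RISK_C = 70


def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def at_risk(readings, threshold_c=REBOOT_RISK_C):
    """Return the sorted names of switches reading above the threshold."""
    return sorted(name for name, temp in readings.items() if temp > threshold_c)


# Hypothetical readings in degrees Celsius; names and values are invented.
readings = {"mxl-a1": 62, "mxl-a2": 88, "mxl-b1": 71, "mxl-b2": 70}

for name in at_risk(readings):
    temp_c = readings[name]
    print(f"{name}: {temp_c} C / {c_to_f(temp_c):.0f} F")
```

How you collect the readings (SNMP, syslog, the chassis controller) is up to you; the sketch only covers the comparison.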

Issue Number 3 – Stack Formation.

FTOS 9.6.0.0 will not form a four-switch stack.  No documentation is available as to why.  When the 4th switch joins the stack, the 3rd and 4th switches kernel panic and reboot.

So… Force10 FTOS in a Nutshell.

  • 9.5(0.0P2) contains a bug that does not report temperature and/or requested fan speed correctly to the chassis, and as a result it runs too hot and reboots.
  • 9.5.0.1 doesn’t run hot but has MAC address mobility issues, which can apparently be worked around by enabling MAC Masquerading.  This is done with one simple command: “mac-address-table station-move refresh-arp”.  I am hesitant to take this route as we could still experience the memory leak issue noted above.
  • 9.6.0.0 is available and should resolve both of our issues, but I am beginning to wonder what other ‘features’ we may find in the latest release.
  • Update on 9.6.0.0: if you have more than 3 switches in a stack, the 4th switch will continuously reboot as it tries to join the stacked cluster.
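
To keep the trade-offs straight, the list above can be written down as a small lookup table.  This is a sketch in Python, not anything Dell provides; the issue labels are my own shorthand for the problems described in this post, and the memory leak on 9.5.0.1 is our suspicion based on the release notes rather than something we confirmed.

```python
# Known (or suspected) issues per FTOS release, as described in this post.
# The issue labels are invented shorthand, not Dell identifiers.
KNOWN_ISSUES = {
    "9.5.0.1": {"mac-mobility", "mac-learning-memory-leak"},
    "9.5(0.0P2)": {"fan-speed-not-reported"},
    "9.6.0.0": {"four-switch-stack-reboot"},
}


def usable_releases(dealbreakers):
    """Return releases that have none of the given issues, in table order."""
    return [
        release
        for release, issues in KNOWN_ISSUES.items()
        if not issues & dealbreakers
    ]


# Example: we cannot live with the fan bug or the stacking bug.
print(usable_releases({"fan-speed-not-reported", "four-switch-stack-reboot"}))
```

With both of those ruled out, the only release left in the table is 9.5.0.1, which is exactly the corner we found ourselves in.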

More to come.  For now the fans are hard set to 65% and here are some fun graphs to look at showing the temperatures before and after setting the fan speeds.

Operating Temperature Drop on Force10 MXL with 65% minimum fan speed.

MXL-Temps

Power Impact of 65% minimum fan speed.

PDU Power Monitoring

 

Update – 4/1/2015

  • Today is April 1st, 2015.  Dell recently released 9.7.0.0, and we tested it in a lab for a few weeks before throwing it into production.  FTOS / Dell OS 9.7.0.0 appears to resolve all of our issues.  The switches no longer run hot, the switches form a stack like they should, and we have not had any reboots.  I’ll follow up in a few weeks to let you know if we happen to have any issues.

Update – 6/9/2015

  • 9.7.0.0 is rock solid.  Use it.
