Lisle Solution Center Network Refurb part II

OK, in part one I went over the objectives I wanted to achieve. In part II I'm going to go over the resources at my disposal. There are aspects of this that are a compromise, but at the same time I decided to prioritize the things that mattered most to me. This is an interim environment that will carry us over to our 10G rollout. It is designed with a certain level of fault tolerance in mind, although it's not as high as I would have liked.

Physically I am starting out with 2 Dell 6248s and 2 3750G-24P switches connected to a set of 2 3750G-48P switches acting as the cores. Each switch has a single 1G aggregate in LACP. I have a single uplink to the main network.

Due to having 2 stack cables, I decided to stack the 3750Gs that were previously acting as standalone switches. The reasoning is that with only a single uplink to the main network, switch-level redundancy beyond the stack isn't really going to buy much. I am running 4 links from the 3750G-48Ps, which will be used for the ESXi cluster, to the 3750G-24Ps. Thanks to the stack cables, each pair of switches acts as a single larger switch. Because I have 2 uplinks from each of the 48 port switches to the 24 port switches, I can lose links or have a stacking cable fail on either side and remain up. It also means I am getting 8Gbps of bandwidth to the core switches, which is going to be more important when I roll out 10G networking through those cores. The net result is that the 3750G-48Ps will serve only as the ESXi cluster switches, and the Dell 6248s are eliminated from the present configuration entirely. They will be re-used at some later point.
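For the curious, here's a minimal sketch of what one of these cross-stack LACP aggregates could look like in Cisco IOS. The port-channel number, port numbers, and trunk settings are placeholders for illustration rather than my actual config:

! Cross-stack EtherChannel: one member port on each stack member, so a
! single member switch or a stack cable failure leaves the bundle up.
interface Port-channel1
 description Uplink to core stack
 switchport trunk encapsulation dot1q
 switchport mode trunk
!
interface GigabitEthernet1/0/49
 channel-group 1 mode active
!
interface GigabitEthernet2/0/49
 channel-group 1 mode active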

Logically, I'm really starting from scratch. There were some attempts at separating VLANs, but that's mangled, along with the vSwitch config in ESXi. I've decided to split the following items into separate VLANs, which is a fairly conservative plan but I think will give room to grow.

Non Routable:
-Internal /20 for people to configure networks on. I will probably allocate chunks to users on an IP plan of some form.
-/24 for managed power strips. Way overkill, but who cares? Easier than running out of IPs.
-/24 for switch management
-/24 for IPMI devices
-/24 for VMotion
-/24 for VMware management

Routable:
I have 3 /24s that are on the intranet. One of my goals with this refurb is to reclaim as much of that space as possible, simply because I'm not sure how difficult getting more space would be. I would rather have this space be used efficiently no matter the challenge, so pushing things like IPMI over to non-routable IPs is a priority for me.
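To make the plan concrete, here is a hypothetical sketch of the VLAN layout on the core switches in Cisco IOS. The VLAN IDs and names are made up for illustration; the point is that only the three intranet /24s ride the uplink, while everything else stays local:

! Non-routable, lab-only VLANs (IDs and names are illustrative)
vlan 100
 name USER-NETS
vlan 110
 name PDU-MGMT
vlan 120
 name SWITCH-MGMT
vlan 130
 name IPMI
vlan 140
 name VMOTION
vlan 150
 name VMWARE-MGMT
!
! Routable intranet VLANs; the only ones allowed on the uplink trunk
vlan 200
 name INTRANET-A
vlan 201
 name INTRANET-B
vlan 202
 name INTRANET-C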

The last thing I will mention is that I am documenting the physical configuration, port utilization, and the addressing (VLANs and IPs). This is probably the most critical step of this sort of reconfiguration. It's way easier for the next guy (who may be you) to troubleshoot a configuration when it's known where everything is. Documented environments also tend to scale a lot better.

Next time, I’ll cover what I’m doing in the ESXi configuration and on the switch side. This is the stuff with a big learning curve!

The number of home devices using WiFi that you don't even think about

I never really thought about it, but almost all electronics now have built-in wireless and are connected to your network: Blu-ray players, game consoles, phones, tablets, TVs, printers!

This is what my home network looks like right now.

home_network

Bandwidth Usage

With on-demand streaming, it's rare to even watch cable TV. I had no idea how much bandwidth I've been using. My ISP must hate me!

traffic_nov trffic_dec

5 Years of Sudo

Well, it's been 5 years since we started Sudo Make Install. I want to say it's been a rather big success, but I don't know how you judge a blog's success. We get comments on articles, consistently hold top search engine results for specific keywords, and see over 100,000 visits per month!

Here’s to another 5 years of IT.

Party People

Juniper SA (Junos Pulse) Multi User Authentication

We've had a Juniper SA700 for around 5 years now, and it has proven to be an absolutely brilliant bit of hardware. However, we've always had an issue where it would bump our connection if we signed in as the same user multiple times.

To get around this you can enable "Multiple User Sessions", which translates to multiple sessions per user.

On your main window, click "Authentication > Signing In" and check the box for "Enable Multiple User Sessions".

juniper_signing_in

Once selected, hit Save Changes. Now navigate to "Users > User Realms > [Users, or other Realm name] > Authentication Policy > Limits".

Change this value to some sane number; you don't want your system being tied up with dead connections. We've opted for 5.

juniper_limits


Now click save and enjoy multiple connections!

Dell / Force10 MXL Firmware Bugs

We recently overhauled our whole network, as our 1Gbps network running on unsupported eBay gear wasn't cutting it anymore. I'll go into more detail regarding the upgrade later, but for now I am going to focus on the access switches that we chose: the Dell / Force10 MXL (DF10MXL)! We chose the Force10 MXL because it offers both 1 and 10Gbps server-side connectivity, 10 and 40Gbps uplinks, "FCoE", and it's well priced! However, with the exception of a couple of issues outlined below, they have been pretty decent switches.

Issue Number 1 – MAC addresses and memory leaks.

We started on firmware version 9.5.0.1. It did not take us long to realize our VMware environment was a little too much for these switches. With VMs, and consequently MAC addresses, being moved all over our network by VMotion, we started to have random IP address reachability issues, and, rarely, a switch would reboot. We quickly learned that issuing the command "clear mac-address-table dynamic all" on the switches servicing the IP address in question resolved the issue and the IP address was again reachable. After a little time on Google and browsing through Force10 documentation, we found the following in the release notes for firmware version 9.6.0.0, the next release after 9.5.0.1.

Microcode (Resolved) (Resolved in version 9.6.0.0)
PR# 140496
Severity: Sev 2
Synopsis: System may experience memory leak when it learns new MAC addresses continuously.
Release Notes: When MAC addresses are learned continuously, the system may fail to release allocated memory if internal software processes are busy processing newly learned MAC addresses and may experience a reboot due to memory exhaustion.
Workaround: None

We found our issue! ...or so we thought. At the time we did not have access to firmware version 9.6.0.0, so we looked in the archive for the latest release without this issue. That led us to 9.5(0.0P2). After a whole day of downgrading switches, 40 in total, our environment calmed down and our issues disappeared. Yay!
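Side note: while things were still flaky, the "clear mac-address-table dynamic all" workaround had to be run on every switch serving an affected IP. Assuming your switches will execute a command passed over SSH, a quick loop like this saves typing it by hand on each one; the hostnames and username are placeholders, not our actual environment:

# Hypothetical sketch: flush the dynamic MAC table on a list of MXLs.
# Replace the hostnames and the username with your own.
for sw in mxl-a1 mxl-a2 mxl-b1 mxl-b2; do
  ssh admin@"$sw" "clear mac-address-table dynamic all"
done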

Issue Number 2 – Running hot.

Five weeks later we started to notice some of our switches running extremely hot: 60-100 degrees Celsius (140-212 degrees Fahrenheit). We were seeing a lot of syslog messages from these switches with reboot warnings but no actual reboots. It didn't take long for the reboots to start. The four to five switches that were running in excess of 70 degrees Celsius started to reboot at random intervals. After beating our way around Dell support, we were able to get some answers. Firmware version 9.5(0.0P2) contains a bug that does not correctly report temperature / requested fan speed to the M1000e chassis. The chassis were only running at 30% fan speed regardless of how hot the switches were getting. As a temporary solution, Dell pointed us to the RACADM Command Line Reference Guide found here. Using this guide we were able to manually set the fan speed on our chassis to cool the switches. Here is a post explaining exactly how to do that. We settled on 65% fan speed, which kept the switches cool and the noise level down.
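That comes down to a single RACADM command per chassis CMC (covered in more detail in the M1000e post below), with 65 being the floor we settled on:

racadm config -g cfgThermal -o cfgThermalMFSPercent 65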

Issue Number 3 – Stack Formation.

FTOS 9.6.0.0 will not form a 4-switch stack, and no documentation is available as to why. When the 4th switch joins the stack, the 3rd and 4th switches kernel panic and reboot.

So… Force10 FTOS in a Nutshell.

  • 9.5(0.0P2) contains a bug that does not report temperature and/or requested fan speed correctly to the chassis; as a result the switches run too hot and reboot.
  • 9.5.0.1 doesn't run hot but has MAC address mobility issues, which can apparently be worked around by enabling MAC Masquerading. This is done with one simple command, "mac-address-table station-move refresh-arp" (see the sketch after this list). I am hesitant to take this route as we could still experience the memory leak issue noted above.
  • 9.6.0.0 is available and should resolve both of our issues, but I am beginning to wonder what other 'features' we may find in the latest release.
  • Update on 9.6.0.0: if you have more than 3 switches in a stack, the 4th switch will continuously reboot as it tries to join the stacked cluster.
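For reference, enabling MAC Masquerading would look roughly like this from the FTOS CLI. This is a sketch based on the command above, not something we have run in production:

! Sketch: enable ARP refresh on station moves (MAC Masquerading)
configure
 mac-address-table station-move refresh-arp
 exit
write memory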

More to come. For now the fans are hard set to 65%, and here are some fun graphs showing the temperatures before and after setting the fan speeds.

Operating Temperature Drop on Force10 MXL with 65% minimum fan speed.

MXL-Temps

Power Impact of 65% minimum fan speed.

PDU Power Monitoring


Update – 4/1/2015

  • Today is April 1st of 2015. Dell recently released 9.7.0.0, and we tested it in a lab for a few weeks before throwing it into production. FTOS / Dell OS 9.7.0.0 appears to resolve all of our issues: the switches no longer run hot, they form a stack like they should, and we have not had any reboots. I'll follow up in a few weeks to let you know if we happen to have any issues.

Update – 6/9/2015

  • 9.7.0.0 is rock solid.  Use it.

Dell M1000e – Manually Configure Minimum Fan Speed

You can manually configure the minimum fan speed of the M1000e so that the chassis maintains a lower operating temperature.

SSH to the CMC using its IP address on port 22. The user will be root with password calvin unless changed.
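For example, with a placeholder CMC address:

ssh root@192.168.0.120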

Then run:

racadm config -g cfgThermal -o cfgThermalMFSPercent 75

This will set the minimum fan speed to 75%. You can set it anywhere from 0-100%. Obviously 0% behaves more like 35%, but you won't be able to tell the difference.

You can view the requested fan speed by the servers in the chassis by running:

racadm getfanreqinfo

Example:

[Server Module Fan Request Table]
<Slot#>   <Server Name>   <Blade Type>       <Power State>  <Presence>   <Fan Request%>
1         s2086.corp      PowerEdgeM610      ON             Present      48
2         s2087.corp      PowerEdgeM610      ON             Present      48
3         s2088.corp      PowerEdgeM610      ON             Present      48

[Switch Module Fan Request Table]
<IO>      <Name>                           <Type>             <Presence>   <Fan Request%>
Switch-1  MXL 10/40GbE                     10 GbE KR          Present      30
Switch-2  MXL 10/40GbE                     10 GbE KR          Present      30
Switch-3  N/A                              None               Not Present  N/A
Switch-4  N/A                              None               Not Present  N/A
Switch-5  N/A                              None               Not Present  N/A
Switch-6  N/A                              None               Not Present  N/A

[Minimum Fan Speed %]
65

Lisle Solution Center Network Refurb part I

Long time no post! After I became a field tech for EMC, my posting went by the wayside. Amazing how life goes, isn't it? Well, just in time to not be a New Year's resolution, I'm going to start posting some more stuff up at SMI. I have a ton of new material thanks to becoming Manager of the Lisle Solution Center. The LSC is a lab located at the Lisle EMC office that we actively demo from. It has a myriad of infrastructure, from VPLEX to VMAX to XtremIO, and software such as ViPR and VMware Site Recovery Manager. The initial architecture has fallen victim to the atrophy that happens when an environment is continuously built on and adapted. My goals when designing a network are as follows:

Cleanliness-This is a biggie. Few like to work on a network that's a rat's nest and poorly organized, and fewer still want anything to do with rebuilding it. Lack of standardization becomes a huge issue when service comes into play. We're actually going to be re-racking this environment at some point into a rack with better cable management (we need power for the new rack first), so getting the baseline in place is a big priority.

Standardization-There will always be some exceptions to standardization, but having it start to finish is a great thing. This means that when I'm done, everything from the order of the ports on the servers to the way the vSwitches are configured will have a method to it. This also extends to procedural standardization, meaning host adds will follow the same steps from start to end.

Documentation-Documentation is one of the critical steps of a build out that's often overlooked. Ultimately I want the whole environment documented, but with networking being foundational, it's a very good place to start. Since I'm in charge, it will also make life easier on anyone who works in the LSC.

Lean-Less can easily be more. Some of the key purposes of this network restructure are to reduce needless switch count and eliminate hundreds of feet of cabling. I know there are many, many environments that could do the same very easily.

Reliable-Although redundancy of design is key, reliability to me is more than that; it encompasses the aspects above as well. If you don't know what goes where and can't troubleshoot easily, it becomes far easier for human error to come into play.


Next time I will go over the initial environment, the resources at my disposal and the changes I’m looking to make. Stay Tuned!