Friday, October 15, 2010

A Comparison of Current Spanning-Tree Elimination Strategies

As I mentioned in the last post, I attended the Net Tech Field Day event hosted by Gestalt IT in September. My focus in attending was on Data Center switching technologies. Of particular interest to me were the methods by which each vendor is attempting to eliminate spanning-tree from the data center. While I have been keeping my eye on TRILL and 802.1aq, I am more interested in how vendors are solving this issue today.
All of the current solutions can be described as Multi-Chassis Link Aggregation (MLAG) methods. Cisco has three solutions available for this purpose. The 3750 and 2975 switches perform chassis aggregation via proprietary stacking cables. This stacking feature allows a network engineer to create a single switch out of multiple physical devices. All devices in a stack are managed via a single control plane. Cisco’s 6500 series switches have a similar feature, called Virtual Switching System or VSS, which uses standard 10Gb interfaces to achieve the same result. At the current time, VSS is limited to aggregating two chassis, but Cisco’s goal is to extend this to more devices. On the Nexus 7k and 5k platforms, the virtual Port-Channel (vPC) feature allows two physical devices to be logically paired together to present a common switching platform to connected devices. The important difference between vPC and the Stacking/VSS methods is that the control planes of the vPC devices remain separate.
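To give a flavor of the vPC model, here is a minimal configuration sketch for one of the two Nexus peers. The domain ID, addresses, and interface numbers are my own illustrative choices, not from any production design:

```
! Nexus peer A -- minimal vPC sketch; numbering is assumed for illustration
feature vpc

vpc domain 10
  peer-keepalive destination 10.0.0.2 source 10.0.0.1

! The peer-link carries state synchronization between the two chassis
interface port-channel1
  switchport mode trunk
  vpc peer-link

! A downstream device bundling one link to each peer sees a single switch
interface port-channel20
  switchport mode trunk
  vpc 20
```

The second peer mirrors this configuration. Note that each chassis keeps its own control plane; only the peer-keepalive and peer-link tie the pair together.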
Juniper and HP both described their visions of a single control plane for the data center. Juniper went into great detail about their stacking technology (called Virtual Chassis) for fixed-configuration switches, as well as their standard Ethernet-based method for interconnecting modular switches. HP was less technical in their presentation. My best guess is that they have a VSS-style Ethernet interconnection method.
Force10’s VirtualScale technology combines the control planes of two or more switches to offer MLAG. The connections between the switches are standard 1Gb or 10Gb links.
According to Arista Networks, their MLAG solution can pair two switches into a single logical switch. It isn’t clear from the documentation whether this feature combines the two control planes or keeps them separate. The configuration documentation is behind a paywall :(
Here’s a table of the vendors and where their solutions reside:
                    Proprietary Stacking   1/10Gb Stacking   Separated Control Planes
Arista Networks                                                     X (unclear)
Cisco Systems               X                     X                     X
Force10                                           X
Hewlett-Packard             ?               X (I think)
Juniper                X (I think)                X

My Thoughts

I am not yet comfortable with combining control planes in a data center environment. I much prefer Cisco’s vPC method of spanning-tree elimination over the stacking and VSS methods. There are several factors that contribute to this point of view. First, I was bitten by a VSS bug about 18 months ago. I suppose I should chalk that up to being an early adopter, but I guess I hold a grudge :)
Second, the shared-fate aspect of a single control plane makes me uncomfortable. When I strive to eliminate single points of failure in the data center, I look for the following items:
  1. Single-Attached Servers – If a server owner chooses to take this risk, I am not responsible for the impact of a switch or cable failure.
  2. Port-Channel Diversity – I work to ensure that single-device to single-device port-channels are built using separate modules on chassis-based switches. I also attempt to diversify the cable paths. For example, I’ll run one cable of a port-channel up the left side of a rack, and the diverse cable up the right side. If the opportunity presents itself, I’ll utilize a mix of copper and fiber in a single port-channel for an extra level of comfort, although I’ll admit that this is excessive in typical Data Centers.
  3. Power Diversity for Paired Switches – When two switches are configured as a pair (for example, when individual servers are connected to both switches), I ensure that they are powered by different PDUs, or are at least on different UPSs. If separate UPSs are unavailable, it is preferable not to have the second switch on a UPS at all. To look at it another way, I’d rather have a single switch up for 30 minutes than a pair of switches up for 15 minutes. While I haven’t implemented this idea in my data centers, I am intrigued by it as a method for reducing the load on our Data Center UPSs. (The same goes for servers performing duplicate functions, if sysadmins are still reading this.)
  4. Control-Plane Diversity – If a single reload command can take down my entire data center (even momentarily), I don’t quite have diversity. I’ve heard the “Operator error is the cause of most IT downtime” mantra often enough for it to have sunk in, at least a bit. If the reload command doesn’t concern you, think about how a simple configuration error would no longer be isolated to a single switch.
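On a chassis-based Catalyst, the module diversity described in item 2 looks something like this (slot and port numbers are illustrative assumptions):

```
! Members of port-channel 10 deliberately placed on two different line cards
interface TenGigabitEthernet1/1
 channel-group 10 mode active
!
interface TenGigabitEthernet5/1
 channel-group 10 mode active
!
interface Port-channel10
 switchport mode trunk
```

With LACP (mode active) negotiating the bundle, the failure of either line card leaves the port-channel up on the surviving member.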
I’ll stop the list here, but there are probably many others I haven’t listed. Feel free to mention your favorites in the comments and I’ll add them here with appropriate credit.

11 comments:

@JezAtHP said...

Hi Jeremy,

Great Article. The technical WhitePaper for the HP Intelligent Resilient Framework can be found at http://www.hp.com/networking/IRF .

Rapid failover in under 50 milliseconds; virtualize up to 9 stackable switches, or 2 chassis switches today (more in the future), with geographic resiliency.

All using industry-standard 10GbE links.

@JezAtHP

juecker said...

Hi Jeremy,

Great article, but it led me to some questions. It seems that you go to great lengths to eliminate the possibility of single points of failure or a single event causing an outage (fiber hits within a rack), which is great. I'm curious, what has caused downtime for you in the past? In my experience, it's usually people, but I'm wondering what you've run into.

Jacob
@juecker

Jeremy Filliben said...

juecker,

I agree, people are the primary cause. I've also seen loops caused by bad fiber and OS issues. Long ago I was bitten at least once by a server configured for bridging causing a loop. UDLD and BPDU Guard will prevent most of these issues now.
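For reference, the protections I mentioned are just a few interface commands on Catalyst switches (interface numbering is an assumption):

```
! Server-facing access port: BPDU Guard err-disables the port if the
! server starts bridging; UDLD catches unidirectional link failures
interface GigabitEthernet1/0/1
 switchport mode access
 spanning-tree portfast
 spanning-tree bpduguard enable
 udld port aggressive
```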

There was also the time when I tripped over the power cords for both of our core FDDI switches. But I suppose this goes under the "people" heading :)

sri said...

Juniper's Virtual Chassis technology on the EX8200 platform is the only solution in the industry that truly solves the 'split brain' scenario.
It can only do this by externalizing the control plane from the data plane elements on a scalable, robust platform and by providing multiple connectivity options that prevent any single failure from bringing down the entire system / aggregation layer.
All other solutions (VSS, vPC, SMLT, etc.) will lead to a split-brain scenario when the link between the two chassis goes down, unlike Juniper's VC solution.

abnerg said...

Hi Jeremy,

As a prior commenter noted, the external route engines on Juniper's EX 8200 mitigate the split brain risk of a connection failure between the chassis. We recently published a white paper that goes into more detail on that architecture.

Deploying Virtual Chassis Technology on the EX8200 Line of Ethernet Switches

As always, let us know if you have other questions and thanks for continuing to look into these networking challenges.

Cheers,
Abner

Geert said...

Hi Jeremy,

Great article. I am interested in the VSS bug that bit you, because i am running VSS in our datacenter :-) and don't want to get bitten by the same bug :-)

Jeremy Filliben said...

Geert,

Unfortunately I don't recall the bug ID. IIRC, the issue was that ports on the secondary chassis (the one without the active supervisor) would not pass traffic. It was fairly catastrophic, but also easily noticeable during testing. We were running a very early version of SXH (maybe SXH or SXH2).

Anonymous said...

Good post. Just to add to your non-UPS point, I've experienced two full data centre outage incidents in the past which were directly caused by UPS failure (and I mean immediate hard lights-out failure, not just running out of juice).

Jeremy Filliben said...

ollyg,

Sadly enough, I've seen the same thing. I am moving towards the camp that says full-site UPSs are less than desirable. Smaller rack-level or even single-device UPSs on critical devices would likely improve uptime when compared to the risks of a "whole room" UPS.

Jeremy

parkprimus said...

Funny how you only comment on the questions that are Cisco-centric. It seems to me that you are just a Cisco junkie repeating things that you have read. After having seen the Nexus gear and the 8200s in a VC with the XRE200, the Juniper path is far superior. You like vPC to eliminate spanning tree, but spanning tree is required to run vPC. You said that most errors are caused by humans, but have you seen what is needed, command-wise, to build a vPC from a pair of Nexus 7000s to 5000s? Nexus is setting you up for a disaster. It is much easier to build a LAG. Doesn't Brocade use technology similar to vPC? Pretty sure it is called MCTP; there you go, Brocade finally caught up to Cisco. Anyway, I digress. With the Juniper 8200 VC solution you truly eliminate spanning tree (in fact, you have to issue a command to enable spanning tree), and all you have to do for complete, non-blocking redundancy is build a link aggregation or port channel.

Jeremy Filliben said...

parkprimus,

I concede that I don't do a great job of keeping up with comments. I'm only comfortable adding my thoughts on technologies that I have direct personal experience with in production environments. That limits me to Cisco and Arista, when it comes to SPT-elimination technologies. There's enough misinformation available in other corners of the WWW; I'd rather not contribute even more ;)

To clarify re: vPC + SPT.. SPT is not a requirement when using vPC, but Cisco does recommend enabling it to guard against issues related to misconfiguration. A fully functional vPC environment has no redundant paths for spanning tree to block, so SPT will not block anything.