Wednesday, August 25, 2010

HSRP, vPC and the vPC peer-gateway Command

My recent post concerning my Migration from Catalyst to Nexus received a number of interesting and helpful comments.  One comment from routerworld caused me to do a bit of research into the “vpc peer-gateway” command.  This blog post is a summation of that research.

How HSRP Works

Hot Standby Router Protocol (HSRP) is a well-known feature of Cisco IOS.  The goal of HSRP is to provide a resilient default gateway to hosts on a LAN.  This is accomplished by configuring two or more routers to share a virtual IP address and virtual MAC address.  Hosts on the LAN are configured with a single default-gateway (either statically or via DHCP).
Upon sending its first packet to another subnet, the host ARPs for the MAC address of the default gateway.  It receives an ARP reply with the virtual MAC of the HSRP group.  The IP packet is then encapsulated in an Ethernet frame with a destination MAC address of the default gateway.  If the active router fails, HSRP hellos are lost and the standby HSRP router takes over the virtual IP address and MAC address.  The host does not need to know that anything has changed.

In the diagram above, the user (10.1.1.100) is configured with a default-gateway of 10.1.1.1.  When the user sends its first packet to 10.5.5.5, it ARPs for 10.1.1.1.  In my example, Router A is the HSRP active router, so it sends an ARP reply with the virtual MAC address of 0000.0c07.ac05.  The User PC then encapsulates the IP packet (destination IP=10.5.5.5) in an Ethernet frame with a destination MAC address of 0000.0c07.ac05.  Router A accepts the frame and routes the packet.
The above paragraphs tell the story of packets coming from the HSRP-enabled LAN.  But what happens to reply packets coming from 10.5.5.5 to 10.1.1.100?  The answer is simple, and intuitive if you follow it step by step.  First, the Server creates an IP packet with a destination of 10.1.1.100.  It encapsulates it in an Ethernet frame and forwards it to its default gateway (for this example, let's say it is Router A).  Router A strips the Ethernet framing and determines that the next hop is on the local subnet 10.1.1.0/24.  It encapsulates the packet in an Ethernet frame with a destination MAC address of 0021.6a98.1952 (the User PC's MAC).  The source MAC address is the physical MAC address of Router A (0024.f71e.3343).  Router A does not use the virtual MAC address for packets it routes onto the local subnet.
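For reference, a minimal HSRP configuration matching the diagram would look something like the sketch below.  The interface names, physical addresses and priorities are placeholders of my choosing; group number 5 is what yields the 0000.0c07.ac05 virtual MAC.

! Router A - intended HSRP active router
interface FastEthernet0/0
 ip address 10.1.1.2 255.255.255.0
 standby 5 ip 10.1.1.1
 standby 5 priority 110
 standby 5 preempt
!
! Router B - standby router
interface FastEthernet0/0
 ip address 10.1.1.3 255.255.255.0
 standby 5 ip 10.1.1.1
 standby 5 priority 100
 standby 5 preempt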

So, what is vPC?

Now that we've covered HSRP, let's talk about virtual Port Channels (vPC).  vPC allows two NX-OS devices to share a port-channel.  Attached devices believe they are connected to a single device via an etherchannel bundle.  This is great because it eliminates spanning-tree blocking along parallel paths.
To allow this to work, the paired NX-OS devices use two vpc-specific communication channels.  The first is a vpc peer-keepalive message.  This heartbeat lets one switch detect when the other has gone off-line, to prevent traffic from being dropped during a failure.  These are lightweight hello packets.
The second communication channel is the vpc peer-link.  This is a high-speed connection between the two NX-OS switches that is used to stitch together the two sides of the port-channel.  If a frame arrives on switch A, but is destined for a host on switch B, it is forwarded across the peer-link for delivery.  All things being equal, it is undesirable to forward frames across a vpc peer-link.  It is much better for the frame to be sent to the correct switch in the first place.  Of course, there’s no way for the attached device to know which path is more appropriate.
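To make the two channels concrete, here is a rough sketch of the vPC plumbing on one of the Nexus switches (the domain number, keepalive addresses and port-channel number are illustrative, not lifted from a production config; the peer mirrors this with the keepalive addresses swapped):

! On Nexus A
feature vpc
!
vpc domain 10
  peer-keepalive destination 192.168.1.2 source 192.168.1.1
!
interface port-channel1
  switchport mode trunk
  vpc peer-link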

In the above example, the User PC is sending an Ethernet frame to the Server.  It creates a frame with a destination MAC address of 0033.9328.12a1 and sends it to the L2 Switch.  The L2 Switch has an entry in its forwarding table indicating that the destination MAC is reachable via the Port-Channel 100 interface.  It uses its etherchannel load-balancing hash algorithm to determine which physical interface to forward the frame out of.  It is equally likely to choose the link to Nexus B, even though the more efficient path is to Nexus A (someday TRILL will help us, but for now there is no solution).  If the frame is sent to Nexus B, Nexus B will forward it over the vPC peer-link to Nexus A.
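On the Nexus side, the shared port-channel toward the L2 switch is an ordinary port-channel carrying a vpc number, configured identically on both peers; which member link actually gets used is decided entirely by the downstream switch's hash.  A sketch, with invented port-channel numbers:

! On both Nexus A and Nexus B
interface port-channel100
  switchport mode trunk
  vpc 100
!
! On the downstream Catalyst L2 switch - pick the load-balancing hash
port-channel load-balance src-dst-ip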

Cisco’s current recommendation is to build the vPC peer-link with multiple dedicated 10GE connections for performance reasons.  Cisco also recommends that all devices in a vPC-enabled VLAN be connected to both Nexus switches.  In the diagram above, the Server is attached to a vPC orphan port (a port in a vPC-enabled VLAN that is not itself part of a vPC).  This is undesirable, since reaching it requires use of the vpc peer-link.  It also has implications for multicast traffic forwarding.

vPC and HSRP Together

Now we’ve arrived at the point where we can pull all this information together.  In the following diagram, the User PC has been moved to a new VLAN.  The user is again trying to communicate with the server.

The User PC ARPs for its default gateway.  Nexus A (the HSRP active router) replies with the virtual MAC address of 0000.0c07.ac05.  The user creates an Ethernet frame with a destination address of the virtual MAC.  It then forwards the frame to the L2 Switch.  The L2 Switch uses its etherchannel load-balancing algorithm to determine the physical link to use.  The difference now is that it doesn't matter which link it uses.  The NX-OS switch on the other end will accept and route the packet.  In effect, both Nexus switches are HSRP active at the same time.  This eliminates the need to forward Ethernet frames across the vPC peer-link for packets that are destined for other subnets.
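There is no special knob for this dual-active behavior; it comes along with vPC.  On the Nexus pair the SVI carries plain HSRP configuration, roughly like the following sketch (VLAN number, addressing and priority are illustrative):

feature interface-vlan
feature hsrp
!
interface Vlan10
  ip address 10.1.1.2/24
  no shutdown
  hsrp 5
    ip 10.1.1.1
    priority 110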

What Does “vpc peer-gateway” Do?

If we left everything alone, the story would be complete.  Unfortunately, storage vendors thought it would be a good idea to optimize their handling of Ethernet frames.  Some NetApp and EMC equipment ignores the ARP reply given by the HSRP active router and instead addresses its outbound Ethernet frames to whichever source MAC address it last received frames from.  This is nonstandard behavior.
Using the diagram above, let's say that the User PC is now an EMC Celerra storage device.  The Server sends its packets (IP destination 10.1.1.100) to Nexus B, which routes them to the Ethernet LAN.  All IP packets with source IP 10.5.5.5 will be encapsulated in Ethernet frames with a source MAC address of 0022.5579.f643.  The EMC Celerra will cache the source MAC address of these frames, and when it has IP packets to send to 10.5.5.5, it will encapsulate them in Ethernet frames with a destination MAC of 0022.5579.f643.  It is choosing to ignore its default gateway for these outbound packets.
I suppose the theory behind this feature was to eliminate the extra hop within the LAN.  When HSRP is enabled, it is necessary to disable ICMP redirects.  This means that the routers will not inform hosts on the LAN that a better default-gateway is available for a particular destination IP address.  This storage feature saves a LAN hop.
Unfortunately, this optimization does not work well with vPC.  vPC relies on virtual MAC address sharing to reduce utilization across the vPC peer-link.  If hosts insist on addressing their frames to a specific router, suboptimal packet forwarding can occur.  According to Cisco, “Packets reaching a vPC device for the non-local router MAC address are sent across the peer-link and could be dropped by the built in vPC loop avoidance mechanism if the final destination is behind another vPC.”  At the application level we saw very poor performance due to these dropped packets.  Enough of the packets got through to allow access to the storage device, but file load times were measured in the tens of seconds, rather than milliseconds.
The "vpc peer-gateway" command allows an HSRP router to accept and route frames destined for its vPC peer's MAC address.  In effect, it extends the virtual MAC address behavior to the paired router's physical MAC address.  With the feature enabled, the storage vendors' optimization no longer forces traffic across the peer-link, because either switch will route frames addressed to either router MAC.
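The configuration change itself is a one-liner under the existing vpc domain (the domain number here is just the one from my earlier sketch):

vpc domain 10
  peer-gateway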

Conclusion

If you are running vPC and HSRP, and you have EMC or NetApp storage equipment, you probably need to add the “peer-gateway” command under your vpc configuration.  The only caveat to peer-gateway is the following (from NX-OS 5.0 - Configuring vPC):
Packets arriving at the peer-gateway vPC device will have their TTL decremented, so packets carrying TTL = 1 may be dropped in transit due to TTL expire. This needs to be taken into account when the peer-gateway feature is enabled and particular network protocols sourcing packets with TTL = 1 operate on a vPC VLAN.
I have yet to face this issue, so my recommendation is to add this to your vpc configuration as a default.

Tuesday, August 24, 2010

Primo Hoagie Nutritional Information

(My apologies to anyone not interested in this.. it is completely off-topic from my usual network-related postings)

I recently decided to pay more attention to what I’m eating. My favorite hoagie shop is Primo Hoagie. I looked online for their nutritional information but I could not find it anywhere. The local shop was able to supply me with a printed sheet. There must be others who are interested in this information, so I decided to post it to my blog.

Primo-Sized Sandwiches (click for full-sized page):

Wraps (click for full-sized page):

Wednesday, August 11, 2010

A Brief Introduction to LISP

At Cisco Live 2009 I was introduced to the Locator/ID Separation Protocol (LISP).  I thought the idea seemed interesting, but I didn't quite follow the practical purpose for it.  While planning my Cisco Live 2010 schedule I made sure to revisit this topic to get a better understanding of it.  After the breakout session and a few hours of experimenting, I believe I have a good feel for the issue LISP is attempting to solve, and the manner in which it intends to solve it.

By the way, I'm going to skip the requisite joke about the recursive programming language.  I will say that in college I spent a semester and a half programming in Scheme.  The first semester was an Introduction to Computer Science course, and wasn't too bad.  The half semester was an advanced course called Artificial Intelligence.  It was a struggle for me to wrap my head around all the recursion.  When I reached the halfway point of the course, I asked the instructor about my outlook for a good grade.  She politely suggested I consider dropping the course :)

 

The IP Routing Problem

Before discussing LISP, it is useful to compare and contrast how DNS works versus how Internet IP routing works.  Both DNS and IP routing deal with large databases.  With DNS, the database is truly distributed.  End-user DNS servers (for example, your corporate Internet-attached DNS server) are configured with specific names for their authoritative zones, plus enough information to allow them to look up any other information they might need.  Random DNS entries (for example, www.cisco.com) are not pre-loaded into your DNS server.  If you need to resolve this name, the DNS server requests the information from an upstream server.  Caching adds some efficiency, but does not change the overall structure.  This system has allowed the number of DNS zones to scale well into the millions.

The database for IP routing is handled quite differently.  In the Internet's Default Free Zone (DFZ), all routers must carry the entire Internet routing database.  Summarization can help alleviate this requirement to a degree, but summarization also comes at the price of less accurate routing information.  The IPv4 routing database is currently 325,000 routes, give or take a few thousand.  Ultimately the IPv6 table will be as large, or more likely considerably larger, and the increased size of the address space will result in larger memory requirements.  And remember, this is high-speed router memory, not generic DRAM.  Wouldn't it be great if we could transition IP routing from its current 'replicated database' model to a distributed database, like DNS?  As a matter of fact, that's the goal of LISP.

 

Following a Connection in a LISP-Enabled World

Let’s go through a simple example in a fully LISP-enabled environment.  A user PC would determine the destination IP address of a web server via DNS and create a standard IP packet, with its own IP as the source and the IP of the web server as the destination.  The packet would be routed through the user’s LAN until it reached an Internet gateway.

 


At this point, the LISP-aware ISP router (in LISP-speak, the Ingress Tunnel Router, or ITR) would perform a lookup for the destination IP address.  An answer would come back with the IP address of the Egress Tunnel Router (ETR) for that destination.  The ITR would then encapsulate the user’s packet in a LISP packet, with a source IP of the ITR’s ISP interface and a destination of the ETR’s ISP interface.  This packet would then be sent through the Internet to the ETR.

The ETR receives the LISP-encapsulated packet, removes the header and routes the native IP packet (with the user’s PC as a source IP and the web server’s IP as a destination) into the local LAN, to the web server.  Return traffic from the web server to the user follows the same procedure in reverse (the web server’s ISP router acts as the ITR and the user’s ISP router is the ETR).
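I'll cover my own configuration in the follow-up post, but to give a feel for the moving parts, an xTR (a router acting as both ITR and ETR) ends up with something shaped roughly like the sketch below.  The CLI has been changing from image to image while LISP remains experimental, and every prefix, address and key here is an invented example, so treat this as an illustration of the pieces rather than a working config:

router lisp
 ! Advertise the local EID prefix, reachable via this router's RLOC (its ISP-facing address)
 database-mapping 192.168.10.0/24 192.0.2.1 priority 1 weight 100
 ! Where to send map-requests (ITR role) and where to register the EID prefix (ETR role)
 ipv4 itr map-resolver 198.51.100.1
 ipv4 etr map-server 198.51.100.1 key example-key
 ipv4 itr
 ipv4 etr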

As a side note, the “Ingress” and “Egress” designations for Tunnel Routers are relative to the LISP-encapsulated tunnel.  The router that encapsulates a packet into LISP is the ingress router.

 

Tell Me Again… Why Is This Better?

On the surface this seems like extra work for little to no benefit.  Let’s dig a bit deeper to see how this helps each component in the path.

User PC – Nothing is different

ITR – This router only needs a default route to the Internet.  This example doesn't show it, but even with redundant connections, in a fully LISP-enabled environment only a default route is required.

Internet PE / P routers – This is where the magic happens.  The Internet PE and P routers only need routes for the ISP interfaces of the customer routers.  All packets between customers will be encapsulated, with the source and destination IP addresses coming from the WAN circuits.  Memory requirements are greatly reduced in these devices.

ETR – Only requires a default route to the Internet.

Web Server – Nothing is different

 

Drawbacks

So what are the drawbacks to LISP?  I can think of several:

MTU Issues – At its heart, LISP is a tunneling technology.  All tunneling technologies suffer from potential MTU issues.

Complexity – This paradigm is clearly different than what most of us are comfortable with.  But we didn’t enter this field thinking nothing would ever change, right?

Delay – The initial packet towards a new destination will likely get dropped, because the destination lookup takes time.  After the first packet the destination information is cached, so subsequent packets should flow without delay.  According to the Cisco LISP team, testing has shown this isn’t as big of an issue as it appears in theory.

A few optimizations have been suggested to deal with this delay issue.  First, a set of common destinations could be pre-programmed into ITRs (subnets associated with Google, for example).  Colin McNamara had a particularly interesting suggestion of ITRs performing DNS reply snooping, as it is highly likely that a DNS lookup will be followed quickly by an initial packet to that destination.  I’m not sure if this is being worked on, but it seems like a great idea.

 

What is Missing?

I glossed over the entire destination IP address lookup portion of LISP.  I will post a follow-up article describing this step.  For now, trust me when I say that it is not much different than DNS.

I also skipped over the other advantages of LISP.  For one, the ETR completely controls how inbound traffic is delivered to it.  If a destination IP address has multiple ISP gateways, those gateways can instruct ITRs to load-balance between destinations.  Experienced network engineers should immediately see the power of this feature.  This could signal the end of our rudimentary BGP-based load-balancing mechanisms (AS-path prepending, subnet splitting/disaggregation, etc.).

A second benefit that LISP provides is its ability to send non-native traffic over a routed backbone.  In the example above I did not specify any of the IP addressing involved.  It is possible for the user PC and the web server to use IPv6, while the ISP network uses IPv4.  The ITR would receive an IPv6 packet from the user and perform a lookup which resolves to the IPv4 ISP address of the ETR.  The ITR would then encapsulate the IPv6 packet into an IPv4 LISP packet, which would then be sent over the ISP infrastructure.  When the ETR receives the LISP-encapsulated packet, it strips the header off and routes the IPv6 packet towards the web server.  This can also happen in reverse, where a pair of IPv4 speakers communicate over an IPv6 backbone.

Lastly, I completely bypassed the transitioning technologies to a fully LISP-enabled Internet.  They do exist, and it is very possible to deploy LISP incrementally.  We are long past the point of Flag Days.

 

What’s Next?

The LISP presentations I’ve seen have spent a lot of time describing the benefits of LISP for IPv4.  I’m not sure that is the correct place to focus.  As stated above, we are at about 325,000 IPv4 Internet routes.  At most I foresee a doubling in size of the routing table, to somewhere around 600,000 routes.  This factors in the usage of the 10% or so of remaining IPv4 space, as well as increased use of subnetting for load-balancing purposes.  I see the LISP protocol filling two needs.  First, it can be a great transitioning technology to IPv6, as well as a way to keep IPv4 alive over an IPv6-only infrastructure.  In an upcoming LISP blog post I will demonstrate how I am using LISP to reach the IPv6 Internet over an IPv4-only ISP.

Even more importantly, we are at the very beginning of the IPv6 route table explosion.  If the LISP team can get significant traction with LISP6, we can avoid the routing table bloat issue we’ve run into with IPv4.  Remember, the IPv4 routing table isn’t going anywhere, so we will only be adding to our problems with the deployment of IPv6.

If you are keeping track, I’ve promised two upcoming blog posts on LISP (my implementation experience and how the mapping database system works).  In the meantime, there are two useful resources online – lisp4.cisco.com and www.lisp4.net.  For a more scholarly take on LISP, see Petr Lapukhov’s article at iNE.com.

Wednesday, August 4, 2010

Migrating from Catalyst to Nexus

There are already a couple of great resources on the Internet for network engineers who are migrating from a Catalyst-based data center to a Nexus-based data center. Two of my favorites are Cisco's docwiki.cisco.com site, which hosts a number of IOS-to-NX-OS comparison pages, and a page put together by Carole Warner Reece of Chesapeake Netcraftsmen. Both of these resources helped me quite a bit in working out the syntax of NX-OS commands. This blog post is an attempt to supplement that base of knowledge with my own experiences in converting a production data center over to the NX-OS platform.

My Catalyst network was a fairly standard data center design, with a pair of 6509s in the core and multiple Top-of-Rack switches cascaded below. We used RAPID-PVST, with blocking occurring in the middle of a TOR stack.

The new Nexus environment looks pretty much the same. We have a pair of Nexus 7010s in the core with a layer of Nexus 5020 switches at the edge. Each 5020 supports 4 – 6 FEXs. The FEXs only uplink to a single 5020 switch. This new network was built alongside the existing Catalyst one. The plan was to interconnect the two environments at the cores (at layer 2) and migrate server ports over whenever possible. Once we reached a critical mass of devices on the Nexus side of the network we planned to move the Layer 3 functionality from the IOS environment to the NX-OS side.

A few months ago we finished building out the Nexus LAN and interconnected it to the Catalyst LAN. We used vPC on the Nexus side to reduce the amount of SPT blocking. All was well, and we began migrating servers to the Nexus infrastructure without any issues. Eventually we reached our pre-determined “critical mass” and scheduled a change window to migrate the SVIs to the Nexus side and reconfigure the core Catalyst 6509s as Layer-2 only devices. The configuration work for this migration was around 1500 lines, so it was not by any means trivial, but it was also quite repetitive due to the number of SVIs and size of the third party static routing tables. Here’s where the fun began.

The first issue we ran into was with an extranet BGP peering connection through a firewall. In our design, we connect various third parties to an aggregation router in a DMZ. The routes for these third parties are advertised to our internal network via BGP, through a statically-routed firewall. Most of our third party connections also utilize BGP, so we receive a variety of BGP AS numbers. In two cases, the BGP AS number chosen by the third party overlaps with one of our internal AS numbers. To rectify this, we enabled the “allowas-in” knob on the internal BGP peering routers. Unfortunately this knob will not be available on the Nexus platform until NX-OS 5.1. I should have caught this in my pre-implementation planning. This was fixed with a small set of static routes. Our medium-term plan is to work with the two third parties to change their AS numbers, and eventually we will implement “allowas-in”, once we upgrade to NX-OS 5.1. Another interesting thing to note about BGP on NX-OS is that the routers check the AS-path for loops for both eBGP and iBGP neighbors. IOS does not do any loop-checking on iBGP advertisements.
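For anyone planning the same move, the knob itself ends up under the neighbor's address-family once NX-OS 5.1 is in place; something along these lines, with AS numbers and the neighbor address invented purely for the example:

router bgp 65001
  neighbor 10.10.10.1 remote-as 64512
    address-family ipv4 unicast
      ! Permit routes whose AS-path already contains our own AS
      allowas-in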

With that behind us, we moved on to the SVIs and migrating our spanning-tree root to the Nexus switches. The SVI migration was trivial, but the SPT root migration caused issues. We were bitten by the behavior of Bridge Assurance, a default feature in NX-OS that was unavailable in our version of IOS (SXF train). Surprisingly, the lack of Bridge Assurance support didn’t prevent the Catalyst<->Nexus interconnect from working while the SPT root was on the Catalyst side of the network, but once we moved the root to the Nexus side, Bridge Assurance shut down the interconnects. The only acceptable solution to this issue (that I could find) was to disable Bridge Assurance globally on the Nexus switches. My error here is that I took for granted that my interconnect was properly configured, because it had been working for several months.

After encountering this issue I took another look at Terry Slattery's blog post on Bridge Assurance, and at Cisco's Understanding Bridge Assurance IOS Configuration Guide. The problem I experienced is that Bridge Assurance requires switches to send BPDUs upstream (towards the root), while normal RSTP behavior is to suppress the sending of BPDUs towards the root. When the Catalyst side of the network contained the root, the Catalyst switches sent BPDUs downstream to the Nexus switches (normal RSTP behavior) and the Nexus switches sent BPDUs upstream to the Catalysts, which is abnormal for RSTP, but harmless. The Catalyst switches simply discarded the BPDUs. Once the SPT root was migrated, the Nexus switches sent BPDUs to the Catalysts (normal), and the Catalysts suppressed all BPDUs towards the Nexus switches (normal for RSTP, but not correct for Bridge Assurance). For the first few seconds this was not a problem and forwarding worked fine, but eventually the Bridge Assurance timeout was reached and the Nexus switches put the ports into the BA-Inconsistent state. The "right" way to solve this issue is to upgrade the Catalyst switches to SXI IOS and re-enable Bridge Assurance. My preference is to simply retire the 6509s, so I'll have to keep tabs on the migration effort. If it looks like it will drag on for a while, I'll schedule the upgrades.

(Edit on 8/9/2010 - commenter "wojciechowskipiotr" noted that configuring the port-channels towards the 6500s with "spanning-tree port type normal" would also disable Bridge Assurance, but only for those specific ports. If I get an opportunity to try this configuration, I will report on whether it is successful.)
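For anyone hitting the same Bridge Assurance behavior, the two options look roughly like this on the Nexus side (the port-channel number is a placeholder):

! Option 1 - what I did: turn Bridge Assurance off globally
no spanning-tree bridge assurance
!
! Option 2 - wojciechowskipiotr's suggestion: disable it only on the links toward the 6500s
interface port-channel10
  spanning-tree port type normal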

The remaining issues I faced were minor, but in some cases are still lingering or just annoyed me:

1) Static routes with "name" tags are unavailable. I had gotten into the habit of adding named static routes to the network, especially for third-party routing. It appears that NX-OS does not support this.

2) VTP is unavailable. Based on conversations with other networkers, I’m probably the last living fan of VTP. I am sad to see it go. Fortunately in the Nexus environment there are fewer places to add VLANs (only the 5ks and 7ks).

3) Some of the LAN port defaults are different (when compared to IOS). For example, QoS trust is enabled by default. Also, “shutdown” is the default state for all ports. If a port is active, it’ll have a “no shutdown” on the config.

4) The OSPF default reference bandwidth is now 40 Gb/s, rather than the 100 Mb/s value in IOS. This is a good thing, since 100 Mb/s is woefully low in today's networks. (There's a short sketch after this list on keeping the value consistent across both platforms during a migration.)

5) Proxy-arp is disabled by default. Our migration uncovered a few misconfigured hosts. Not a big deal, but it is noteworthy. Proxy-arp can be enabled per SVI, but do you really want to do it?

6) We ran into a WCCP bug in NX-OS 4.2(3). For some reason NX-OS is not load-balancing among our two WAN accelerators. Whichever WAE is activated last becomes the sole WAE that receives packets. I have an active TAC case to find a solution to this issue. For now, we are running through a single WAN accelerator, which reduces the effectiveness of our WAN acceleration solution.
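On the reference-bandwidth point in item 4, it's worth pinning the same value on both platforms during a migration so IOS and NX-OS compute identical OSPF costs. A sketch of what I mean, with placeholder process tags (and note I'm assuming the NX-OS side accepts an explicit unit, which is how it appears in the documentation):

! Catalyst 6509 (IOS) - value is in Mb/s
router ospf 1
 auto-cost reference-bandwidth 40000
!
! Nexus 7010 (NX-OS)
router ospf 1
  auto-cost reference-bandwidth 40 Gbps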

I hope this helps someone with their own migration. This is going to be a common occurrence in our industry for the next few years, especially if Cisco has their way. If anyone has questions, please send me an email or post a comment.