Wednesday, August 25, 2010

HSRP, vPC and the vPC peer-gateway Command

My recent post concerning my Migration from Catalyst to Nexus received a number of interesting and helpful comments.  One comment from routerworld caused me to do a bit of research into the “vpc peer-gateway” command.  This blog post is a summary of that research.

How HSRP Works

The Hot Standby Router Protocol (HSRP) is a well-known feature of Cisco IOS.  The goal of HSRP is to provide a resilient default-gateway to hosts on a LAN.  This is accomplished by configuring two or more routers to share the same virtual IP address and virtual MAC address.  Hosts on the LAN are configured with a single default-gateway (either statically or via DHCP).
Upon sending its first packet to another subnet, the host ARPs for the MAC address of the default gateway.  It receives an ARP reply with the virtual MAC of the HSRP group.  The IP packet is encapsulated in an Ethernet frame with a destination MAC address of the default gateway.  If the primary router fails, HSRP keepalives are lost, and the standby HSRP router takes over the virtual IP address and MAC address.  The host does not need to know that anything has changed.

In the diagram above, the user (10.1.1.100) is configured with a default-gateway of 10.1.1.1.  When the user sends its first packet to 10.5.5.5, it ARPs for 10.1.1.1.  In my example, Router A is the HSRP primary router, so it sends an ARP reply with the virtual MAC address of 0000.0c07.AC05.  The User PC then encapsulates the IP packet (destination IP=10.5.5.5) in an Ethernet frame with a destination MAC address of 0000.0c07.AC05.  Router A accepts the frame and routes the packet.
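The virtual MAC in this example follows the HSRP version 1 convention: a fixed prefix of 0000.0c07.ac followed by the group number as one hex byte.  The small sketch below derives it.  The function name is my own, and it assumes the example uses HSRP version 1 with group 5 (which is what the trailing 05 suggests).

```python
def hsrp_v1_virtual_mac(group: int) -> str:
    """Return the HSRP version 1 virtual MAC for a group, in Cisco dotted notation."""
    if not 0 <= group <= 255:
        raise ValueError("HSRP version 1 groups range from 0 to 255")
    # The first five bytes (0000.0c07.ac) are fixed; the last byte is the group number.
    return f"0000.0c07.ac{group:02x}"


print(hsrp_v1_virtual_mac(5))  # 0000.0c07.ac05
```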
The above paragraphs tell the story of packets coming from the HSRP-enabled LAN.  But what happens to reply packets coming from 10.5.5.5 to 10.1.1.100?  The answer is simple, and intuitive if you follow step-by-step.  First, the Server creates an IP packet with a destination of 10.1.1.100.  It encapsulates it in an Ethernet frame and forwards it to its default gateway (for this example, let’s say it is Router A).  Router A strips the Ethernet framing and determines the next hop is on the local subnet 10.1.1.0/24.  It encapsulates the packet in an Ethernet frame with a destination MAC address of 0021.6a98.1952 (the User PC).  The source MAC address is the physical MAC address of Router A (0024.F71E.3343).  Router A does not use the virtual MAC address as the source for packets it routes onto the local subnet.

So, what is vPC?

Now that we’ve covered HSRP, let’s talk about Virtual Port Channel (vPC).  vPC allows two NX-OS devices to share a port-channel.  Attached devices believe that they are connected to a single device via an EtherChannel bundle.  This is great because it eliminates spanning-tree blocking along parallel paths.
To allow this to work, the paired NX-OS devices use two vPC-specific communication channels.  The first is the vPC peer-keepalive, a stream of lightweight hello packets.  This heartbeat lets one switch detect when the other has gone off-line, to prevent traffic from being dropped during a failure.
The second communication channel is the vpc peer-link.  This is a high-speed connection between the two NX-OS switches that is used to stitch together the two sides of the port-channel.  If a frame arrives on switch A, but is destined for a host on switch B, it is forwarded across the peer-link for delivery.  All things being equal, it is undesirable to forward frames across a vpc peer-link.  It is much better for the frame to be sent to the correct switch in the first place.  Of course, there’s no way for the attached device to know which path is more appropriate.

In the above example, the User PC is sending an Ethernet frame to the Server.  It creates a frame with a destination MAC address of 0033.9328.12A1 and sends it to the L2 Switch.  The L2 switch has an entry in its forwarding table indicating that the destination MAC is accessible via the Port-Channel 100 interface.  It uses its EtherChannel load balancing hash algorithm to determine which physical interface to forward the frame onto.  It is equally likely that it will choose the link to Nexus B, even though the more efficient path is to Nexus A (someday TRILL will help us, but for now there is no solution).  If the frame is sent to Nexus B, it will forward the frame over the vPC peer-link to Nexus A.
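To see why the L2 switch cannot prefer the efficient link, here is a deliberately simplified sketch of hash-based member selection.  It is not Cisco's actual algorithm (the real hash is platform-specific and configurable); the function names, link labels, and second client MAC are made up for illustration.  The point is only that the choice depends on the frame's addresses, not on where the destination is attached.

```python
def mac_to_int(mac: str) -> int:
    """Convert a Cisco dotted MAC (e.g. 0033.9328.12a1) to an integer."""
    return int(mac.replace(".", ""), 16)


def pick_member_link(src_mac: str, dst_mac: str, links: list[str]) -> str:
    """Pick a port-channel member from a hash of the frame's addresses.

    Illustrative only: the hash knows nothing about where the destination
    is physically attached, so it cannot prefer the "efficient" link.
    """
    index = (mac_to_int(src_mac) ^ mac_to_int(dst_mac)) % len(links)
    return links[index]


links = ["to-Nexus-A", "to-Nexus-B"]
server = "0033.9328.12a1"
# Two clients whose MACs differ only in the last bit (the second one is
# hypothetical) land on different member links for the same destination.
for client in ("0021.6a98.1952", "0021.6a98.1953"):
    print(client, "->", pick_member_link(client, server, links))
```

With this toy hash, the first client's frames go to Nexus B and the second client's go to Nexus A, even though the Server sits behind Nexus A in both cases.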

Cisco’s current recommendation is to build the vPC peer-link with multiple dedicated 10GE connections for performance reasons.  Cisco also recommends that all devices in a vPC-enabled VLAN be connected to both Nexus switches.  In the diagram above, the Server is considered to be a vPC orphan port.  This is undesirable, since it requires usage of the vPC peer-link.  It also has implications for multicast traffic forwarding.

vPC and HSRP Together

Now we’ve arrived at the point where we can pull all this information together.  In the following diagram, the User PC has been moved to a new VLAN.  The user is again trying to communicate with the server.

The User PC ARPs for its default gateway.  Nexus A (the HSRP primary) replies with the virtual MAC address of 0000.0C07.AC05.  The user creates an Ethernet frame with a destination address of the virtual MAC.  It then forwards the frame to the L2 Switch.  The L2 Switch uses its EtherChannel load balancing algorithm to determine the physical link to use.  The difference now is that it doesn’t matter which link it uses.  The NX-OS switch on the other end will accept and route the packet.  In effect, both Nexus switches are HSRP active at the same time.  This eliminates the need to forward Ethernet frames across the vPC peer-link for packets that are destined for other subnets.

What Does “vpc peer-gateway” Do?

If we left everything alone, the story would be complete.  Unfortunately, storage vendors thought it would be a good idea to optimize their handling of Ethernet frames.  Some NetApp and EMC equipment ignores the ARP reply given by the HSRP primary and instead forwards Ethernet frames to whichever MAC address it receives frames from.  This is nonstandard behavior.
Using the diagram above, let’s say that the User PC is now an EMC Celerra storage device.  The Server sends its packets (IP destination 10.1.1.100) to Nexus B, which routes them to the Ethernet LAN.  All IP packets with source IP 10.5.5.5 will be encapsulated in Ethernet frames with a source MAC address of 0022.5579.F643.  The EMC Celerra will cache the source MAC address of these frames, and when it has IP packets to send to 10.5.5.5, it will encapsulate them in Ethernet frames with a destination MAC of 0022.5579.F643.  It is choosing to ignore its default gateway for these outbound packets.
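The difference between a standard host and one with this optimization can be sketched in a few lines.  This is my own toy model, not any vendor's implementation: the Host class, the reflect flag, and the last_hop cache are hypothetical names, and only the MAC and IP addresses come from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    gateway_mac: str              # learned via ARP for the default gateway
    reflect: bool = False         # hypothetical flag modelling the vendor optimization
    last_hop: dict[str, str] = field(default_factory=dict)  # remote IP -> source MAC seen

    def receive(self, src_ip: str, src_mac: str) -> None:
        """Remember which MAC delivered traffic from a remote IP."""
        self.last_hop[src_ip] = src_mac

    def next_hop_mac(self, dst_ip: str) -> str:
        """Choose the destination MAC for an off-subnet packet."""
        if self.reflect and dst_ip in self.last_hop:
            return self.last_hop[dst_ip]   # reply to whoever sent the frame
        return self.gateway_mac            # standard behaviour: use the ARP result


HSRP_VMAC = "0000.0c07.ac05"
NEXUS_B_MAC = "0022.5579.f643"

standard = Host(gateway_mac=HSRP_VMAC)
storage = Host(gateway_mac=HSRP_VMAC, reflect=True)
for host in (standard, storage):
    host.receive("10.5.5.5", NEXUS_B_MAC)

print(standard.next_hop_mac("10.5.5.5"))  # 0000.0c07.ac05
print(storage.next_hop_mac("10.5.5.5"))   # 0022.5579.f643
```

Both hosts received the same frame from Nexus B, but only the reflecting host addresses its reply to Nexus B's physical MAC instead of the HSRP virtual MAC.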
I suppose the theory behind this feature was to eliminate the extra hop within the LAN.  When HSRP is enabled, it is necessary to disable ICMP redirects.  This means that the routers will not inform hosts on the LAN that a better default-gateway is available for a particular destination IP address.  This storage feature saves a LAN hop.
Unfortunately, this optimization does not work well with vPC.  vPC relies on virtual MAC address sharing to reduce utilization across the vPC peer-link.  If hosts insist on addressing their frames to a specific router, suboptimal packet forwarding can occur.  According to Cisco, “Packets reaching a vPC device for the non-local router MAC address are sent across the peer-link and could be dropped by the built in vPC loop avoidance mechanism if the final destination is behind another vPC.”  At the application level we saw very poor performance due to these dropped packets.  Enough of the packets got through to allow access to the storage device, but file load times were measured in the tens of seconds, rather than milliseconds.
The “vpc peer-gateway” command allows HSRP routers to accept frames destined for their vPC peers.  This feature extends the virtual MAC address functionality to the paired router’s MAC address.  By enabling this feature, NX-OS effectively disables the storage vendors’ optimization.
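Putting the pieces together, the decision a vPC peer makes for a frame arriving on a vPC member port can be approximated as below.  This is a rough model based on my reading of the behavior described in this post, not NX-OS internals: handle_frame and its parameters are hypothetical, and I am reusing Router A's MAC from the HSRP example as Nexus A's MAC purely for illustration.

```python
HSRP_VMAC = "0000.0c07.ac05"
NEXUS_A_MAC = "0024.f71e.3343"   # assumption: Router A's MAC reused for Nexus A
NEXUS_B_MAC = "0022.5579.f643"


def handle_frame(dst_mac: str, own_mac: str, peer_mac: str,
                 peer_gateway: bool, egress_is_vpc: bool) -> str:
    """Describe what a vPC peer does with a frame received on a vPC member port."""
    if dst_mac in (HSRP_VMAC, own_mac):
        return "routed locally"
    if dst_mac == peer_mac:
        if peer_gateway:
            return "routed locally on behalf of the peer"
        # Without peer-gateway the frame is bridged across the peer-link.
        # After the peer routes it, loop avoidance blocks frames that came
        # in on the peer-link from leaving through a vPC member port.
        if egress_is_vpc:
            return "sent over peer-link, then dropped by loop avoidance"
        return "sent over peer-link, then routed by the peer"
    return "bridged as ordinary Layer 2 traffic"


# The storage device addresses its frame to Nexus B, but the L2 switch's
# hash happens to deliver it to Nexus A.
for enabled in (False, True):
    result = handle_frame(NEXUS_B_MAC, NEXUS_A_MAC, NEXUS_B_MAC,
                          peer_gateway=enabled, egress_is_vpc=True)
    print(f"peer-gateway={enabled}: {result}")
```

With peer-gateway off, the frame crosses the peer-link and is dropped when the destination sits behind another vPC; with it on, Nexus A simply routes the packet itself.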

Conclusion

If you are running vPC and HSRP, and you have EMC or NetApp storage equipment, you probably need to add the “peer-gateway” command under your vpc configuration.  The only caveat to peer-gateway is the following (from NX-OS 5.0 - Configuring vPC):
Packets arriving at the peer-gateway vPC device will have their TTL decremented, so packets carrying TTL = 1 may be dropped in transit due to TTL expire. This needs to be taken into account when the peer-gateway feature is enabled and particular network protocols sourcing packets with TTL = 1 operate on a vPC VLAN.
I have yet to face this issue, so my recommendation is to add this to your vpc configuration as a default.
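For anyone wondering what the TTL caveat means in practice, here is a minimal sketch of a routed hop.  The function name is my own, and it only models the general IP rule that a router decrements TTL and discards packets that would expire.  With peer-gateway enabled, a frame addressed to the peer's MAC is routed by the switch that receives it, so a TTL = 1 packet that was meant to reach the peer directly would not survive that hop.

```python
from typing import Optional


def route_hop(ttl: int) -> Optional[int]:
    """Return the TTL after one routed hop, or None if the packet expires."""
    if ttl <= 1:
        return None        # TTL would reach zero: the packet is discarded
    return ttl - 1


print(route_hop(64))  # 63
print(route_hop(1))   # None
```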

32 comments:

chris marget said...

Frustrating behavior on the part of the storage vendor, huh?

EMC calls this misfeature "packet reflect". Rather than sending an arp query to learn the gateway's MAC address, the EMC gear just swaps the order of the addresses in the L2 and L3 header (hence the reflection).

When I faced this issue with a Nexus + HSRP environment, I asked the storage guys to switch off packet reflect, rather than enable 'peer gateway' (which I was nervous about at the time).

Unknown said...

Nice post, I have yet to encounter this, or at least I hope that we haven't. Kinda sucks when people decide to step away from the standards..

Thanks

Unknown said...

Load balancers will also have the same issues. We have seen this on the F5 gear and I have heard rumors of it being necessary on the Netscaler stuff. We just started adding peer gateway by default as well. Nice post.

routerworld said...

Hi
Jeremy, thanks for the post. We did encounter this issue with EMC storage. I did a packet capture between the Nexus and the EMC/NS-480, and you can see the EMC gear doing the reflection.

Once again, thank you for the Nice post.

Thank you
Viral/Routerworld

mlit said...

I would like to second Randy's comment regarding F5 Load Balancers. We had this issue with our F5s during our testing and it was quite the time hole.

Unknown said...

NetApp calls this behavior "FastPath". NetApp's documentation says its purpose is to increase performance by avoiding lookups in the routing table and ARP cache. I imagine that EMC does it for similar reasons. It's easy enough to disable on a NetApp.

F5 calls it "auto last hop", and lists the same reason as NetApp, and also adds that it produces more correct behavior (such as eliminating asymmetric routing) in certain circumstances. But F5 admits that it interacts poorly with HSRP, and documents how to disable it.

We have established by observation that NetScalers also exhibit the problem, but have so far not had any luck tracking down any documentation on the behavior.

Unknown said...

Oh yeah, I forgot to mention that this behavior wreaks havoc when one of the HSRP routers fails. Now you have devices sending frames to a MAC address that is no longer on the air, with predictable results. We discovered this the hard way when we shut down routing on our B-side N7k, only to discover that our load balanced services provided by our NetScalers quit working.

To be honest, I can understand why these vendors implemented this feature. It's not like it doesn't have its uses. And it works well when HSRP isn't involved. The real problem comes when this feature isn't prominently documented, with big warnings about its failure modes in an environment with HSRP.

Nikhil Nemade said...

Very helpful post indeed. I currently have a TAC case open with Cisco about some intermittent connectivity loss through our vPC enabled Nexus switches and the Cisco Engineer is insistent on enabling peer-gateway. Your blog explained this setting very well.

Unknown said...

Hello, this is a great read; there is one item regarding the vpc setup that is confusing me.

Currently our two Nexus 7Ks are connected via an etherchanneled dot1q trunk utilizing 10gig ports. We are running HSRP between the two boxes.

Can this link become the vPC peer link? Is the vPC peer link able to then carry out the functions of the existing trunk? Or do I need a separate link to function as the vPC peer link?

Jeremy Filliben said...

David,

Brad Hedlund wrote a good post on this topic:

http://bradhedlund.com/2010/12/16/routing-over-nexus-7000-vpc-peer-link-yes-and-no/

The simplified version is that you can run a routing protocol between the 7Ks, but it is not a good idea to have a third routing device on the vPC peer-link interfaces.

Jeremy

Unknown said...

We just had a similar issue with Cisco 5508 WLCs. We have two, one connected to each Nexus 7K. We did this because Cisco does not officially support using the WLCs in a vPC. There is an open caveat (CSCtf27464) in the current 7.0 version of code in the WLCs (and also some previous versions) that forces the management interface to respond to the physical MAC address and not the HSRP virtual MAC (so it behaves just like some of the aforementioned storage devices). We don't have the vpc peer-gateway feature enabled due to previous issues. We're going to re-enable the feature tonight to see if that fixes the issue.

Jason said...

Great Post! Unfortunately, I didn't know what I was looking for at the time so it took a call to the TAC to resolve this. Afterward, I was trying to understand fully what the peer-gateway command did and came across this post.

We were seeing the issue with F5 load balancers, and everything behind the load balancer. Our 7Ks have been in production for over a year now, but these are the first L3 HSRP interfaces we have put on them, so this was a bit of a stumper. Thanks for the great explanation.

Geert said...

Great Great Read. Should be incorporated in the Cisco documentation, I feel :-)

Ned said...

Awesome post. I just have a question on the connectivity of the server in the 3rd diagram. It seems that you are showing the server connected via an Ethernet LAN; is that assuming the server is connected with 2 NICs, or is it showing something like a FEX which is dual-homed to both 5Ks? If there is only 1 NIC on the server, then is it still considered an orphan, so traffic will still flow over the vPC peer link? Or if it is in a FEX connected in active/standby fashion, then unless some NIC teaming is done on the server that accepts frames for the MAC of the active NIC, it will still have to flow over the peer link. Pls advise if this is not the way it works. tx

Jeremy Filliben said...

Ned,

Thank you for the feedback.

I left the server-side of the network purposely vague, to keep the focus on the user PC side of the diagram. For the purposes of the diagram, you can assume the server is single-connected to a 3rd switch which is vPC-connected to the two Nexus switches. I think that will match up with the explanation I provided.

Tina said...

Great Article, Jeremy. Apparently, we had this happen in our environment recently and I am reading your article after it's been fixed. Wish I had come across it earlier. Great read!

-- K said...

Hi, thank you for the nice and clear article about HSRP and vPC. I'm not a network specialist, but it's nice to know something about what we can face in our complex environment.

Thanks once more, I hope to see more articles about Nexus environments and that kind of things, keep on rocking

Mark said...

HI Jeremy,

Great explanation, one question. Originally I thought the issue was specific to the 7k, does the 5K now support peer-gateway?

Jeremy Filliben said...

Mark,

It does appear that this command is available on the Nexus 5500 platform.

http://www.cisco.com/en/US/docs/switches/datacenter/nexus5000/sw/layer2/503_n1_1/Cisco_n5k_layer2_config_gd_rel_503_N1_1_chapter8.html#task_F40A980F95E9439BA085E53FF7D2A07C

Jeremy

good times said...

Is the peer-gateway a command only to be entered on the switch pair that has the HSRP configured on it or throughout all pairs of VPC configured devices ie 5k's with only L2 uplinks?

locohost said...

What does Citrix call "auto last hop" on the Netscalers?

Network Pro said...

hi jeremy,

in the diagram above, with hsrp and vpc - say instead of the server (10.5.5.5) connected to both nexus i have a cisco 3750 connected to only one nexus (say nexus A) - how will this work ? as i believe traffic is not allowed through the peer link until the secondary connection fails ?

i have a similar setup where the cisco 3750 connects to the internet, however only a few end users can connect to the internet while the rest can't.

locohost said...

http://support.citrix.com/proddocs/topic/netscaler-advanced-networking-93-map/ns-nw-interfaces-configrng-mac-bsd-frwrdng-tsk.html

Jeremy Filliben said...

good times,

peer-gateway is only required on the devices which are participating in HSRP.

Jeremy

Jeremy Filliben said...

Network Pro,

Sorry for the delayed reply.. I hope you weren't waiting on my response!

If there is a NetApp/Citrix/EMC device connected to a downstream 3750, peer-gateway would still be required. If all the devices connected to the 3750 act correctly (no packets destined to the physical MAC address but using the HSRP IP address), peer-gateway would not be required.

Jeremy

Harish said...

Hi,

In our design, we are using a firewall as the L3 device, however we have NetApp servers in datacenters A & B and they are in the same subnet. The NetApp server IP addresses are 10.10.10.10/24 (in A) and 10.10.10.20/24 (in B) and the gateway is 10.10.10.1. The NetApp servers are connected via Nexus 7K and the data centers are connected via one 10G link as of now. While copying files from A to B on the NetApp we see loss of keepalives and mismatched ACKs. If we enable peer-gateway in vPC, will it solve our issue? Many thanks for your response.

L2: connection

NETAPP <> NEXUS DC A<> NEXUS DCB <> NETAPP

andyo said...

Great article, man.
Can U pls clarify where the drops occur (on the vpc-peer that doesn't have the local DST-MAC in its table, or on the vpc-peer that receives the frame from its vpc-peer destined to its local DST-MAC)? And am I correct in understanding the term "local MAC" as a MAC which is located on a device behind at least one intermediate switch?

Jeremy Filliben said...

Andy,

The drops occur on the N7K on the far-end of the vPC peer-link. The loop prevention rules within vPC cause the receiving switch to drop frames which arrive on the peer-link with its own MAC address as the destination.

In my explanation, 'local MAC' refers to the hardcoded MAC address of the SVI on the switch. This is in contrast to the HSRP virtual MAC.

I hope this helps,
Jeremy

andyo said...

Jeremy, thank U for your solid explanation. Please correct me if I'm wrong in thinking that this "rule of thumb" to drop a frame that arrived via the vpc peer-link and is destined to a vpc-member always applies, even if either vpc-peer is experiencing a problem with its vpc-member links?
I mean that if smth goes wrong with a vpc-member link on either N7K, and a frame arrives for a MAC placed behind the failed vpc-member on that N7K, and this N7K forwards the frame to its vpc-peer, will the latter drop the frame?
Thank U very much for your patience.

Jeremy Filliben said...

Andy,

That's not the case. If a vPC member link fails and the N7K on the other end of the failed link receives a frame with that device's MAC, it will forward the frame over the vPC peer-link so the paired N7K can deliver it.

This link explains the situation. It refers to the N5K, but the N7K should operate in the same way.

http://www.cisco.com/en/US/docs/switches/datacenter/nexus5000/sw/operations/n5k_vpc_ops.html#wp425273

Jeremy