My recent post concerning my Migration from Catalyst to Nexus received a number of interesting and helpful comments. One comment from routerworld caused me to do a bit of research into the “vpc peer-gateway” command. This blog post is a summation of that research.
How HSRP Works
Hot Standby Routing Protocol is a well-known feature of Cisco IOS. The goal of HSRP is to provide a resilient default-gateway to hosts on a LAN. This is accomplished by configuring two or more routers to share the same IP address and MAC address. Hosts on the LAN are configured with a single default-gateway (either statically or via DHCP).
Upon sending its first packet to another subnet, the host ARPs for the MAC address of the default gateway. It receives an ARP reply with the virtual MAC of the HSRP group. The IP packet is encapsulated in an Ethernet frame with a destination MAC address of the default gateway. If the primary router fails, HSRP keepalives are lost, and the standby HSRP router takes over the virtual IP address and MAC address. The host does not need to know that anything has changed.
In the diagram above, the user (10.1.1.100) is configured with a default-gateway of 10.1.1.1. When the user sends its first packet to 10.5.5.5, it ARPs for 10.1.1.1. In my example, Router A is the HSRP primary router, so it sends an ARP reply with the virtual MAC address of 0000.0c07.AC05. The User PC then encapsulates the IP packet (destination IP=10.5.5.5) in an Ethernet frame with a destination MAC address of 0000.0c07.AC05. Router A accepts the frame and routes the packet.
The above paragraphs tell the story of packets coming from the HSRP-enabled LAN. But what happens to reply packets coming from 10.5.5.5 to 10.1.1.100? The answer is simple, and intuitive if you follow step-by-step. First, the Server creates an IP packet with a destination of 10.1.1.100. It encapsulates it in an Ethernet frame and forwards it to its default gateway (for this example, let’s say it is Router A). Router A strips the Ethernet framing and determines the next hop is on the local subnet 10.1.1.0/24. It encapsulates the packet in an Ethernet frame with a MAC address of 0021.6a98.1952. The source MAC address is the physical MAC address of Router A (0024.F71E.3343). Router A does not use the virtual MAC address for packets it routes onto the local subnet.
So What is vPC?
Now that we’ve covered HSRP, let’s talk about Virtual Port Channeling (vPC). vPC allows two NX-OS devices to share a port-channel. Attached devices believe that they are connected to a single device via an etherchannel bundle. This is great because it eliminates spanning-tree blocking along parallel paths.
To allow this to work, the paired NX-OS devices use two vpc-specific communication channels. The first is a vpc peer-keepalive message. This heartbeat lets one switch detect when the other has gone off-line, to prevent traffic from being dropped during a failure. These are lightweight hello packets.
The second communication channel is the vpc peer-link. This is a high-speed connection between the two NX-OS switches that is used to stitch together the two sides of the port-channel. If a frame arrives on switch A, but is destined for a host on switch B, it is forwarded across the peer-link for delivery. All things being equal, it is undesirable to forward frames across a vpc peer-link. It is much better for the frame to be sent to the correct switch in the first place. Of course, there’s no way for the attached device to know which path is more appropriate.
In the above example, the User PC is sending an Ethernet frame to the Server. It creates a frame with a destination MAC address of 0033.9328.12A1 and sends it to the L2 Switch. The L2 switch has an entry in his forwarding table indicating that the destination MAC is accessible via the Port-Channel 100 interface. It uses its etherchannel load balancing hash algorithm to determine which physical interface to forward the frame onto. It is equally likely that it will choose the link to Nexus B, even though the more efficient path is to Nexus A (someday TRILL will help us, but for now there is no solution). If the frame is sent to Nexus B, it will forward the frame over the vPC peer-link to Nexus A.
Cisco’s current recommendation is to build the vPC peer-link with multiple dedicated 10GE connections for performance reasons. Cisco also recommends that all devices in a vPC-enabled VLAN be connected to both Nexus switches. In the diagram above, the Server is considered to be a vpc orphan port. This is undesirable, since it requires usage of the vpc peer-link. It also has implications with multicast traffic forwarding.
vPC and HSRP Together
Now we’ve arrived at the point where we can pull all this information together. In the following diagram, the User PC has been moved to a new VLAN. The user is again trying to communicate with the server.
The User PC ARPs for his default gateway. Nexus A (the HSRP primary) replies with the virtual MAC address of 000.0C07.AC05. The user creates an Ethernet frame with a destination address of the virtual MAC. It then forwards the frame to the L2 Switch. The L2 Switch uses its etherchannel load balancing algorithm to determine the physical link to use. The difference is now that it doesn’t matter which link it uses. The NX-OS switch on the other end will accept and route the packet. In effect, both Nexus switches are HSRP active at the same time. This is eliminates the need to forward Ethernet frames across the vPC peer-link for packets that are destined for other subnets.
What Does “vpc peer-gateway” Do?
If we left everything alone, the story would be complete. Unfortunately, storage vendors thought it would be a good idea to optimize their handling of Ethernet frames. Some NetApp and EMC equipment ignores the ARP reply given by the HSRP primary and instead forwards Ethernet frames to whichever MAC address it receives frames from. This is nonstandard behavior.
Using the diagram above, let’s assume say that the User PC is now a EMC Celera storage device. The Server sends its packets (IP destination 10.1.1.100) to Nexus B, which routes them to the Ethernet LAN. All IP packets with source IP 10.5.5.5 will be encapsulated in Ethernet frames with a source MAC address of 0022.5579.F643. The EMC Celera will cache the source MAC address of these frames, and when it has IP packets to send to 10.5.5.5, it will encapsulate them in Ethernet frames with a destination MAC of 0022.5579.F643. It is choosing to ignore its default gateway for these outbound packets.
I suppose the theory behind this feature was to eliminate the extra hop within the LAN. When HSRP is enabled, it is necessary to disable ICMP redirects. This means that the routers will not inform hosts on the LAN that a better default-gateway is available for a particular destination IP address. This storage feature saves a LAN hop.
Unfortunately, this optimization does not work well with vPC. vPC relies on virtual MAC address sharing to reduce utilization across the vPC peer-link. If hosts insist on addressing their frames to a specific router, suboptimal packet forwarding can occur. According to Cisco, “Packets reaching a vPC device for the non-local router MAC address are sent across the peer-link and could be dropped by the built in vPC loop avoidance mechanism if the final destination is behind another vPC.” At the application level we saw very poor performance due to these dropped packets. Enough of the packets got through to allow access to the storage device, but file load times were measured in the tens of seconds, rather than milliseconds.
The “vpc peer-gateway” allows HSRP routers to accept frames destined for their vPC peers. This feature extends the virtual MAC address functionality to the paired router’s MAC address. By enabling this feature, NX-OS effectively disables the storage vendors’ optimization.
If you are running vPC and HSRP, and you have EMC or NetApp storage equipment, you probably need to add the “peer-gateway” command under your vpc configuration. The only caveat to peer-gateway is the following (from NX-OS 5.0 - Configuring vPC):
Packets arriving at the peer-gateway vPC device will have their TTL decremented, so packets carrying TTL = 1 may be dropped in transit due to TTL expire. This needs to be taken into account when the peer-gateway feature is enabled and particular network protocols sourcing packets with TTL = 1 operate on a vPC VLAN.
I have yet to face this issue, so my recommendation is to add this to your vpc configuration as a default.