Sunday, July 19, 2009

The Case for Performance Routing

The migration to layer 3 MPLS networks has reintroduced an old network convergence issue. How do we detect the failure of an end-to-end path? This issue has existed in the LAN since the advent of routing protocols. In the LAN, the question is “How do I detect the loss of my neighbor, when my interface does not fail?”

In the following diagram, Router B's link to Switch Z has failed.




How does Router A know that it no longer has a path to Router B? Depending on the routing protocol and platform, the answer could be loss of hellos or lack of BFD responses. Point-to-point LAN connections also solve this problem, which is why they are highly recommended wherever possible. On a point-to-point link, the physical interface drops when the neighbor fails, which gives the router an immediately actionable signal.
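To put rough numbers on hello-based detection, here is a minimal Python sketch. The function name and the timer values are illustrative assumptions, not any platform's defaults or implementation. It models a neighbor that sends hellos on a fixed interval and is declared down once the dead interval passes without one.

```python
def detection_delay_ms(fail_at_ms: int, hello_interval_ms: int, dead_interval_ms: int) -> int:
    """Return how long a failure goes unnoticed, in milliseconds.

    Assumes hellos arrive at every multiple of hello_interval_ms until the
    failure, and that the neighbor is declared down dead_interval_ms after
    the last hello that was received.
    """
    # Time of the last hello that made it through before the failure.
    last_hello_ms = (fail_at_ms // hello_interval_ms) * hello_interval_ms
    # The dead timer restarts at each hello, so it expires relative to the last one.
    declared_down_ms = last_hello_ms + dead_interval_ms
    return declared_down_ms - fail_at_ms


# Hypothetical routing-protocol timers: 5 s hello, 15 s dead interval.
print(detection_delay_ms(12_000, 5_000, 15_000))  # 13000

# Hypothetical fast-detection timers: 50 ms hello, 150 ms dead interval.
print(detection_delay_ms(12_020, 50, 150))  # 130
```

With a 5-second hello and a 15-second dead interval, a failure 2 seconds after the last hello goes unnoticed for 13 seconds. Millisecond-scale timers, the kind BFD is meant for, shrink the same arithmetic to a fraction of a second.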

How does this apply to L3 MPLS VPNs and wide area networks? Picture the carrier's MPLS network as a big switch. When the remote site router at the top of the following diagram loses its connection to MPLS A, it takes several seconds (at best) for the Core A router to lose its BGP routes. Running EIGRP or OSPF with the carrier doesn't help much either, as it is the carrier's BGP propagation delay that governs the downtime. EIGRP/OSPF is only the carrier's edge protocol; BGP is still the core protocol.



One solution to this issue is to build an overlay network based on static GRE tunnels or DMVPN and run an optimized IGP on it. My opinion is that these options add considerable complexity to what is a relatively simple topology. I try my best to avoid overlay networks, as they tend to increase troubleshooting time. Preserving the any-to-any connectivity that L3 VPNs provide requires a full mesh of GRE tunnels, or in the case of DMVPN, a minor delay while the dynamic tunnels are built. This solution also does not scale as well as a pure BGP-based MPLS environment.

Another solution to this issue is Performance Routing (PfR). This feature enables your Core A and Core B routers to monitor traffic flows through their respective MPLS providers. If packet loss or delay is detected, outbound traffic is dynamically redirected to the functioning path. A corresponding policy on the remote site router handles return traffic. Performance Routing currently has the ability to re-route traffic within three seconds, with plans to reduce that to one second. Performance Routing also detects and reacts to degraded paths, such as intermittent packet loss and high delay or jitter.
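The decision described above can be sketched in a few lines of Python. This is only an illustration of the idea of measuring each path against a policy and moving traffic when the current path falls out of it; it is not Cisco's algorithm, and every name and threshold here (PathStats, Policy, choose_exit, 1% loss, 150 ms delay) is a hypothetical chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PathStats:
    name: str          # e.g. "MPLS A"
    loss_pct: float    # measured packet loss, percent
    delay_ms: float    # measured delay, milliseconds


@dataclass(frozen=True)
class Policy:
    max_loss_pct: float
    max_delay_ms: float


def in_policy(stats: PathStats, policy: Policy) -> bool:
    """A path is usable when both loss and delay are within the thresholds."""
    return stats.loss_pct <= policy.max_loss_pct and stats.delay_ms <= policy.max_delay_ms


def choose_exit(current: str, paths: List[PathStats], policy: Policy) -> str:
    """Pick the path that outbound traffic should use."""
    by_name = {p.name: p for p in paths}
    # Stay on the current path while it meets policy, to avoid needless flapping.
    if in_policy(by_name[current], policy):
        return current
    # Otherwise move to the best alternative that does meet policy.
    candidates = [p for p in paths if p.name != current and in_policy(p, policy)]
    if not candidates:
        return current  # nothing better is available
    return min(candidates, key=lambda p: (p.loss_pct, p.delay_ms)).name


voice_policy = Policy(max_loss_pct=1.0, max_delay_ms=150.0)  # hypothetical thresholds
measured = [PathStats("MPLS A", 5.0, 40.0), PathStats("MPLS B", 0.0, 45.0)]
print(choose_exit("MPLS A", measured, voice_policy))  # MPLS B
```

Note that the sketch reacts to a degraded path (5% loss) and not just a dead one, which is the property that hello timers and interface state cannot give you.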

Performance Routing (and its predecessor/component, Optimized Edge Routing) has been around for 4-5 years. Cisco is still working to add new features and options. The engineer I spoke with at Cisco Live was quite eager to hear how I wanted to use it in my network, and took several of my suggestions on how to improve the technology. In fact, I owe him an email with the details… I’ll get on that right away!

Over the next few months, we’ll be implementing Performance Routing in our network. Our biggest need is to protect our voice paths. Like most organizations, we’ve practically eliminated dedicated voice trunks in our network. We have call center voice that flows over our MPLS backbone. When our carriers have issues, our first call is from the voice team. My expectation is that while PfR won’t prevent the first few seconds of degradation, it will be able to re-route our voice traffic much faster than our current methods. I’ll post the details of our implementation and results in subsequent blog entries.

3 comments:

ryan said...

Sounds like a pretty cool idea Jeremy. I will be interested in learning more about your implementation.

Unknown said...

Hi Jeremy, thx for the nice article. I had a question on one of your comments under the Rethinking Assumptions section where you seem to say that assigning individual AS numbers to your sites was the right choice.

" ... I kept coming back to the design decision to assign individual AS numbers to our core locations. .......

The right choice during our initial MPLS implementation was now a stumbling block in our potential PfR deployment."

I am in a similar situation where I am being asked to request that the carrier do an AS-override; however, I am not comfortable with allowing the carrier to break standard BGP loop prevention. I was wondering if you could list some of the reasons why you think assigning individual AS numbers per site was the right choice, even though it seems you had to change it for your CORE. Another question: did you assign individual AS numbers for all your Branch sites or only for the large locations? I am considering assigning an AS per location no matter how large or small. Thx and look forward to your posts.

Jeremy Filliben said...

vik,

I'll write a new blog post to address your question in the next week or so. This is a rough week, due to travel.

Thanks for your interest!
Jeremy