Monday, July 27, 2009

Preparing for Performance Routing

In my previous blog entry, I explained the role Performance Routing was expected to play in my network. This time around, I'd like to show the steps we've taken to prepare for its deployment.

My network core consists of three major locations. All three sites house significant user populations, while two of the locations also host our major North American data centers. The locations are interconnected with OC-3 & Ethernet technologies, and each site is connected to one of our two MPLS providers. The previous iteration of our network design used EIGRP to route between the three core locations, with filtering and summarization to prevent an unnecessarily large routing table, especially at the edges.

Our MPLS-based WAN uses BGP. With careful route injection at the core locations, and a small amount of AS prepending for the default route, we were able to achieve a rudimentary but effective load-balancing scheme. More effective load-balancing could have been achieved by sending the same routing table through both providers and using "bgp bestpath as-path multipath-relax", but we had two significant limiting factors that made our providers less than equal. First, only one of our providers could route our multicast traffic. The second provider also had an onerous Quality of Service charge, so we treated their network as a best-effort path.
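
To make the prepending idea concrete, here is a minimal Python sketch of how a remote site might choose between two copies of the default route. It is a deliberate simplification: it compares AS-path length only and ignores every other best-path step (weight, local preference, origin, MED and so on), and it treats "first candidate wins" as a stand-in for the remaining tie-breakers. The AS numbers, the advertise() helper and the best_paths() helper are hypothetical names for illustration, not our production values or any vendor's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    provider: str
    as_path: tuple  # AS numbers as seen by the receiving router, nearest AS first


def advertise(prefix, provider, provider_asn, origin_asn, prepend_count=0):
    """Build the route a remote site would receive through one provider.

    The originating core site appears once, plus once more per prepend.
    """
    path = (provider_asn,) + (origin_asn,) * (1 + prepend_count)
    return Route(prefix, provider, path)


def best_paths(routes, multipath_relax=False):
    """Return the route(s) that would be used, comparing AS-path length only."""
    shortest = min(len(r.as_path) for r in routes)
    candidates = [r for r in routes if len(r.as_path) == shortest]
    if multipath_relax:
        # Relaxed rule: equal-length paths through different ASes may share load.
        return candidates
    # Strict rule: only paths with identical AS-path contents may share load.
    winner = candidates[0]
    return [r for r in candidates if r.as_path == winner.as_path]


# Hypothetical private AS numbers: providers 64512/64513, core sites 65001/65002.
via_a = advertise("0.0.0.0/0", "MPLS A", 64512, 65001)
via_b = advertise("0.0.0.0/0", "MPLS B", 64513, 65002, prepend_count=2)
print([r.provider for r in best_paths([via_a, via_b])])  # ['MPLS A']
```

With the prepend removed, both paths have the same length. The strict rule still installs a single path, while the relaxed rule returns both, which is the behavior "multipath-relax" is meant to unlock (on a real router, a maximum-paths setting is needed as well).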

Here's a diagram of the previous network design:



"MPLS B" supported QoS and multicast, so we connected our "Core B" location to it, as that site housed the majority of our voice and video equipment.


Rethinking Assumptions

In my view, one of the most intriguing features of Performance Routing is the ability to inject new routes into a routing domain based on network performance. This can be done via BGP or static routes. The alternative to route injection-based path manipulation is policy-based routing (PBR), which makes me uncomfortable from a troubleshooting perspective. I can relatively easily track changes to my routing table, but how will I recreate a PBR-based network if I need to document a transient issue? Static-based route injection has promise, as these routes can be redistributed into any IGP, but for us, BGP made the most sense.

As we considered how best to implement PfR, I kept coming back to the design decision to assign individual AS numbers to our core locations. When PfR injects a BGP route, it sets the 'no-export' community, which prevents that route from being advertised to an eBGP peer. The right choice during our initial MPLS implementation was now a stumbling block in our potential PfR deployment. It was also clear that we needed matching QoS policies with our MPLS providers. For this reason, and several others, we chose to migrate to a new provider for "MPLS A".
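
The 'no-export' problem is easy to see in a small Python sketch. This is only an illustration of the advertisement rule, assuming plain eBGP/iBGP sessions with no confederations; the AS numbers, the example prefix and the should_advertise() helper are hypothetical, and 0xFFFFFF01 is the well-known NO_EXPORT community value.

```python
from dataclasses import dataclass

NO_EXPORT = 0xFFFFFF01  # well-known NO_EXPORT community value


@dataclass(frozen=True)
class BgpRoute:
    prefix: str
    communities: frozenset = frozenset()


def should_advertise(route, local_asn, peer_asn):
    """Return True if the route may be sent to this peer.

    Simplified: a session is eBGP when the AS numbers differ, and a route
    tagged NO_EXPORT is never sent over an eBGP session.
    """
    is_ebgp = local_asn != peer_asn
    return not (is_ebgp and NO_EXPORT in route.communities)


# A route injected by PfR carries the no-export community (prefix is made up).
injected = BgpRoute("10.20.0.0/16", frozenset({NO_EXPORT}))

# Old design: each core site has its own AS, so core-to-core peering is eBGP.
print(should_advertise(injected, 65001, 65002))  # False

# New design: one shared core AS, so core-to-core peering is iBGP.
print(should_advertise(injected, 65000, 65000))  # True
```

In the old per-site AS design, the injected route never leaves the router that created it. With a shared core AS, it can reach the other core routers while still staying out of the providers' networks.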


Transitioning to a Real Core


We decided to take this opportunity to create an independent Core AS. While site-level BGP AS assignments worked well for our remote locations, the core is a special case. By carving out a small network consisting of core WAN devices and interconnect circuits, we have compartmentalized our core attachments. The following diagram provides a view of the new topology. As before, redundancy has been eliminated to provide clarity.





In the previous topology, as-path length dictated that packets would enter the nearest MPLS network when exiting a core location. In the new design, packets are free to enter and exit the core at the most appropriate point. Overlaying BGP-based PfR on this topology is relatively straightforward, and is nearly the next step in the project. We first have to deal with the placement of our Cisco WAAS WAN accelerators, which will be the topic of a later blog entry. Once WAN accelerator placement is dealt with, we'll be ready to start our PfR pilot.

Sunday, July 19, 2009

The Case for Performance Routing

The migration to layer 3 MPLS networks has reintroduced an old network convergence issue. How do we detect the failure of an end-to-end path? This issue has existed in the LAN since the advent of routing protocols. In the LAN, the question is “How do I detect the loss of my neighbor, when my interface does not fail?”

In the following diagram, Router B's link to Switch Z has failed.




How does Router A know that it no longer has a path to Router B? Depending on the routing protocol and platform, the answer could be loss of hellos or lack of BFD responses. Point-to-point LAN connections also solve this problem, which is why they are highly recommended wherever possible. In the point-to-point case, the physical interface will drop, which gives an immediately actionable signal to the router.
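
A rough sense of scale helps here. The Python sketch below compares worst-case detection times for the mechanisms mentioned above. The timer values are assumptions: 15 seconds and 40 seconds are commonly cited default EIGRP hold and OSPF dead timers on LAN interfaces, and 50 ms with a multiplier of 3 is just an example BFD setting, so check your own platform and configuration. The bfd_detect_seconds() helper is a hypothetical name.

```python
def bfd_detect_seconds(interval_ms, multiplier):
    """BFD declares the neighbor down after `multiplier` missed intervals."""
    return interval_ms * multiplier / 1000.0


# Worst-case time to notice a dead neighbor when the local interface stays up.
detection_seconds = {
    "EIGRP (hold timer 15 s)": 15.0,   # assumed default hold time
    "OSPF (dead timer 40 s)": 40.0,    # assumed default dead interval
    "BFD (50 ms x 3)": bfd_detect_seconds(50, 3),  # example setting
}

for mechanism, seconds in detection_seconds.items():
    print(f"{mechanism}: up to {seconds:g} s")
```

A physical interface drop on a point-to-point link is faster than any of these timers, which is why it remains the preferred signal.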

How does this apply to L3 MPLS VPNs and wide area networks? Picture the carrier's MPLS network as a big switch. When the remote site router at the top of the following diagram loses its connection to MPLS A, it takes several seconds (at best) for the Core A router to lose its BGP routes. Running EIGRP or OSPF with the carrier doesn't help much either, as it is the carrier BGP propagation delay that governs the down time. EIGRP/OSPF is only the carrier's edge protocol; BGP is still the core protocol.



One solution to this issue is to build an overlay network based on static GRE tunnels or DMVPN and run an optimized IGP over it. My opinion is that these options add considerable complexity to what is a relatively simple topology. I try my best to avoid overlay networks, as they tend to increase troubleshooting time. Preserving the any-to-any connectivity that L3 VPNs provide requires a full mesh of GRE tunnels, or in the case of DMVPN, a minor delay while the dynamic tunnels are built. This solution also does not scale as well as a pure BGP-based MPLS environment.

Another solution to this issue is Performance Routing (PfR). This feature enables your Core A and Core B routers to monitor traffic flows through their respective MPLS providers. If packet loss or delay is detected, outbound traffic is dynamically redirected to the functioning path. A corresponding policy on the remote site router handles return traffic. Performance Routing currently has the ability to re-route traffic within three seconds, with plans to reduce that to one second. Performance Routing also detects and reacts to degraded paths, such as intermittent packet loss and high delay or jitter.
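
The decision PfR makes can be pictured with a short Python sketch. This is not Cisco's algorithm, just an illustration of threshold-based exit selection under my own assumptions: the loss and delay thresholds, the PathStats type and the choose_exit() helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathStats:
    name: str
    loss_pct: float   # measured packet loss, percent
    delay_ms: float   # measured delay, milliseconds


def in_policy(stats, max_loss_pct=1.0, max_delay_ms=150.0):
    """A path is 'in policy' when loss and delay are both under threshold."""
    return stats.loss_pct <= max_loss_pct and stats.delay_ms <= max_delay_ms


def choose_exit(current, paths, max_loss_pct=1.0, max_delay_ms=150.0):
    """Pick the provider to use for outbound traffic.

    Stay on the current path while it is in policy (to avoid flapping).
    Otherwise move to the healthiest in-policy path, if there is one.
    """
    by_name = {p.name: p for p in paths}
    if in_policy(by_name[current], max_loss_pct, max_delay_ms):
        return current
    healthy = [p for p in paths if in_policy(p, max_loss_pct, max_delay_ms)]
    if not healthy:
        return current  # nothing better available, so do not move
    return min(healthy, key=lambda p: (p.loss_pct, p.delay_ms)).name


paths = [PathStats("MPLS A", 5.0, 80.0), PathStats("MPLS B", 0.0, 60.0)]
print(choose_exit("MPLS A", paths))  # MPLS B
```

The "stay put while in policy" rule is a design choice in this sketch: it trades the best possible path for stability, which matters for voice traffic that dislikes mid-call path changes.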

Performance Routing (and its predecessor/component, Optimized Edge Routing) has been around for 4-5 years. Cisco is still working to add new features and options. The engineer I spoke with at Cisco Live was quite eager to hear how I wanted to use it in my network, and took several of my suggestions on how to improve the technology. In fact, I owe him an email with the details… I’ll get on that right away!

Over the next few months, we’ll be implementing Performance Routing in our network. Our biggest need is to protect our voice paths. Like most organizations, we’ve practically eliminated dedicated voice trunks in our network. We have call center voice that flows over our MPLS backbone. When our carriers have issues, our first call is from the voice team. My expectation is that while PfR won’t prevent the first few seconds of degradation, it will be able to re-route our voice traffic much faster than our current methods. I’ll post the details of our implementation and results in subsequent blog entries.

Monday, July 13, 2009

Cisco Live 2009 Recap

Two weeks ago, I attended Cisco Live (Networkers) 2009 in San Francisco, CA. This is my third consecutive year attending Cisco's annual user conference. I was a two-time attendee a while back (1999 in New Orleans & 2000 in Orlando) when it was a CCIE recertification requirement. Then I dropped off the social networking map for a half dozen years while I worked 12-hour days, including my commute. I didn't have time to attend training, or a company that would pay for it!

General Thoughts

This year's event seemed much smaller than the last two years, despite Cisco's attempts to make it feel otherwise. A simple view of the dining area made that pretty clear. It was maybe 1/2 the size of the one in Orlando. Anecdotally, I ran into far fewer former colleagues this year as well. In Anaheim, I couldn't walk from one session to another without getting caught up in conversation with a former co-worker. Not so this year.

Perhaps because of the lower attendance, I learned more this year than in Orlando. My personal preference is to have Cisco locate the event outside of a major tourist area. Anaheim, New Orleans and San Francisco were great; Orlando was tougher, probably because my family was nearby. Las Vegas should be interesting as well, as there are many distractions, or so I've heard!

Highlights

My favorite sessions this year were LISP and L2MP. Both gave glimpses into what network design is going to look like in the next 12 - 24 months, assuming ship dates don't slip! For me, L2MP is especially interesting, as my employer is currently building out a pair of Nexus-based Data Centers. In fact, I'm writing this on my way to Raleigh for a Nexus CPOC event. Like many, we'll be using vPC to take advantage of parallel links, but Spanning Tree is still going to block a number of uplinks. L2MP / TRILL will allow for much higher overall utilization of redundant DC links.

LISP is the future of Internet routing. Its primary goal is significantly reducing the size of the Internet routing table, while giving a secondary benefit of Internet tail circuit load balancing for everyone who signs up. There is some work to do on the pricing model, and the end user education component could be tricky, but once those are worked out, the solution should take off. It certainly appears that the basic technology has been hammered out.

Both LISP and TRILL deserve individual blog postings, which I hope to get to soon.

Keynotes

I was surprised to see that Guy Kawasaki was scheduled to give the closing keynote address. After two consecutive comedians, it felt like Cisco was getting a bit too practical. Fortunately I was way off-base. Guy Kawasaki was as funny as Ben Stein or John Cleese, and perhaps significantly more relevant to the tech audience. I still chuckle when thinking of his "Unique" and "Value" quadrant, especially the upper-left corner.

The two Cisco-based keynotes were interesting, but not especially 'new.' Most of the key points have been covered in previous events (30 market adjacencies, etc), and all of the technology was at least familiar. I guess that's the drawback of attending annually.

Customer Appreciation Event

This was probably the most disappointing part of the conference. I have to say that the Customer Appreciation Event seemed a little flat. Perhaps the 80s theme didn't quite appeal to me, although I am a fan of the decade. Devo was surprisingly energetic and fun, even if I only knew a couple of songs. The 70-minute, 4-mile bus ride might have warped my view of the event as well, so I might not be the best person to offer an opinion of the entertainment.

Takeaways

The biggest technology takeaway for me was that Performance Routing is definitely doable in our environment. I attended a pair of PfR sessions, and had a great one hour Meet the Engineer discussion where we covered our current L3 topology and figured out the best way to implement the technology. We're a typical two MPLS provider environment, and we've had more than our share of real-time traffic pain when our providers have issues. In its current implementation, Performance Routing can take our outage times down to 3 seconds, with plans to dial that down to 1 second in most cases. I will cover our implementation in future blog posts.

Other takeaways include:

- IPv6 is still getting tons of hype. I'm keeping abreast of the basic technology. No plans to implement or even pilot in the next 12 months though, so I didn't attend any IPv6 sessions. It's difficult to justify spending time on IPv6 without a business case for implementation.

- CCDE interest appears to be ramping up. That's a good thing, as I don't want the certification to 'die on the vine'. I was concerned that the low pass rate would drive interest down, but that doesn't seem to be the case.

- I'm looking forward to next year's event. The next time I attend, I'll have NetVet status, which should be interesting. I don't have anything specific to say to the Cisco CEO, but I've heard that the CCIE NetVet reception is worth attending, so I'll certainly make it part of my schedule. #cllv on twitter, or so I've heard.

Monday, July 6, 2009

Thoughts on the Cisco Certified Architect Announcement

Now that I've had a few days to investigate and digest the announcement, I have a few thoughts. First, the lack of an acronym is going to be frustrating. According to Cisco, 'CCA' is used by other Cisco products, so it would be confusing. In addition, CCA is the abbreviation for the Citrix Architect certification program. I guess we'll get over it. Perhaps Cisco can purchase Citrix (which I actually think would be a great idea for business reasons) to settle this!


Format


The board format is very interesting. As I understand the program at this time, the candidate will receive a packet of information, including an RFP, in the mail several weeks prior to the board meeting. On his/her own time, the candidate will prepare the necessary documentation to respond to the RFP. The RFP is then submitted, and the board meeting is conducted. The candidate will be judged on the RFP response itself, his/her ability to defend the response, and the candidate's ability to incorporate a last-minute change to the RFP.

This sounds rigorous, which is appropriate for a certification at this level. Because the RFP response is to be prepared off-site, I am certain it will be held to a high standard. This is very similar to my experience as a consultant, except that the deadline doesn't seem quite as tight as I recall in the real world. :) Cheating on this part of the process is possible, but because of the need to defend the RFP and incorporate a change during the board exam, it clearly would lead to failure.


Skills


The hard skills required for this certification closely mirror the CCDE. This is very much a test of soft skills, such as documentation ability, presentation skills, and ability to withstand the scrutiny and pressure of a board exam. It is highly unlikely that candidates for this exam will fail due to technical deficiencies.


Prerequisites


The two prerequisites are an active CCDE and 10 years of experience. I agree with the need for a CCDE, as the new certification builds on the skills required for the CCDE. I don't agree with the point of view that the CCIE should be a prerequisite, as the Cisco Certified Architect doesn't require any of the hard skills that the CCIE tests. That said, for the foreseeable future, all candidates for the Cisco Certified Architect program will have CCIEs anyway, so it's basically a moot point. While I don't know all of the CCDE candidates, the ones I am aware of all have their CCIEs.


The Cost


The published cost of this exam is $15,000. My initial reaction is that this will limit the exam to employees of consulting companies, as they have the most benefit to gain from advertising their employees' certifications. Eventually, this certification will be pursued and paid for by engineers who work directly for large companies. Like most other expensive educational pursuits, candidates need to see a payoff (nearly always financial) before they will be willing to spend money. Once the Cisco Certified Architect program receives significant market awareness, that payoff will exist in the form of higher salary and better job opportunities.

The Cisco Certified Architect program is being positioned as an MBA equivalent, but there are three significant differences:

- An MBA program is education, with the by-product of a 'certification', in the form of a Master's degree. The Cisco Certified Architect program does not have an educational component; it is only a certification.

- MBA candidates have a very reasonable expectation of achieving their Master's degree when they commit to a program. Success in the Cisco Certified Architect program is far less certain, especially for the early applicants. It will be very interesting to see the success rate of the first wave, if that information is made public.

- Employers are quite willing to contribute to an MBA program, due to the educational component. It's a harder sell to ask them to pay for a certification.


How Does This Affect the CCIE Program?


There is a spirited discussion of the new cert on the Cisco Learning Network. The major point of contention there (and at the CCIE NetVet reception at Cisco Live, so I've heard) is that the cert is positioned 'above' the CCIE, but doesn't require the certification. The general complaint is that this devalues the CCIE program. It has been eleven years since I earned my CCIE in Routing & Switching. In that time, I believe the CCIE program has become marginally less valuable. It is simple economics: The rate of new CCIEs has exceeded the growth of the networking industry since approximately 1997. From 1997 to 2001, it didn't matter, as we started that era with so few CCIEs that even the large growth (over 1000 per year, where the rate was several hundred per year prior) did not soak up market demand. In 2001, the stock market crash finally affected CCIE employment to the point that CCIEs had difficulty finding work. Since then, the rate of CCIE generation has remained high, and probably slightly above the rate of CCIE job creation. Do you know any CCIEs that are unemployed or underemployed? That's anecdotal evidence of my theory.

Does this argue for making the CCIE more difficult, or reducing the rate of new CCIEs? My opinion is no. The CCIE program administrators have tried to keep the level of difficulty the same, so the cert would not be cheapened in any way. I don't see any reason why good engineers should be made to jump higher hurdles than I did to achieve their CCIEs.

Does the Cisco Certified Architect program devalue the CCIE? Perhaps slightly. Up to this point, employers have used CCIE numbers as a proxy for architecture ability. The lower the number, the more 'seasoned' the engineer. Coupled with a strong resume, this defined the engineer as an architect. If the CCDE and Cisco Certified Architect programs become well recognized in the industry, this will likely change. The irony is that the CCIE program never made any claims about an engineer's ability to design a network. From what I remember of my lab, I would not have classified it as even remotely well designed, with four routing protocols on eight devices! :) Without a specific standard, the industry found it necessary to identify network architects via this and other means. Cisco is now trying to rectify this situation, while at the same time solidifying its position as the thought leader in network architecture. This should eventually reduce bad hires for companies that require architects.


Overall Impression


This is a great step forward for Network Architects. It'll be a bumpy ride for the next few years, but eventually we'll wonder how we did without this differentiator.