Monday, September 28, 2009

Troubleshooting an Application Problem

I grew up knowing I would be a computer programmer… then I reached my University years and realized I had no passion for it.  I still pursued and attained my Computer Science degree, but I gravitated towards non-programming courses like Computer Networks, System Architecture and Telecommunication Systems.  Prior to losing my interest, I was quite good at it, and I believe my familiarity with the subject helps me to understand why Cisco IOS and other network OSs work the way they do.  My programming background also leads me to offer opinions on application development.

My first consulting project for RPM Consulting was for Philadelphia Newspapers, the owner of the Philadelphia Inquirer and Daily News papers.  Most of my efforts were focused on a new LAN design for their downtown office, which faced the familiar constraints of old, big city office buildings, like limited conduit space and union labor for all physical cable moves.  While this was interesting, I was drawn to the secondary project:  troubleshooting a new Unisys-provided newswire service application.

This custom-built application was intended to provide a centralized, searchable database of all newswire stories.  In the Unisys lab, and then again in the Inquirer lab, it worked perfectly.  Users could access story titles and abstracts, and if they were intrigued, a single click would retrieve the full content.  When the system was deployed to the first batch of users, it failed miserably.  Downloading the story list was quick, but it took several minutes to download each individual story.  The application developer blamed the network (big surprise).  A colleague and I borrowed RPM’s Network General lunchbox sniffer for a day to get the data we needed to troubleshoot.  Our packet traces revealed that the application was built on UDP, not TCP as I was expecting.

As our studies have taught us, TCP has built-in reliability in the form of sequence numbers, acknowledgements and retransmissions.  UDP provides none of these features, so application programmers who choose UDP must provide their own reliability mechanisms.  In this case, the developer coded the client application to send an acknowledgement once the full story was received.  If the story was not received within a predetermined time period (approximately two minutes), the client application repeated its request.  If even a single UDP packet was dropped, this timeout was the only mechanism for detecting the loss.  That was fine for a lossless lab network, but in a production network packets are sometimes dropped, especially in this particular network, which was in the midst of a Token Ring to Ethernet conversion.
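To see why a single whole-story timeout performs so badly, consider the expected delay it adds.  Here's a minimal sketch in Python; the packet count and loss rate are illustrative assumptions, not measurements from that network:

```python
def expected_story_delay(n_packets, loss_rate, timeout_s):
    """Expected extra delay when the ONLY recovery mechanism is a
    whole-message timeout: one lost packet forces a full re-request."""
    # Probability that every packet in the story arrives intact.
    p_success = (1.0 - loss_rate) ** n_packets
    # Attempts follow a geometric distribution: expected count is 1 / p_success.
    expected_attempts = 1.0 / p_success
    # Each failed attempt costs one full timeout before the retry.
    return (expected_attempts - 1.0) * timeout_s

# A 50-packet story with 1% packet loss and the two-minute timeout:
print(round(expected_story_delay(50, 0.01, 120), 1))  # 78.3 seconds of added delay
```

With TCP-style per-segment retransmission, that same 1% loss would cost only a few round-trip times per story, which is exactly why rewriting for TCP was the right recommendation.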

Due to the ongoing conversion project, there was very little I could do to mitigate the packet loss in the short term.  Our recommendation was for the application developer to rewrite the application to use TCP.  Fortunately, my colleague was a former Unisys employee, so that task fell to him.  The interim solution was to dial down the timeout value from two minutes to a few seconds.  I’m not certain either solution was accepted, as my participation in that consulting assignment ended before a solution to the application problem was implemented.

Monday, September 21, 2009

Defining a Quality of Service Policy

Let’s get this out of the way at the beginning:  Quality of Service is my favorite part of network design.  The fact that we now have enough power within our network devices to inspect packets and make on-the-fly DSCP value changes is amazing.  It doesn’t seem that long ago that we had to optimize our ACLs to keep the router CPU under 99%.  But remember:  with power comes responsibility.

Quality of service often gets a bad reputation for a pair of reasons:

  1. It requires micromanagement of applications
  2. It allows (or forces) network administrators to play favorites

Both of these are certainly possibilities, but a well-designed QoS policy doesn’t require either to be true.

Some would say QoS also gets a bad rap because it is often poorly implemented, but that wouldn’t happen to us, right? :)

 

My Current QoS Policy

I designed and led the implementation of a ten-class QoS policy for my organization.  Ten classes?!  I’m sure this seems like overkill, but there is a good reason for each class.  Here are the classes, plus the DSCP values associated with them:

Class Name            DSCP Value(s)     Traffic Type / Example
Internetwork-Control  CS6               Routing Protocols
Telephony             EF, CS3           VoIP, VoIP Signaling
Video-Conference      AF41, AF42, AF43  Realtime Video Conferencing
Video-Live-Streaming  AF31, AF32, AF33  Live Webcast
Video-Backbone        CS4               Realtime Video Transfers
Transactional-Data    AF21, AF22, AF23  Interactive Data
OAM                   CS2               Network Mgmt (SNMP, TACACS)
Bulk-Data             AF11, AF12, AF13  Data Transfer Apps (FTP, CIFS)
Standard              Default (DSCP 0)  Uncategorized Traffic
Scavenger             CS1               Undesired / Out-of-Spec Traffic

A bit of clarification around the three video classes is necessary.  We have two important video-based applications.  We have deployed a video-conferencing solution to a significant number of our locations.  This traffic is marked as AF41.  We also have scheduled live webcasts that consist of two traffic types.  The obvious one is the actual live video stream, which is delivered via multicast over our WAN to all remote locations.  This traffic is marked AF31.  The second type is the source material that creates the webcast.  These events often require remote camera feeds from other internal locations.  Originally, we would ‘roll a truck’ and use a satellite uplink to get the feed back to our HQ office.  Now, we attach the camera(s) to an encoder and convert the audio/video into a UDP packet stream.  This is then sent to our HQ site for A/V mixing to create the live stream.  It’s a great solution, and it has paid for itself many times over already (satellite trucks are expensive!).  The pre-work is significant on the network side, as we need to temporarily modify our QoS policy to accommodate the new stream, but it is worth the effort.
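The DSCP names in the table aren’t arbitrary:  the Assured Forwarding and Class Selector code points follow a fixed bit pattern (RFC 2597 and RFC 2474).  A quick sketch of the arithmetic:

```python
def af_dscp(af_class, drop_precedence):
    """AFxy decimal value: three class bits then two drop-precedence bits,
    i.e. 8 * x + 2 * y (RFC 2597)."""
    return 8 * af_class + 2 * drop_precedence

def cs_dscp(precedence):
    """Class Selector values reuse the old IP Precedence bits (RFC 2474)."""
    return 8 * precedence

EF = 46  # Expedited Forwarding (Telephony) is a standalone code point

# Values from the table above:
print(af_dscp(4, 1))  # AF41 -> 34 (Video-Conference)
print(af_dscp(1, 3))  # AF13 -> 14 (Bulk-Data, highest drop precedence)
print(cs_dscp(1))     # CS1  -> 8  (Scavenger)
```

Knowing the pattern makes it easy to sanity-check marking configurations when a tool shows you the raw decimal value instead of the name.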

Let’s dedicate a paragraph to each of the classes to give an idea of why they exist.  These are loosely ordered based on importance, but as you will see, everything gets a bandwidth guarantee:

Internetwork Control

This class includes all of our routing protocol traffic, as well as the Lightweight Access Point Protocol (LWAPP) signaling traffic between our access points and the Wireless LAN Controllers (WLCs).  These are generally light traffic loads, so the queuing allocation can be small.

Telephony

We have chosen to combine voice and voice signaling traffic into a single traffic class.  I’ve even instructed our voice team to mark signaling traffic as EF.  We then overprovision the priority queue by 15% or so to allow for the extra traffic.

Video-Conference

Our video-conference equipment marks all stream traffic as AF41.  This traffic is RTP, and the application is very sensitive to dropped packets.  Because this application is real-time, it is also sensitive to delay and jitter.  While Cisco advises network admins to use a priority queue for their TelePresence system, I’ve been able to get by with a standard bandwidth allocation, which preserves our priority-queuing capacity for the Telephony traffic.

Video-Live-Streaming

As mentioned above, the live video stream is delivered via multicast, using RTP packets.  Much like our Video-Conference traffic, this is highly sensitive to dropped packets.  Jitter and delay are less of a concern, due to the one-way nature of this stream.  It’s like watching TV… As long as the signal is solid, you don’t really care if it is delayed by a second or two.

Video-Backbone

This is very much like Video-Live-Streaming, except it flows in the opposite direction.  It is also a much higher-bandwidth stream.  We gave some thought to combining this with the Video-Live-Streaming class, but in the end we decided it was much clearer to give it a unique identifier.

Transactional-Data

This is the first of three end user application classes.  We mark all interactive applications with AF21.  I define interactive as a command/response relationship, like Telnet or Remote Desktop.  The user types or clicks something, then waits for a response before entering the next command.  Most (but not all) of these applications are low bandwidth.  They are generally delay sensitive, as the user is constantly monitoring the application.  This is where we place some of our internally-developed applications.

OAM

This queue is used for network management traffic.  Applications such as NetFlow, SNMP and TACACS+ are placed in this queue.  Historically it has been difficult to get this traffic properly marked, as it usually sources from a router or switch, but this has gotten better with Cisco’s movement toward dedicated management ports on equipment.

Bulk-Data

This is the second end user application class.  All data transfer applications go into this queue.  These applications will often take as much bandwidth as is available, and they are not terribly sensitive to delay or jitter.  We further characterize these apps as time-sensitive (AF11) and time-insensitive (AF13), which allows us to use a WRED policy to selectively drop packets from the less critical transfers.  The time-insensitive category is great for NetApp SnapMirror, Windows background updates, patch deployments, etc.  This gives priority to user-facing apps like Windows File Sharing.
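The WRED behavior described above is a drop-probability ramp:  nothing is dropped below a minimum average queue depth, drop probability rises linearly up to a maximum, and everything is tail-dropped beyond the max threshold.  Giving the time-insensitive traffic a lower minimum threshold is what makes it back off first.  A simplified sketch, with illustrative thresholds rather than our production values:

```python
def wred_drop_prob(avg_depth, min_th, max_th, max_p):
    """Classic RED drop curve for one drop-precedence profile."""
    if avg_depth < min_th:
        return 0.0          # queue shallow: never drop
    if avg_depth >= max_th:
        return 1.0          # queue beyond max threshold: tail drop
    # Linear ramp between the two thresholds.
    return max_p * (avg_depth - min_th) / (max_th - min_th)

# AF11 (time-sensitive) gets gentler thresholds than AF13 (time-insensitive):
AF11_PROFILE = dict(min_th=30, max_th=40, max_p=0.10)
AF13_PROFILE = dict(min_th=20, max_th=40, max_p=0.10)

depth = 32  # average queue depth, in packets
print(round(wred_drop_prob(depth, **AF11_PROFILE), 3))  # 0.02
print(round(wred_drop_prob(depth, **AF13_PROFILE), 3))  # 0.06
```

At the same queue depth, the AF13 transfers see three times the drop probability, so TCP throttles them back while the AF11 flows are largely untouched.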

Default

Anything that isn’t specifically marked into another queue falls into the Default queue.  Most HTTP-based apps are here.  The majority of my organization’s traffic falls into this queue.  It receives a healthy allocation of interface bandwidth as a result.

Scavenger

The Scavenger class is reserved for traffic that either exceeds normal network patterns or is known to be unnecessary.  We have defined a simple end user port policy:  any user may send up to 5 Mb/s of traffic into the network, and any traffic which exceeds this value is marked as CS1.  This promotes a level of fairness between users.  This queue is given a 1% bandwidth allocation at all network chokepoints, so it is the first to go when there is congestion.  As for the ‘known unnecessary’ category, I’d prefer to get rid of the application rather than penalize the traffic.
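The 5 Mb/s per-user policy is conceptually a single-rate token-bucket policer:  the bucket refills at the policed rate, conforming traffic keeps its marking, and traffic that finds the bucket empty is remarked to CS1 rather than dropped.  A minimal sketch, with an assumed (not production) burst size:

```python
class TokenBucket:
    """Single-rate policer: conforming traffic keeps its marking,
    exceeding traffic is remarked to Scavenger (CS1 = DSCP 8)."""

    def __init__(self, rate_bps, burst_bits):
        self.rate = rate_bps
        self.burst = burst_bits
        self.tokens = burst_bits   # bucket starts full
        self.last = 0.0

    def police(self, now, size_bits, dscp):
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size_bits <= self.tokens:
            self.tokens -= size_bits
            return dscp            # conform: keep the original marking
        return 8                   # exceed: remark to CS1

# 5 Mb/s with an illustrative 1.5 Mb burst allowance:
tb = TokenBucket(rate_bps=5_000_000, burst_bits=1_500_000)
print(tb.police(0.0, 1_500_000, dscp=0))  # 0 -> initial burst conforms
print(tb.police(0.0, 12_000, dscp=0))     # 8 -> bucket empty, remarked to CS1
print(tb.police(1.0, 1_500_000, dscp=0))  # 0 -> one second of refill restores the bucket
```

Remarking instead of dropping is the key design choice here:  the traffic still flows when the network is idle, and only loses out at congested chokepoints.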

 

What’s Missing?

Notice there isn’t a ‘Priority Applications’ queue in this model.  We purposely defined our queues based on traffic characteristics, not application priorities.  This has served us well, and has spared us the cajoling many experience when they ask their users to define ‘important’ applications.

Will this model work for everyone?  Probably not.  Every organization has a unique set of applications.  If you have already agreed to prioritize something, it can be difficult to back out of that promise.  If you haven’t done so, I suggest you avoid heading down that path.  Think hard about the characteristics of your traffic, and attempt to build your classes around those characteristics.

In a future blog post I will show how this policy translates to a packet-marking configuration, and later, an interface queuing policy.  I am a proponent of MPLS networks, which introduce another wrinkle to the equation:  none of my incumbent providers offers a ten-class QoS scheme.  I will show what we’ve done to adapt our policy to those constraints.

Monday, September 14, 2009

Time for IPv6?

It is Fiscal Year 2010 budget time at my company.  We have just completed a rather expensive three-year cycle of hardware upgrades, so it is refreshing to look at the modest capital costs I am requesting for 2010.  Of course, depreciation is eating up a significant portion of our budget for the next couple of years, so I would be hesitant to ask for a large allocation this year anyway.

Every year since I started with my current employer, my potential project list begins with “Implement IPv6”.  And every year, I quickly move it to the following year’s potential project list.  While I pondered it a split-second longer this time around, I still moved it to the 2011 list.  I see no compelling reason to foist a new protocol on my employer over the next 12 – 16 months.

Please understand, I’m not dense when it comes to the need for IPv6 in the world.  I am able to comprehend IPv4 address exhaustion (I read Geoff Huston’s excellent blog at potaroo.net), and I see the benefits to IPv6.  In 2000, I successfully argued that it was in my (then) employer’s best interest to pay for my purchase of an IPv6 book, so I could digest the information and teach my consulting co-workers how to plan for the protocol.  “It can be a differentiator!” I explained.  Is there a more perfect project for the PDIM cycle? (Plan, Design, Implement, Manage, or feel free to substitute your own consulting methodology)  Much of the information in that book became obsolete, and I’m fairly certain I recycled it years ago.

So why not this year?  In brief, one of the following things must happen for me to push forward with an IPv6 project:

- A compelling application is released that requires or would benefit from IPv6.  Definitely not a reality yet.  I’m holding out hope that VMware will realize that VMotion and Mobile IPv6 are a nice match.  So far, I haven’t heard anything.  (I really want to spend some time developing my thoughts on this into a coherent blog post)

- Our customers start clamoring for it.  We expect this to happen in our ASPAC businesses eventually, but nothing yet.  Maybe I’m naive, but I am confident that IPv6 users will have gateways into the IPv4 world.  After all, who would be willing to sign up with an ISP that can’t reach the ‘real’ Internet?  Even the then-mighty AOL had to abandon that business model in the ’90s.

- Our business partners require it.  We’re in a regulated industry, so there is a fair amount of government and quasi-government involvement.  I assume they will be the first business partners to move to IPv6, but I haven’t even heard of any inquiries about our IPv6 status.

To generalize, we’re not a bleeding edge infrastructure company.  We would not derive any benefit from being a first mover.  I like new technology as much as the next guy, but I have a responsibility to make logical, defensible recommendations to my business leaders.  That’s why “Implement IPv6” is moving to the 2011 project list.  Of course, I’m hedging my bet by including a “Verify IPv6 Compatibility” project on the 2010 list, just like last year, and the year before that, etc.  It’s a low-cost project, and if the need for IPv6 comes on suddenly, we’ll be prepared to meet it with code upgrades, not hardware purchases.

If you are planning your own 2010 projects, I encourage you to think through your own situation.  It may be different from mine, or you may have more work to do to prepare for IPv6.

Jeremy

Tuesday, September 8, 2009

Technical Writing.. Books or Blogs?

I’ve been thinking lately about whether it still makes sense to write technical books.  Years ago, I had a contract to write a book for Cisco Press.  Due to a job change, I was only able to complete a few chapters, and after handing it off to another author, it seemed to die a quiet death.  I used to be able to find a reference to it on one of the Amazon websites, but a recent search turned up nothing.

About six months ago, the writing bug bit me again, and I created a proposal for “Enterprise Network Designs” and sent it off to Cisco Press for feedback.  In the (long) time between sending that email and receiving a meaningful response, I chose instead to begin writing this blog.  Blogging fits my schedule better, and it provides a more interactive communication platform.  IIRC, I felt a good deal of deadline pressure and a bit of nervousness about making mistakes.  Once information is committed to paper, it’s difficult to fix it.

As a result of these events, I’ve been trying to determine if we’ve reached the end of the technical book publishing industry.  I don’t recall the Cisco Press contract being especially lucrative.. something like 10 – 15 percent of gross revenues go to the author, minus some expenses.  As several technical authors have mentioned, you don’t get into the field for the money.  Maybe they’re just trying to keep the competition out, but for some reason I doubt that’s the case.

Why wouldn’t technical authors go the blog route, and cut out the publishing middleman?  This would eliminate much of the overhead of publishing, as well as free the author from official deadlines.  Revenue can be generated by monetizing the blog, as well as follow-on contract work.  If the content is well-written and relevant, the professional prestige gained from the effort should be comparable to being a published author.  Technical book readers are by definition technical, so reading a blog should be well within their comfort zone.

 

What would the author be missing?  A few items are:

Deadlines – Is that good or bad?  Depends on the author, I suppose

Copy Editor – Go with a freelance editor?

Book Signings – Not sure a substitute is available for the blogger

Seeing Name in Print – Ditto.. no obvious substitute available

Copyright Protection – There must be a good solution to this, right?

Publisher Credibility – Cisco-related books from Cisco Press probably significantly outsell books from Pearson Publishing, even though they’re the same publishing house.  The lack of a well-known imprint could make it difficult to build an audience.

 

What are the advantages?

Full control over content

Easier publishing process (arguable, I suppose)

Infinite ability to revise content after publishing

Better interaction with readers

Better for the environment, if that is meaningful

 

I would think it would be relatively easy to monetize the content through eBook platforms like the Kindle.  Some technical blogs are already syndicated on that platform, such as Jeremy Stretch’s Packet Life blog.  Thirty cents per month per user probably isn’t terribly lucrative, but for hassle-free revenue, why not?

Monetization Strategy

Google Adwords

eBook syndication

Professional consulting

Partnership with training vendor (depending on blog content)

 

 

I’d love to get some feedback, especially from authors (of books and blogs).  Have you considered something like this, or are you already doing it?  What are the challenges you’ve faced?