Monday, September 21, 2009

Defining a Quality of Service Policy

Let’s get this out of the way in the beginning.  Quality of Service is my favorite part of network design.  The fact that we now have enough power within our network devices to inspect packets and make on-the-fly DSCP value changes is amazing.  It doesn’t seem that long since we had to optimize our ACLs to keep the router CPU under 99%.  But remember:  With power comes responsibility.

Quality of service often gets a bad reputation for a pair of reasons:

  1. It requires micromanagement of applications
  2. It allows (or forces) network administrators to play favorites

Some would say it also gets a bad rep because it is often poorly implemented, but that wouldn’t happen to us, right? :)

Micromanagement and favoritism are certainly possibilities, but a well-designed QoS policy doesn’t require either to be true.

 

My Current QoS Policy

I designed and led the implementation of a ten-class QoS policy for my organization.  Ten classes?!  I’m sure this seems like overkill, but there is a good reason for each class.  Here are the classes, plus the DSCP values associated with them:

Class Name              DSCP Value(s)        Traffic Type / Example
----------------------  -------------------  ------------------------------------
Internetwork-Control    CS6                  Routing Protocols
Telephony               EF, CS3              VoIP, VoIP Signaling
Video-Conference        AF41, AF42, AF43     Realtime Video Conferencing
Video-Live-Streaming    AF31, AF32, AF33     Live Webcast
Video-Backbone          CS4                  Realtime Video Transfers
Transactional-Data      AF21, AF22, AF23     Interactive Data
OAM                     CS2                  Network Mgmt (SNMP, TACACS)
Bulk-Data               AF11, AF12, AF13     Data Transfer Apps (FTP, CIFS)
Standard                Default (DSCP=0)     Uncategorized Traffic
Scavenger               CS1                  Undesired / Out of Spec Traffic
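
For readers who think in decimal DSCP values rather than names, here is a minimal Python sketch of the table above. The conversion rules (CSx = 8x, AFxy = 8x + 2y, EF = 46) are the standard DiffServ definitions; the names dscp_value, QOS_CLASSES and class_for_dscp are made up for this illustration and are not part of any router configuration.

```python
# Sketch only: maps the ten classes in the table to decimal DSCP values.
# Function and variable names are hypothetical, chosen for illustration.

def dscp_value(name: str) -> int:
    """Convert a DSCP name such as 'AF41', 'CS6' or 'EF' to its decimal value."""
    name = name.upper()
    if name == "EF":
        return 46
    if name == "DEFAULT":
        return 0
    if name.startswith("CS") and name[2:].isdigit():
        selector = int(name[2:])
        if 0 <= selector <= 7:
            return 8 * selector          # class selector: top three bits only
    if name.startswith("AF") and len(name) == 4 and name[2:].isdigit():
        af_class, drop = int(name[2]), int(name[3])
        if 1 <= af_class <= 4 and 1 <= drop <= 3:
            return 8 * af_class + 2 * drop   # AF class plus drop precedence
    raise ValueError(f"unknown DSCP name: {name}")

# The ten classes and their markings, copied from the table.
QOS_CLASSES = {
    "Internetwork-Control": ["CS6"],
    "Telephony": ["EF", "CS3"],
    "Video-Conference": ["AF41", "AF42", "AF43"],
    "Video-Live-Streaming": ["AF31", "AF32", "AF33"],
    "Video-Backbone": ["CS4"],
    "Transactional-Data": ["AF21", "AF22", "AF23"],
    "OAM": ["CS2"],
    "Bulk-Data": ["AF11", "AF12", "AF13"],
    "Standard": ["DEFAULT"],
    "Scavenger": ["CS1"],
}

def class_for_dscp(value: int) -> str:
    """Return the class for a decimal DSCP value; unlisted markings fall into Standard."""
    for class_name, names in QOS_CLASSES.items():
        if value in (dscp_value(n) for n in names):
            return class_name
    return "Standard"

if __name__ == "__main__":
    for class_name, names in QOS_CLASSES.items():
        print(f"{class_name:22} {[dscp_value(n) for n in names]}")
```

Sending unlisted markings to Standard mirrors the rule that anything not specifically marked into another queue lands in the default queue.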

A bit of clarification around the three video classes is necessary.  We have two important video-based applications.  The first is a video-conferencing solution that we have deployed to a significant number of our locations.  This traffic is marked as AF41.

The second is our scheduled live webcasts, which consist of two traffic types.  The obvious one is the actual live video stream, which is delivered via multicast over our WAN to all remote locations.  This traffic is marked AF31.  The less obvious one is the source material that creates the webcast, which is marked CS4 (the Video-Backbone class).  These events often require remote camera feeds from other internal locations.  Originally, we would ‘roll a truck’ and use a satellite uplink to get the feed back to our HQ office.  Now, we attach the camera(s) to an encoder and convert the audio/video into a UDP packet stream.  This is then sent to our HQ site for A/V mixing to create the live stream.

It’s a great solution, and it has paid for itself many times over already (satellite trucks are expensive!).  The pre-work is significant on the network side, as we need to temporarily modify our QoS policy to accommodate the new stream, but it is worth the effort.

Let’s dedicate a paragraph to each of the classes to give an idea of why they exist.  These are loosely ordered based on importance, but as you will see, everything gets a bandwidth guarantee:

Internetwork Control

This class includes all of our routing protocol traffic, as well as the Lightweight Access Point Protocol (LWAPP) signaling traffic between our access points and the Wireless LAN Controller (WLC).  These are generally light traffic loads, so the queuing allocation can be small.

Telephony

We have chosen to combine voice and voice signaling traffic into a single traffic class.  I’ve even instructed our voice team to mark signaling traffic as EF.  We then overprovision the priority queue by 15% or so to allow for the extra traffic.
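
The overprovisioning arithmetic can be sketched in a few lines of Python. The numbers here are assumptions for illustration only: 80 kb/s per call is the usual layer-3 figure for G.711 with 20 ms packetization, and the call count is invented. Only the 15% headroom comes from the paragraph above. The function name priority_queue_kbps is hypothetical.

```python
# Sketch only: size a priority queue for voice plus ~15% headroom for signaling.
# per_call_kbps=80 assumes G.711 at 20 ms packetization, measured at layer 3.

def priority_queue_kbps(concurrent_calls: int,
                        per_call_kbps: int = 80,
                        headroom_percent: int = 15) -> int:
    """Return the priority-queue size in kb/s, rounded up to a whole kb/s."""
    if concurrent_calls < 0 or per_call_kbps < 0 or headroom_percent < 0:
        raise ValueError("inputs must be non-negative")
    voice_kbps = concurrent_calls * per_call_kbps
    # Integer arithmetic avoids floating-point rounding surprises.
    scaled = voice_kbps * (100 + headroom_percent)
    return (scaled + 99) // 100   # ceiling division by 100

if __name__ == "__main__":
    # Ten concurrent calls (an invented figure) need 800 kb/s of voice bandwidth.
    print(priority_queue_kbps(10))
```

Layer-2 overhead varies by link type, so a real sizing exercise would start from a per-call figure measured on the link in question.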

Video-Conference

Our video-conference equipment marks all stream traffic as AF41.  This traffic is RTP, and the application is very sensitive to dropped packets.  Because this application is real-time, it is also sensitive to delay and jitter.  While Cisco advises network admins to use a priority queue for their TelePresence system, I’ve been able to get by with a standard bandwidth allocation, which preserves our priority-queuing capabilities for the Telephony traffic.

Video-Live-Streaming

As mentioned above, the live video stream is delivered via multicast, using RTP packets.  Much like our Video-Conference traffic, this is highly sensitive to dropped packets.  Jitter and delay are less of a concern, due to the one-way nature of this stream.  It’s like watching TV… As long as the signal is solid, you don’t really care if it is delayed by a second or two.

Video-Backbone

This is very much like Video-Live-Streaming, except it flows in the opposite direction.  It is also a much higher-bandwidth stream.  We gave some thought to combining this with the Video-Live-Streaming class, but in the end we decided it was much clearer to give it a unique identifier.

Transactional-Data

This is the first of three end-user application classes.  We mark all interactive applications with AF21.  I define an interactive application as one with a command/response relationship, like Telnet or Remote Desktop.  The user types or clicks something, then waits for a response before entering the next command.  Most (but not all) of these applications are low bandwidth.  They are generally delay sensitive, as the user is constantly monitoring the application.  This is where we place some of our internally developed applications.

OAM

This queue is used for network management traffic.  Applications such as NetFlow, SNMP and TACACS+ are placed in this queue.  Historically it has been difficult to get this traffic properly marked, as it is usually sourced from the router or switch itself, but this has gotten better with Cisco’s movement towards dedicated management ports on equipment.

Bulk-Data

This is the second end-user application class.  All data transfer applications go into this queue.  These applications will often take as much bandwidth as is available.  They are not terribly sensitive to delay or jitter.  We further characterize these apps as time sensitive (AF11) and time insensitive (AF13).  This allows us to use a WRED policy to selectively drop packets from less critical transfers.  The time-insensitive category is great for NetApp SnapMirror, Windows background updates, patch deployments, etc.  Dropping from those transfers first gives priority to user-based apps like Windows File Sharing.
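
To make the WRED idea concrete, here is a rough Python sketch of the textbook WRED drop curve with a separate profile per drop precedence. The thresholds and maximum drop probability are invented for illustration and are not taken from our policy. The point is only that AF13 starts dropping at a shallower average queue depth than AF11.

```python
# Sketch only: classic WRED drop curve, one profile per AF drop precedence.
# All thresholds and probabilities below are hypothetical example values.
from dataclasses import dataclass

@dataclass(frozen=True)
class WredProfile:
    min_threshold: int      # average queue depth (packets) where random drops begin
    max_threshold: int      # average queue depth at which every packet is dropped
    max_probability: float  # drop probability reached just below max_threshold

def drop_probability(avg_depth: float, profile: WredProfile) -> float:
    """Return the probability that an arriving packet is dropped."""
    if avg_depth < profile.min_threshold:
        return 0.0
    if avg_depth >= profile.max_threshold:
        return 1.0
    span = profile.max_threshold - profile.min_threshold
    return profile.max_probability * (avg_depth - profile.min_threshold) / span

# Time-insensitive transfers (AF13) begin dropping earliest.
BULK_DATA_WRED = {
    "AF11": WredProfile(min_threshold=32, max_threshold=40, max_probability=0.1),
    "AF12": WredProfile(min_threshold=28, max_threshold=40, max_probability=0.1),
    "AF13": WredProfile(min_threshold=24, max_threshold=40, max_probability=0.1),
}

if __name__ == "__main__":
    for depth in (20, 26, 30, 36, 40):
        row = {name: round(drop_probability(depth, p), 3)
               for name, p in BULK_DATA_WRED.items()}
        print(depth, row)
```

At an average depth of 30 packets in this example, AF13 packets already face random drops while AF11 packets are untouched, which is the selective behavior described above.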

Default

Anything that isn’t specifically marked into another queue falls into the Default queue (the Standard class in the table above).  Most HTTP-based apps are here.  The majority of my organization’s traffic falls into this queue.  It receives a healthy allocation of interface bandwidth as a result.

Scavenger

The Scavenger class is reserved for traffic that either exceeds normal network patterns or is known to be unnecessary.  We have defined a simple end-user port policy of allowing any user to send up to 5 Mb/s of traffic into the network.  Any traffic which exceeds this value is marked as CS1.  This promotes a level of fairness between users.  This queue is given a 1% bandwidth allocation at all network chokepoints, so it is the first to go when there is congestion.  As for the ‘known unnecessary’ category, I’d prefer to get rid of the application rather than penalize the traffic.
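
The per-port policy behaves like a single-rate policer that re-marks traffic instead of dropping it. Below is a minimal token-bucket sketch in Python. The 5 Mb/s rate comes from the paragraph above. The burst size, the class name ScavengerPolicer and the explicit timestamps are assumptions made for this example, and real switch policers differ in detail.

```python
# Sketch only: token-bucket policer that re-marks excess traffic to CS1.
# burst_bytes is a hypothetical value; only the 5 Mb/s rate is from the text.

CS1 = 8  # decimal DSCP value for the Scavenger marking

class ScavengerPolicer:
    def __init__(self, rate_bps: int = 5_000_000, burst_bytes: int = 625_000):
        self.rate_bytes_per_s = rate_bps / 8
        self.burst_bytes = burst_bytes
        self.tokens = float(burst_bytes)   # bucket starts full
        self.last_time = 0.0

    def mark(self, now: float, packet_bytes: int, dscp: int) -> int:
        """Return the DSCP the packet leaves with: unchanged if conforming, CS1 if not."""
        elapsed = max(0.0, now - self.last_time)
        self.last_time = max(self.last_time, now)
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(float(self.burst_bytes),
                          self.tokens + elapsed * self.rate_bytes_per_s)
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return dscp
        return CS1   # exceeding traffic is re-marked, not dropped

if __name__ == "__main__":
    policer = ScavengerPolicer()
    # A user blasting 1500-byte packets back to back at time zero:
    marks = [policer.mark(0.0, 1500, 0) for _ in range(500)]
    print(marks.count(0), "conforming,", marks.count(CS1), "re-marked to CS1")
```

Re-marking rather than dropping means a heavy user still gets the excess bandwidth whenever the network is idle and loses it only at a congested chokepoint.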

 

What’s Missing?

Notice there isn’t a ‘Priority Applications’ queue in this model.  We purposely defined our queues based on traffic characteristics, not application priorities.  This has served us well, and has prevented the cajoling many experience when they ask their users to define ‘important’ applications.

Will this model work for everyone?  Probably not.  Every organization has a unique set of applications.  If you have already agreed to prioritize something, it can be difficult to back out of that promise.  If you haven’t done so, I suggest you avoid heading down that path.  Think hard about the characteristics of your traffic, and attempt to

In a future blog post I will show how this policy translates to a packet marking configuration, and later, an interface queuing policy.  I am a proponent of MPLS networks, which introduce another wrinkle to the equation.  None of my incumbent providers offers a ten-class QoS scheme.  I will show what we’ve done to adapt our policy to those constraints.

1 comment:

John Graue said...

Very interesting post. Please continue with this thread cause aproaches of QOS in reality life like this is what are missing. Most books put things in the air and forget about things like, How can I transport 10 QoS clases when all the SP offer just 5?
Thanks Jeremy!

John Graue
CCIE# 21135