Monday, September 28, 2009

Troubleshooting an Application Problem

I grew up knowing I would be a computer programmer… then I reached my University years and realized I had no passion for it.  I still pursued and attained my Computer Science degree, but I gravitated towards non-programming courses like Computer Networks, System Architecture and Telecommunication Systems.  Before I lost interest, I was quite good at programming, and I believe that familiarity helps me understand why Cisco IOS and other network operating systems work the way they do.  My programming background also leads me to offer opinions on application development.

My first consulting project for RPM Consulting was for Philadelphia Newspapers, the owner of the Philadelphia Inquirer and Daily News papers.  Most of my efforts were focused on a new LAN design for their downtown office, which faced the familiar constraints of old, big-city office buildings, like limited conduit space and union labor for all physical cable moves.  While this was interesting, I was drawn to the secondary project:  troubleshooting a new Unisys-provided newswire service application.

This custom-built application was intended to provide a centralized, searchable database of all news wire stories.  In the Unisys lab, and then again in the Inquirer lab, it worked perfectly.  Users could access story titles and abstracts, and if they were intrigued, a single click would retrieve the content.  When the system was deployed to the first batch of users, it failed miserably.  Downloading the story list was quick, but it took several minutes to download each individual story.  The application developer blamed the network (big surprise).  A colleague and I were able to borrow RPM’s Network General lunchbox sniffer for a day to get the data we needed to troubleshoot.  Our packet traces revealed that the application was built on UDP, not the TCP I had been expecting.

As our studies have taught us, TCP has built-in reliability, in the form of sequence numbers, acknowledgements and retransmissions.  UDP provides none of these features.  Application programmers who choose to use UDP must provide their own reliability mechanisms.  In this case, the developer coded the client application to send an acknowledgement once the full story was received.  If the story was not received in a predetermined time period (approximately two minutes), the client application repeated its request.  If a single UDP packet was dropped, this timeout value was the only mechanism for detecting the event.  This was fine for a lossless lab network, but in a production network, packets are sometimes dropped, especially in this particular network, which was in the midst of a Token-Ring to Ethernet conversion.
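To make the failure mode concrete, here is a minimal simulation of the client behavior described above.  It is only a sketch, not the Unisys code: the function name fetch_story, the packet count, the loss rate and the assumption that each packet is lost independently are all mine, and the time needed to actually transmit the story is ignored.

```python
import random


def fetch_story(num_packets, loss_rate, timeout_s, rng, max_attempts=1000):
    """Simulate a client that re-requests the whole story after a timeout.

    Returns (attempts, seconds_spent_waiting_on_timeouts).
    All parameters are illustrative assumptions, not values from the real system.
    """
    waited = 0.0
    for attempt in range(1, max_attempts + 1):
        # Each UDP packet of the story is delivered or dropped independently.
        delivered = [rng.random() >= loss_rate for _ in range(num_packets)]
        if all(delivered):
            # The full story arrived; the client would now send its one acknowledgement.
            return attempt, waited
        # There are no per-packet sequence numbers or acknowledgements, so the
        # client learns about the loss only when its timer expires, and then
        # it asks for the entire story again.
        waited += timeout_s
    raise RuntimeError("story never arrived")


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs match
    attempts, waited = fetch_story(num_packets=20, loss_rate=0.05,
                                   timeout_s=120.0, rng=rng)
    print(f"attempts={attempts}, time lost to timeouts={waited:.0f}s")
```

On a lossless network the function returns after one attempt with no waiting, which matches what was seen in the lab.  With any loss at all, every failed attempt costs a full timeout period.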

Due to the ongoing conversion project, there was very little I could do to mitigate the packet loss in the short term.  Our recommendation was for the application developer to rewrite the application to use TCP.  Fortunately, my colleague was a former Unisys employee, so that task fell to him.  The interim solution was to dial down the timeout value from two minutes to a few seconds.  I’m not certain either solution was accepted, as my participation in that consulting assignment ended before a solution to the application problem was implemented.
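A rough calculation shows why shortening the timeout helps even when the packet loss itself cannot be fixed.  The sketch below assumes a story spans a fixed number of UDP packets, that each packet is dropped independently with the same probability, and that a failed attempt costs exactly one timeout period.  The function name and the sample numbers are hypothetical, not measurements from the Inquirer network.

```python
def expected_timeout_wait(num_packets, loss_rate, timeout_s):
    """Expected seconds spent waiting on timeouts before a story arrives intact.

    Assumes independent packet loss and a full re-request after every timeout.
    """
    # Probability that every packet of one attempt gets through.
    p_success = (1.0 - loss_rate) ** num_packets
    # The number of failed attempts before the first success follows a
    # geometric distribution with mean (1 - p) / p.
    expected_failures = (1.0 - p_success) / p_success
    return expected_failures * timeout_s


if __name__ == "__main__":
    for timeout_s in (120.0, 3.0):
        wait = expected_timeout_wait(num_packets=20, loss_rate=0.01,
                                     timeout_s=timeout_s)
        print(f"timeout={timeout_s:>5.0f}s  expected wait={wait:.2f}s")
```

Under these assumptions the expected wait scales linearly with the timeout, so going from two minutes to three seconds cuts it by a factor of 40.  The number of retries stays the same, which is why this is only a stopgap and not a substitute for per-packet reliability.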
