Software developers have access to helpful debugging tools that let us step through our code's logic and find the lines that are misbehaving. If our code raises an error, the debugger tells us which line is at fault. And if we know our data, we know what results to expect, so we can work through the code to diagnose problems when we don't get those results.
Diagnosing problems during software development is relatively easy. The real challenge comes when your software is in operation and an end user has an issue: users report the effects of a problem, and we, as developers, have to diagnose its root cause from their description.
In the health care industry, doctors are often reminded to cure the disease, not the symptoms. The same should apply to software developers; it's very easy to get misled by symptoms. My favorite "explanation" from an end user is "It worked before and now it's not working." That vague statement always sends developers down a path of investigating what changed between when the software worked and when it didn't, and often the investigation turns out to be a waste of time.
Let me give you some real-life examples of being led down the wrong path by the wrong focus:
My friend, a staff engineer for Delphi, had to diagnose a problem with overheating vehicles. The company's customer service staff analyzed all the reported issues and concluded that most of the cars with the overheating problem were yellow. The engineering department was misled down the path of studying whether paint color had an effect on engine temperature. Later, they discovered that the overheating was caused by engines idling for too many hours during the day. Many of the affected cars were cabs in New York City; since most cabs spend a fair amount of time idling while waiting for passengers, and most NYC cabs are yellow, the customer service team's analysis was accurate (many cars with overheating issues were the same color), but it didn't get to the root of the problem (idling time).
I once wrote an application for processing payment reversal transactions. After the application had been in use for a while, I received complaints that it didn't work when processing reversals that resulted from bounced checks. I knew I'd tested for this scenario, so I was upset to learn that something was going wrong in its "real life" use. I found out later that a new staff member had been assigned to process the bounced-check reversals. This new employee wasn't properly trained and was mistakenly entering check numbers in the account number column, and vice versa. The problem wasn't with bounced checks; it was user error. But I wasted a lot of time researching the bounced check stuff before we discovered the real issue.
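That kind of user error is cheap to catch at the point of entry. Here's a minimal sketch of the idea: validate each field against its expected format and flag entries where the values appear swapped. The field formats, function name, and error messages are all hypothetical, not the original application's.

```python
import re

# Hypothetical formats; a real system would take these from the bank's spec.
ACCOUNT_RE = re.compile(r"^\d{10}$")  # e.g. exactly 10 digits
CHECK_RE = re.compile(r"^\d{1,6}$")   # e.g. 1-6 digits

def validate_reversal(account_number: str, check_number: str) -> list[str]:
    """Return a list of validation errors for a reversal entry."""
    errors = []
    if not ACCOUNT_RE.match(account_number):
        errors.append(f"{account_number!r} is not a valid account number")
    if not CHECK_RE.match(check_number):
        errors.append(f"{check_number!r} is not a valid check number")
    # Swapped fields tend to fail both checks while each value
    # matches the *other* field's format -- call that out explicitly.
    if errors and ACCOUNT_RE.match(check_number) and CHECK_RE.match(account_number):
        errors.append("fields may be swapped: each value matches the other field's format")
    return errors
```

A check like this wouldn't have fixed the training gap, but it would have surfaced "fields may be swapped" on day one instead of sending me off to debug the bounced-check logic.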
This leads me to a line one of my Operational Analysis professors used to use: "Problems don't float; only their effects come to the surface." Don't get misled by how a problem is described to you; do your own investigation and build your own understanding. Don't just skim the surface; you have to dive deep if you want to find the root cause of a problem.
Here are a few basic steps I've always found helpful when diving deep:
1. Collect examples of where the application malfunctioned and make sure you don’t limit the examples based on how the problem was defined to you. Do some analysis of your own to see which data sets produced the wrong results.
2. Compare the data to the source document to eliminate any data entry issues. Don't assume anything. I remember one application where data fields were mislabeled during a configuration step, causing data to appear in the wrong columns.
3. Make sure that business requirements haven’t changed. Try not to offend the user reporting the problem, but make sure the output they expected matches the business requirements you followed to build the application. I remember an instance where a payroll auditor held back hundreds of paychecks, declaring that salary calculations were wrong. Many hours were wasted verifying the software used to process the calculations, only to discover that the payroll auditor hadn’t been informed that pay raises had gone into effect.
4. Work through the application using the data from the examples you collected. Make sure at each step of the way that the output matches the business requirements. Again, don't assume anything; work through each and every step. Not only will this help you diagnose the problem reported, you might discover some other malfunction that was overlooked during development.
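Steps 1 and 4 can be sketched as a small replay harness: collect the reported examples alongside the output the business requirements say is correct, then run every example through the suspect logic and report mismatches, rather than trusting the description of where the application "doesn't work." Everything below is hypothetical — the business rule, the function name, and the example records are illustrative, not from any of the systems above.

```python
def calculate_pay(hours: float, rate: float) -> float:
    """Hypothetical business rule: time-and-a-half for hours over 40."""
    overtime = max(0.0, hours - 40.0)
    return round((hours - overtime) * rate + overtime * rate * 1.5, 2)

# Step 1: collected examples, including cases *outside* the reported
# problem area so the reporter's framing doesn't limit the search.
examples = [
    {"hours": 40.0, "rate": 20.0, "expected": 800.00},
    {"hours": 45.0, "rate": 20.0, "expected": 950.00},  # 800 + 5 * 30
    {"hours": 38.0, "rate": 22.0, "expected": 836.00},
]

# Step 4: replay each example and collect every mismatch between
# actual output and the requirement-derived expected value.
def replay(examples):
    mismatches = []
    for ex in examples:
        actual = calculate_pay(ex["hours"], ex["rate"])
        if actual != ex["expected"]:
            mismatches.append((ex, actual))
    return mismatches
```

An empty mismatch list points the investigation away from the code and toward the data, the requirements, or the user's expectations — which is exactly what happened in the payroll story above.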
And finally, once you do uncover the root of the problem, remember to congratulate the end user for discovering that special "hidden feature" in your application!