by Dennis Jones

IT Troubleshooting: Quickly Identifying and Solving Software Bugs

Opinion
May 14, 2008

No software is perfect--who hasn't had a user uncover some hidden flaw--but these tips will help you debug efficiently.

Nearly every IT project manager, designer, DBA and developer wants to build the perfect software application: the seamless union of hardware and software, intuitive and robust, with eye-popping performance and rock-solid logic. While this pinnacle is difficult to reach, and flaws will be found, there are steps you can take to resolve them more quickly.

Countless hours can be spent gathering requirements, creating meticulous database and program designs, and utilizing the very latest development tools and techniques. We can employ a seemingly endless array of unit, system, integration and regression test scripts, along with the finest implementation and training plans and procedures. Yet all of this massive effort and intent is simply no match for the one entity that reigns supreme when it comes to finding and exposing the best-hidden bug: the end user. Our customer. We might as well face the fact that no matter how many hours are spent bulletproofing code, end users are going to find problems. The tips below will provide the developer or technical support person with methods to quickly identify, verify, isolate and ultimately resolve such technical challenges.

Oh *#@$&!, We’ve Got a Production Problem!!

These are the words we hate to hear, but are destined to hear at some point. What to do? First things first: there must be a procedure in place that allows the user to properly describe and document the problem. Every production application should have a central help or support desk to be contacted when user issues arise. The help desk personnel are critical to a thorough and complete resolution. As such, they should be functional experts on the system being supported, so they can interact intelligently with the users. They must obtain information and documentation on the entire issue, not just the error condition or message that the user ultimately received. What were all the steps taken? A screen print of any error messages should be obtained. Screen prints can prove invaluable when a developer is trying to piece together exactly what portions of code have been executed, and in what order. Users sometimes leave out details that are second nature to them, and a screen print may point out these details to the support person.


Lucky You, This Is Now Your Issue to Resolve!

Hopefully it’s not a Friday afternoon, because you have a lot of documentation to go through. Go through all of it, and request confirmation of any ambiguities. You need to know exactly what steps were taken and the exact wording of any error message(s). Here’s where all of the time spent coding and testing that pesky error-handling logic in your application can pay off. Thorough error-handling logic is sometimes overlooked as a necessary part of an application. However, it is extremely important because, more often than not, when an error condition does occur it will be at a critical juncture and will need to be diagnosed and corrected quickly. In your application, you should be able to anticipate most error conditions, and thus handle them gracefully. Do so with a nice message to the user gently telling them how they (and not your robust application) have somehow made an error. Do not be naive enough, however, to think that the errors you code for or handle will be the only ones that occur. You must also have “catch all” error logic to handle unexpected errors.

Example: You can easily code error logic to inform your user that no records could be found based on some search criteria they entered. But what will happen, say, if the database goes down just as the user hits the “search” button? Or what if there is a power outage while your program is in the middle of saving records? How about when the user presses Ctrl-Shift-F8 while creating a new record, inserting a disc, and playing Solitaire in another window? You can bet on the fact that nearly every conceivable keystroke and concurrent program combination will eventually be attempted by your users. Your error-handling and commit logic must work together not only to capture information about the error conditions encountered, but also to preserve the integrity of your data in such events.
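As a minimal sketch of error-handling and commit logic working together, the Python example below uses the standard-library sqlite3 module. The `orders` table, the `save_records` function and the simulated failure are hypothetical, chosen only for illustration. The point it tries to show is that a failure partway through a save rolls back the whole batch instead of leaving half-written data behind.

```python
import sqlite3


def save_records(conn, records):
    """Insert all records in one transaction; roll back everything on any error.

    Returns True if the batch was committed, False if it was rolled back.
    """
    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if an exception escapes the block.
        with conn:
            for record_id, amount in records:
                if amount is None:
                    # Stand-in for an unexpected failure in the middle of a save.
                    raise ValueError(f"record {record_id} has no amount")
                conn.execute(
                    "INSERT INTO orders (id, amount) VALUES (?, ?)",
                    (record_id, amount),
                )
        return True
    except Exception:
        # "Catch all" branch: the transaction has already been rolled back.
        # A real application would also write the details to its error log.
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.commit()

ok = save_records(conn, [(1, 10.0), (2, 25.5)])      # both rows are saved
failed = save_records(conn, [(3, 7.0), (4, None)])   # neither row is saved
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

After both calls the table holds only the two rows from the first batch; record 3 was inserted and then undone when record 4 failed.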

It is an excellent idea to have a common error-handling routine that writes to an error log. Your “catch all” error logic can call this routine whenever unexpected errors occur. In this error log you can record the exact date and time, the name of the offending program and/or any subprograms, any pertinent record names or IDs, and any error codes and text generated by either your application programs or the database. Bottom line: Provide the user with an error message that means something to them (e.g., “An error has occurred while processing this record. Please contact the help desk immediately.”), and provide the support person (via the error log) with information that means something to them.

All of this information is necessary because your initial goal is to recreate the error condition in your development or test environment. In general, error conditions must be recreatable in order to be correctable. Many times, error conditions that cannot be recreated are the result of a user who has forgotten some of the steps that were taken, or other circumstances that were present. These are the dreaded one-time problems that mysteriously go away by themselves. Guess what? Just as mysteriously, they tend to reappear by themselves at a later date, often with the same user.

If you can speak directly to the person who received the error, do so. Go over all steps that led to the error. If possible, visit the location to examine the software, hardware and data. If a site visit is not feasible, an export of the user’s data can be very helpful in resolving errors. Ask some additional questions. What other processes were running when the error condition was encountered? Were any other unusual circumstances present? Were multiple error messages received? It is sometimes difficult to get the entire story, especially if the user feels that they have somehow made a mistake in the process. By being sensitive to this, and communicating your sincere desire to help, you can usually get all the details of the event.
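A common error-handling routine of the kind described above might look like the following Python sketch. Everything named here is an assumption made for illustration: the function names, the `CUST-1042` record ID, the JSON-lines log format and the simulated database outage. The user sees only the friendly message, while the log entry carries the date and time, program name, record ID and error text that the support person needs.

```python
import io
import json
import traceback
from datetime import datetime, timezone

USER_MESSAGE = (
    "An error has occurred while processing this record. "
    "Please contact the help desk immediately."
)


def log_unexpected_error(log_file, program, record_id, exc):
    """Append one structured entry to the error log and return the user message."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "program": program,
        "record_id": record_id,
        "error_type": type(exc).__name__,
        "error_text": str(exc),
        "traceback": traceback.format_exception(type(exc), exc, exc.__traceback__),
    }
    log_file.write(json.dumps(entry) + "\n")
    return USER_MESSAGE


def update_customer(record_id, log_file):
    """Hypothetical operation that fails in a way nobody coded for."""
    try:
        raise ConnectionError("database connection lost")  # simulated outage
    except Exception as exc:  # the "catch all" branch
        return log_unexpected_error(log_file, "update_customer", record_id, exc)


error_log = io.StringIO()  # stands in for a real log file
shown_to_user = update_customer("CUST-1042", error_log)
logged = json.loads(error_log.getvalue())
```

Writing one self-contained entry per line keeps the log easy to search by timestamp or record ID when the help desk passes an issue along.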

Let’s Make This Error Happen!

OK, you have all of the background information and documentation on the error condition in your hands. Now you need an environment where you can recreate the error condition without causing undue harm to any of your fellow developers, testers or (heaven forbid!) your production users. Usually at any given time in a production system, there will be programs that are in the process of being modified, new modifications that are being tested and/or successfully tested changes waiting to be deployed. This can pose problems when trying to recreate an error condition from the field. Version control is a big topic unto itself, so I’ll not delve too deeply into it here. Suffice it to say that you will need access to all programs and database objects currently in use at the location where the error occurred, and you should use this environment when attempting to recreate the condition. Typically, these programs and objects will be bundled into a particular release, which you should be able to load onto a test machine. It is absolutely critical that you work to recreate the error in the exact same environment (with the probable exception of hardware) that your stricken user is in. If the issue is one of performance, research the hardware in use at the site in question and attempt to recreate the error with similar hardware. Do not attempt to recreate a “slow response” condition using a brand-new PC with a 3.2GHz processor and 4GB of RAM, unless your user has the same or nearly the same setup.

If an error message was received by the user, you should be able to locate the area in your code that displays this message. Look for conditional logic surrounding the error message (i.e., what conditions must exist for this error message to be displayed?). This will also help in determining exactly what steps were followed by the user to reach that block of code. Hopefully, what you discover coincides with the user’s account of their steps taken.
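One way to locate the code behind a message is a plain text search of the source tree. The Python sketch below is only an illustration: the `find_text` helper, the file suffixes it scans and the sample files are all assumptions, and most editors and command-line tools offer an equivalent search.

```python
import tempfile
from pathlib import Path


def find_text(root, needle, suffixes=(".py", ".sql", ".ini", ".cfg")):
    """Return (relative path, line number, line) for each line containing needle."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((path.relative_to(root).as_posix(), number, line.strip()))
    return hits


# Demonstration on a throwaway directory with one hypothetical source file.
with tempfile.TemporaryDirectory() as root:
    Path(root, "search.py").write_text(
        "def search(rows):\n"
        "    if not rows:\n"
        "        show('No records found')\n",
        encoding="utf-8",
    )
    Path(root, "notes.txt").write_text("No records found\n", encoding="utf-8")
    hits = find_text(root, "No records found")
```

Each hit gives a file and line number to open; the `if not rows:` test just above the matching line is the kind of surrounding conditional logic worth reading next.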

Some error conditions, however, are not as obvious as those that result in an error message being displayed. It may be that all appears well until a particular report or screen is accessed, and the data contained therein is misrepresented or inaccurate. This can be the result of “logic loopholes” that allow users to create, modify or delete data in a way that was not intended by the system designers. These loopholes can and should be minimized very early in the software development life cycle by performing negative testing, that is, deliberately exercising the system with input and actions it was never designed to accept. It’s very easy for developers and testers to test processes only for functionality as designed. However, test scripts should also include a healthy number of steps unrelated to the actual functionality of the process. Many testers perform negative testing before even attempting to test the intended functionality. Only if the process passes their “use all available fingers on the keyboard simultaneously” type of negative testing do they proceed with the positive testing. This may be time-consuming, but it can pay huge dividends later.
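To make the idea of negative testing concrete, here is a small Python sketch. The `parse_quantity` function and its 1 to 999 rule are hypothetical; what matters is the list of inputs that the design never intended to accept, each of which must be refused rather than silently stored.

```python
def parse_quantity(text):
    """Parse an order quantity typed by a user; accept only whole numbers 1-999."""
    if not isinstance(text, str):
        raise ValueError("quantity must be text")
    cleaned = text.strip()
    if not cleaned.isascii() or not cleaned.isdigit():
        raise ValueError(f"not a whole number: {text!r}")
    value = int(cleaned)
    if not 1 <= value <= 999:
        raise ValueError(f"quantity out of range: {value}")
    return value


def is_rejected(bad_input):
    """Return True if parse_quantity refuses the input with ValueError."""
    try:
        parse_quantity(bad_input)
    except ValueError:
        return True
    return False


# Negative cases: empty, blank, letters, negative, zero, too large, decimal,
# injected text, a missing value, and a non-ASCII digit.
NEGATIVE_CASES = [
    "", "   ", "abc", "-5", "0", "1000", "3.5", "12; DROP TABLE", None, "\u0663",
]

# Any input listed here slipped through and points to a logic loophole.
not_rejected = [case for case in NEGATIVE_CASES if not is_rejected(case)]

# One positive case, run only after the negative ones.
valid = parse_quantity(" 42 ")
```

An empty `not_rejected` list means every bad input was refused; the positive case then confirms that the intended functionality still works.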

Error Condition Recreated…Let’s Celebrate!!

Once you are able to consistently reproduce an error condition, you are well on your way to resolving it. Resolutions may be simple and confined to a single program (e.g., a cosmetic change to a GUI), or massive. Your analysis must be thorough enough to determine the scope of any resolution. Remain cognizant of any ripple effects your proposed change might cause. If you intend to change the way certain data is populated or manipulated, do a search of the entire system for this data to determine how your change might impact other system components. (Do not just search the system documentation; instead search all of the physical source code, all database objects and any initialization, configuration and control files.) Such a search should also be performed if it is deemed necessary to modify the database structure (e.g., by adding new data elements or constraints). Failure to do a complete analysis before making code changes of this sort leaves you vulnerable to introducing new issues along with your fabulous fix. Nothing frustrates a user community more than a fix for one problem that introduces one or more new problems. It is one of the quickest ways to break down the trust that needs to exist between customer and support staff. Also, don’t fret if it has taken a considerable length of time to pinpoint flawed logic, only to find that it takes just a couple of moments to actually correct the code. Analysis, diagnosis and testing are the hard parts and will typically take much longer than modifying code. Please also remember that if “bad” data has been introduced into the application, the resolution must not only address the process that allowed the “bad” data to exist, but also include some mechanism for correcting this data. Unfortunately, that mechanism will typically end up being some sort of data manipulation script that is outside of your application. These types of “fix-it” scripts can be very dangerous, so great care must be exercised when creating, testing and deploying them.
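The Python sketch below shows some of the care a “fix-it” script can take, using the standard-library sqlite3 module. The `accounts` table, the rule that a negative balance is “bad” data, and the `expected_max` guard are all assumptions invented for this example. The script previews what it would change, refuses to run if far more rows match than expected, and applies the correction inside a single transaction.

```python
import sqlite3


def fix_negative_balances(conn, dry_run=True, expected_max=10):
    """Reset negative balances to zero; return the ids of the affected rows.

    With dry_run=True nothing is changed. The update is refused if more rows
    match than expected, which guards against a badly scoped WHERE clause.
    """
    ids = [row[0] for row in conn.execute(
        "SELECT id FROM accounts WHERE balance < 0 ORDER BY id")]
    if len(ids) > expected_max:
        raise RuntimeError(f"{len(ids)} rows match, expected at most {expected_max}")
    if not dry_run:
        with conn:  # commit on success, roll back on any error
            conn.execute("UPDATE accounts SET balance = 0 WHERE balance < 0")
    return ids


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [(1, 50.0), (2, -20.0), (3, -0.5), (4, 0.0)])
conn.commit()

preview = fix_negative_balances(conn)                  # dry run: report only
applied = fix_negative_balances(conn, dry_run=False)   # real run
remaining = conn.execute(
    "SELECT COUNT(*) FROM accounts WHERE balance < 0").fetchone()[0]
```

Running the dry run first against a copy of the affected data, and comparing its output with the rows you expected to correct, is a cheap check before the script goes anywhere near production.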

Effort, Impact, Risks, Workarounds

OK, your analysis is complete. Depending on your IT departmental procedures, you may be required to provide an effort and impact estimate of your proposed resolutions. Many times there will be multiple possible resolutions, with some being more or less desirable than others based on many factors. (Providing more than one possible resolution can sometimes help reinforce your position on recommending a particular one.) If your primary resolution is not just a cut-and-dried simple program change, then it may be a good idea to create an effort/impact document regardless of whether or not it is a requirement of your shop. In this document, list each possible resolution, along with all programs, processes, database objects, files and anything else that requires modification. Provide an estimate of the effort involved in each segment of the resolution. This effort estimate should include time for coding, creation and execution of test scripts, possible rework based on testing results, documentation, implementation, and possible post-implementation support and training. Overall impact to the entire system must also be documented, broken down by each individual proposed change (e.g., if you propose modifying five programs, what is the impact of changing each of those five programs on other system components?). List any and all known risks associated with each resolution. Risk assessment is another area that can prove challenging. I’ve heard risk defined as a combination of uncertainty and constraint. Risks applying to software application changes can include, but are certainly not limited to: time constraints, manpower constraints, cost constraints, or uncertainty about whether the impact has been completely identified.

If your resolution has a large impact, there may be reasons that management won’t want to forge ahead with it immediately. You could be at a critical point in the SDLC, perhaps attempting to baseline your application. In any event, it may be necessary for a workaround to be identified. Ideally, a workaround is a way that end users can avoid a known error condition by performing different steps or performing steps in a different order. If the error condition in question introduces “bad” data into your system, then you’d better have your “fix-it” script at the ready, because invariably, no matter how good their intentions, your users will forget to follow the workaround steps and encounter the error condition again. This is why a workaround should be considered only a temporary fix, and should not remain in place for too long.

What It Means

Effective troubleshooting is one of the most important skills in the IT world, yet sadly many support personnel remain inadequate in this area. Done well, it can minimize your turnaround time on issue resolutions and help retain user confidence in your application and support staff. Given the proper amount of time, technique and information, virtually any error condition can be identified and resolved.

Dennis Jones is an IT professional who has worked in the industry since 1989. He has served in many capacities, from help desk and technical support to developer, DBA and technical manager. He is available for onsite consultation or speaking engagements.