A Fundamental Rule of Troubleshooting Software Bugs

In a previous post, I talked about a really annoying bug:

“Quantity 127546.00 for asset Blah in System A does not match Quantity 127546 for asset Blah in System B”

I didn’t want to go through and list out all of the different log messages and whatnot, so I paraphrased them into that sentence.

And even though I know better, I thus violated a fundamental rule of troubleshooting.  The bug, obviously, wasn’t what my paraphrase implied.

When you confront a bug and have log messages of any sort related to it, read exactly what the messages say, and only what they say, and start from there.  Don’t immediately try to infer what they mean.  Pay attention.

I mean, it is *possible* that you could have a generally sophisticated validation system that couldn’t tell that 127546.00 and 127546 were identical values, but I think you’d have to try *really* hard to accomplish it.
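To show why I found that so hard to believe, here is a minimal sketch in Python. The function name is my own invention for illustration, and I'm assuming the quantities arrive as text; this is not the actual validation code from either system. Comparing the quantities as numbers treats them as equal, and you only get a mismatch if you compare the raw strings.

```python
from decimal import Decimal


def quantities_match(a: str, b: str) -> bool:
    """Compare two quantity strings by numeric value, not by their text."""
    return Decimal(a) == Decimal(b)


# Numeric comparison ignores the trailing zeros.
print(quantities_match("127546.00", "127546"))  # True

# A plain string comparison is the "try really hard" way to get it wrong.
print("127546.00" == "127546")  # False
```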

Once I stopped and reflected on this, I re-read the available messages.  I then started with the basics: where exactly in the system were the individual log entries created, and what exactly were the situations that would cause them?  Dive deeper, rinse and repeat.

The log messages were saying that I had an asset Blah in System A with quantity 127546.00 that didn’t match an asset Blah in System B with quantity 127546.  They were not saying they didn’t match because of the quantities.  But since that was what caught my eye, I wasted time on checking a part of the code that I just couldn’t imagine was failing (of course, if better analysis had led me to that part of the code, it wouldn’t have been a waste of time).

Once I actually focused on reading exactly what the messages said, it was pretty easy to determine that the messages weren’t logging the vital information of what wasn’t matching (for at least a vaguely defensible reason, though having incomplete log entries is really annoying).  And that led to the fix.
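For what it's worth, here is a rough sketch of the kind of log message I would rather have been reading: one that names the field that actually differs. Everything in it is hypothetical. The record layout, the `custodian` field, and the `describe_mismatch` helper are made up for illustration and are not what the real systems use.

```python
from decimal import Decimal


def describe_mismatch(a: dict, b: dict) -> str:
    """Return a message naming every field whose values differ, or '' if none do."""
    diffs = []
    for field in sorted(set(a) | set(b)):
        left, right = a.get(field), b.get(field)
        if left != right:
            diffs.append(f"{field}: {left!r} (System A) != {right!r} (System B)")
    return "; ".join(diffs)


# Hypothetical records; "custodian" is an invented field for illustration.
system_a = {"asset": "Blah", "quantity": Decimal("127546.00"), "custodian": "X"}
system_b = {"asset": "Blah", "quantity": Decimal("127546"), "custodian": "Y"}

print(describe_mismatch(system_a, system_b))
# custodian: 'X' (System A) != 'Y' (System B)
```

The quantities never show up in the output because they are numerically equal, so a reader of this message has nothing to be distracted by.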

And I really do know better.  Back when I was on the hook every day for fixing live production bugs, following this fundamental rule was second nature.  I’m obviously out of practice.

Just because you think the log messages are saying something doesn’t mean they actually are. 

posted on Wednesday, July 28, 2010 9:02 PM