Brief Review of Hibernating Rhinos Episode #10: Production Quality Software

A brief description of the episode and a link to the download can be found here.  It was produced by Ayende, a well-known and brilliant developer (but more than just a developer) within the .NET space, especially within the Alt.NET space.  I can’t really do justice in a short description to who he is or what he has done (surf his blog, and in particular, find his post on leaving active duty from the Israeli Armed Forces (I hope I described that properly), it is a *fascinating* read).  I’ve met him once, at the Alt.NET Open Spaces event in Seattle in April of 2008, and probably the best compliment that I can give him is that he is respectful.  Even when he thinks you are fundamentally wrong about something, he expresses his opinion in a respectful manner that is natural and unforced.  A skill I wish I had.

But enough about that.  Again, read his blog.

A notion that has been discussed in many places and on many levels is that of ‘maintainability.’  I’ve discussed it in places like here, here and here, and so I don’t want to rehash all of that again, but a brief note is in order.

My professional career in IT began from a SQL architect/developer standpoint, but mainly from an operations/production support standpoint.  Because of this, I have always viewed maintainability as being much, much more than ‘just’ making code easier for the next developer to maintain.  It has also meant maintaining code through deployment and production support.  It is no small feat to produce code that is more easily maintainable by the next developer who comes along, and I don’t want to minimize that.  But truly maintainable code, in my view, involves making deployments maintainable (see some of the above links) and making production support of that code maintainable as well.  If I had to coin a term that encompassed what I mean, it might be ‘sustainable code.’

With that in mind, Episode #10 is a very entertaining and enlightening overview of the sorts of things that one can encounter when dealing with code after it has been promoted to production, and the ways in which the problems you face there are different from those you face when developing code before it is promoted.

Roughly speaking, the episode is split into two parts: a list of the types of difficulties he has encountered in dealing with production support issues, and some ways that you can help to mitigate these issues.

The first part covers some of the common problems that one can run into (I will include some of my own experiences along with what Ayende mentions).  All software fails at some point, but the points of failure are often different from the reproducible bugs that occur during normal software development.  There is the obvious ‘the software is slow’ issue that can come from a mis-configured router, for instance, but problems also often arise because a production environment is so incredibly different in terms of volume (data size, client access, etc.) from what you will typically encounter before you hit production.  I’ve worked with a client that had tens of TB of data in production, and for many, many reasons, none of the development or QA environments could be set up with the same amount of data.

In an earlier job, I encountered a number of problems of this sort.  In the early days of the company, one of our figurehead owners was John Elway.  During the Super Bowl between the Rams and the Titans, he was interviewed in the 3rd quarter by whichever network was broadcasting the game, and the interviewer tossed off a bunch of softball questions about what he was doing with his life since he had retired.  After the whole ‘spending quality time with the wife and kids’ stuff, he mentioned that he had started a company: ours.  We never could get exact metrics, but in the seconds after he mentioned our company, our traffic went from something like 30 concurrent users to tens of thousands.  Needless to say, our infrastructure took a moment to reflect on that increase in traffic, and promptly gave up the ghost.

Similarly, when the site was announced at some gala press event, our CEO of the time made the one good decision of his tenure: that surfing the site live during the announcement would be ‘boring.’  This was fortunate.  Our networking setup was designed for high availability and redundancy, and so we had entirely mirrored the networking stack between the active side and the passive side.  In case the active side failed, the passive side would take over.  At the time, it was ‘cutting edge’ stuff.  Which meant it had bugs.  In particular, the crush of traffic that resulted from the press announcement meant that the passive side lost sight of the expected heartbeat from the active side, and took over.  Which meant that the now-passive side lost sight of the now-active side, and took control right back.  Back and forth.  With the site becoming unavailable each time the networking gear switched sides, for hours on end, until we killed the redundancy piece (we may have even shut down one entire side of the stack; I forget at this point).

We had, of course, done a number of simulation tests prior to the press event, but we simply couldn’t anticipate the actual course of events that occurred.

Many other production issues are not tied into items as obvious as this.  If a system is programmed to handle batches of events every 5 seconds, all is well and good.  Unless it turns out that in production, a single batch takes 10 seconds to process, leading to a cascade of failure.
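As a toy illustration of that failure mode (the numbers match the example above, but the code itself is my own sketch, not anything from the episode): when batches are scheduled faster than they can be processed, the backlog grows without bound rather than failing outright, which is exactly why it can go unnoticed until it cascades.

```python
# Hypothetical illustration: a batch job scheduled every 5 seconds,
# where each batch actually takes 10 seconds to process in production.
# The backlog of scheduled-but-unfinished batches grows forever.

INTERVAL = 5          # seconds between scheduled batch runs
PROCESSING_TIME = 10  # actual seconds each batch takes in production

def backlog_after(elapsed_seconds):
    """Batches scheduled minus batches completed after `elapsed_seconds`."""
    scheduled = elapsed_seconds // INTERVAL
    completed = elapsed_seconds // PROCESSING_TIME
    return scheduled - completed

# After one minute, 12 batches were scheduled but only 6 have finished.
print(backlog_after(60))   # 6
print(backlog_after(120))  # 12 -- the gap keeps widening
```

The insidious part is that nothing here throws an error; each individual batch completes successfully while the system as a whole falls further and further behind.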

Other production problems occur when you have a schedule for jobs that need to run over an interval of time, say, once every two or three weeks.  But the job requires permissions on a file system or other process, and those permissions somehow change between week one and week three.

These are issues that always lend themselves to ‘after the fact’ remediation (now that we know this can fail, how do we test conditions so it doesn’t fail the next time), but are very difficult to test ahead of time in non-production environments, because they just don’t tend to occur in non-production environments.

The second, more prescriptive, part of Episode #10 discusses ways of working through and around these things.

A very important point that Ayende makes is that the sort of logging that one is used to using for development purposes to combat development bugs doesn’t really work in fighting production bugs.  Logging is quite often way too verbose and way too low-level to be of use there.

I can vouch for this from my own experience.  A typical thing that one does when setting up a production environment is to use various tools that can produce alerts when there is a problem, alerts that can be routed to, for instance, a production pager.  A difficulty that arises is that these alerts often cascade.  When there is a problem in the production environment, it oftentimes produces such a volume of alerts that the pager is overwhelmed, and log files become so huge that it is very difficult to scan through them to find the root cause of an issue.

A key concept that Ayende discusses is the use of a message bus, where events in the production environment produce messages that can be consumed by the production database, but also by an operations database.  Take something as basic and fundamental as the processing of an order.  If you can ‘log’ to an operations database something as simple as the start and end time of each order’s processing, you can begin to develop rudimentary but vital reports that show when order processing starts to take longer and longer (without blocking anything), instead of waiting for the point at which order processing starts to hard block.

The other key concept that Ayende discusses along these lines, one I fully endorse, is the idea that you develop and design your code with production support in mind.  During the normal course of operations, what will the production support staff be looking at?  What are the events that will trigger a production support escalation, and how can you give proactive indications of pending problems before they trigger one?

Overall, I think this is a great episode to watch, and one that all developers who believe in the notion of maintainability should take note of.

posted on Wednesday, February 04, 2009 11:37 PM