BlogCoward - February 2009 Blog Posts

DDD-Lite: Mama don’t let your objects grow up to be invalid (take two)

So, in a previous post, I was making a point about preventing classes from being invalid by getting rid of setters (public setters at least).

But, I did it in a way that was misleading, or at least could be. For one thing, I used the term ‘entities’ in the title of the post. Why was this bad? Well, within DDD, there is a distinction that is often made between objects that can be called “entities” and those that can be called “value objects.” And the purposefully simplistic example that I used (that of an Address class) is normally considered to be a value object and so the title of the post seemed to be inaccurate. Moreover, it made it seem as if the general point really only applied to value objects, when in fact, it has as much, if not more, to do with entities.

What is the distinction between an entity and a value object? Read the book. Thanks for coming, good night! No, no, that won’t do. Though there are different ways of laying out the distinction, one way of making it is to say that a value object can be defined in terms of the collection of its attributes, while an entity is something that has an identity over and above its attributes. The basic idea seems simple enough: if two Addresses have the same street address, city, state and zip code, they are in fact the same Address, while two Customers with the full name of John Smith can still be two distinct Customers.
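To make the distinction concrete, here is a minimal sketch in Java (the snippets in these posts are C#-flavoured, but the shape is the same). It is an illustration under assumptions, not a canonical DDD implementation: the `id` field on Customer is a hypothetical surrogate identity, and the sample street address is made up.

```java
// Value object: two instances with the same attributes are the same Address.
final class Address {
    private final String street;
    private final String city;
    private final String state;
    private final String zip;

    Address(String street, String city, String state, String zip) {
        this.street = java.util.Objects.requireNonNull(street);
        this.city = java.util.Objects.requireNonNull(city);
        this.state = java.util.Objects.requireNonNull(state);
        this.zip = java.util.Objects.requireNonNull(zip);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Address)) return false;
        Address other = (Address) o;
        return street.equals(other.street) && city.equals(other.city)
                && state.equals(other.state) && zip.equals(other.zip);
    }

    @Override
    public int hashCode() {
        return java.util.Objects.hash(street, city, state, zip);
    }
}

// Entity: identity lives in the id, not in the attributes.
final class Customer {
    private final long id;          // hypothetical surrogate identity
    private final String fullName;

    Customer(long id, String fullName) {
        this.id = id;
        this.fullName = fullName;
    }

    String fullName() { return fullName; }

    @Override
    public boolean equals(Object o) {
        return o instanceof Customer && ((Customer) o).id == id;
    }

    @Override
    public int hashCode() {
        return Long.hashCode(id);
    }
}

final class ValueVsEntityDemo {
    public static void main(String[] args) {
        Address a1 = new Address("123 Main St", "Valparaiso", "IN", "46383");
        Address a2 = new Address("123 Main St", "Valparaiso", "IN", "46383");
        Customer c1 = new Customer(1L, "John Smith");
        Customer c2 = new Customer(2L, "John Smith");

        System.out.println(a1.equals(a2)); // true: same attributes, same Address
        System.out.println(c1.equals(c2)); // false: same name, distinct Customers
    }
}
```

Whether a surrogate id is the right notion of identity is exactly the kind of question that depends on the system being modelled.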

digression: but it isn’t really that simple. The concept of identity is one that is very difficult to examine adequately. It is a hard subject, very hard, painfully viciously hard. Okay, maybe it isn’t that hard, but it ain’t easy. Consider the concept of personal identity. What makes you you? In common parlance, we want to say that someone, in some meaningful sense, maintains their identity through time, but how do we, well, identify that which defines one’s identity? One’s body changes through time, so that can’t be it. Maybe it is the chain of memories that we have, but one’s memories change through time. And what if the chain is broken? We might want to say of an amnesiac that they are the same person, even though their memories are damaged. Maybe it has something to do with self-awareness, the phenomenological experience that you have of being a person who persists through time. Thus, a possibly standard science fiction trope of one’s self-awareness suddenly being moved to another physical body, and the wacky hijinx that ensue. But one’s self-awareness has broken links. I go to sleep at night. Sometimes I dream, but not always.

It gets (arguably) worse. Keying on commonly accepted scientific theories and evidence, some philosophers argue that there are, for instance, no such things as beliefs, because you cannot in fact give a solid identification for what makes something a belief as an actual mental ‘thing’. Instead, there is all this neural activity going on in the brain and we act in certain ways, but there are actually no such things, and the same could be said about memories, so we couldn’t identify personal identity in terms of mental properties, because they aren’t real (the cheap philosophical retort is to ask if these philosophers believe that there are no beliefs).

And it gets (arguably) worse still. In common parlance, we would say that I am sitting on a chair as I write this drivel. But what makes that chair the same chair? If a screw falls out of it, it isn’t normal to say that it is no longer the same chair. What if I take a chain saw to it? At what point is it no longer a chair? Keeping power tools out of the discussion, if we dive down to the particle level, we can’t even identify which particles make up this ‘thing’ that we call a chair, as they are constantly entering and leaving the same physical space (though what makes a ‘physical space’ ‘the same’ as another?). And we can’t even identify what makes a particle the same particle.

Driven to extremes, the notion of ‘Identity’ seems to break down until, perhaps, it is only things like numbers, arguably the ‘quintessential’ value objects, that have a strict identity, which can be defined exactly in terms of their attributes.

Bringing this back somewhat to software development, consider a Customer within a CRM application. If there are two Customer records that share all of the same relevant (completely ignoring how to define ‘relevance’) attributes save a middle initial (same first and last name, same address, same phone number, same credit card, etc.), we would normally say that this is probably an error condition, that they are the same Customer. And we determine this by examining the attributes. Conversely, in a Property Management system, an Address doesn’t seem to be a value object (though it would probably be called a Property or Location, to be sure).

But I digress.

Moreover, in said previous post, I based the point on two mutually dependent properties, but that is only one type of validation concern.

So, let’s try this again. Briefly.

Why should we care about allowing objects to be invalid?

The first point is very simple. Once you allow an object to be invalid, then at certain points in a system, you will have to ensure that an object is valid to prevent BAD THINGS from happening. Could be because of interactions with other objects, needing to persist those objects, or a host of other reasons. At each of these certain points, you will have to remember to do a validation check. And, I think we can all agree, once there is something you have to remember when developing software, there will be a time and/or a place where/when you will forget.

Moreover, validation very often involves much more than just ‘simple’ (in terms of lines of code) comparison of one property to another. It is usually much more along the lines of “Given conditions A, B and C, allow D, E, or F, but not G or H. But, given conditions A, B, and I, allow D and G, but not H or Q”. And so on. This is where allowing setters on properties can lead to some difficulty. With setters, I can change my conditions piecemeal, throughout different parts of the system, and then hope that validation is successful later on. But what if it isn’t? What if validation fails? Which of those piecemeal conditional changes are the right ones, and which are the wrong ones? How do you roll back to a valid state?

If you eliminate the possibility of an object being invalid in the first place, you help to manage (note, I didn’t say ‘eliminate’) these issues. If, at any point of change, you reject a change that will produce an invalid result, you can (potentially) ease the cost of handling the misguided change. For instance, if instead of changing certain properties of a Customer one by one and then later validating, you pass an ICustomerProfileChange message to your Customer class, you have one point of contact, so to speak, one place to handle the validation issues.
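As a rough sketch of that ‘one point of contact’ idea, here is what it might look like in Java. The interface name adapts the ICustomerProfileChange mentioned above; everything else (the fields, the contact-preference rule, the SimpleProfileChange class, the sample values) is a hypothetical assumption made purely for illustration.

```java
// Hypothetical change message: carries the whole batch of profile values.
interface CustomerProfileChange {
    String contactPreference(); // "EMAIL" or "PHONE" in this sketch
    String email();
    String phone();
}

final class SimpleProfileChange implements CustomerProfileChange {
    private final String contactPreference;
    private final String email;
    private final String phone;

    SimpleProfileChange(String contactPreference, String email, String phone) {
        this.contactPreference = contactPreference;
        this.email = email;
        this.phone = phone;
    }

    public String contactPreference() { return contactPreference; }
    public String email() { return email; }
    public String phone() { return phone; }
}

final class Customer {
    private String contactPreference;
    private String email;
    private String phone;

    // A Customer can only be created from a batch that validates.
    Customer(CustomerProfileChange initial) {
        apply(initial);
    }

    // The single place where profile changes are validated and applied.
    // If validation fails, nothing has been modified yet.
    void apply(CustomerProfileChange change) {
        validate(change);
        this.contactPreference = change.contactPreference();
        this.email = change.email();
        this.phone = change.phone();
    }

    // Assumed rule: the preferred contact channel must actually be filled in.
    private static void validate(CustomerProfileChange c) {
        boolean hasEmail = c.email() != null && !c.email().isEmpty();
        boolean hasPhone = c.phone() != null && !c.phone().isEmpty();
        if ("EMAIL".equals(c.contactPreference())) {
            if (!hasEmail) throw new IllegalArgumentException("email preference requires an email");
        } else if ("PHONE".equals(c.contactPreference())) {
            if (!hasPhone) throw new IllegalArgumentException("phone preference requires a phone");
        } else {
            throw new IllegalArgumentException("unknown contact preference");
        }
    }

    String contactPreference() { return contactPreference; }
    String email() { return email; }
    String phone() { return phone; }
}
```

Because the check runs before any field is assigned, a rejected change leaves the Customer exactly as it was, so there is nothing to roll back.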

Then there is the cost of allowing invalid objects into your system. Once invalid data is there, how widespread can it get? A very common task within many environments is data cleanup, fixing all of those errors that have crept in due to various reasons. There can be a significant maintenance cost involved here. This is something I deal with at client sites all the time, fixing all of the invalid data spread throughout the system. If you prevent the source of the data from being invalid, then you can more easily manage (note, I didn’t say ‘eliminate’) this issue.

If you have invalid data being persisted, then you didn’t validate it before you persisted it.

Yes. But, that is sort of the point. Someone forgot to check whether the class being persisted was valid.

Eliminating setters makes it harder to get things done.

Yes. But, that is also sort of the point. By forcing changes to occur in batches (can you tell I come from a SQL/ETL background?), you force changes to be done in consistent chunks.

So, that’s it? Eliminate setters everywhere?

Nope. Blindly following this idea could be bad, for a number of reasons.

By never allowing invalid objects, can’t this increase the cost of maintenance?

Yes. It can. ETL projects are an obvious case, but in general, you have to judge the cost of allowing invalid data to enter your system against the cost of blocking the import of data because some of it is invalid. It could be that the cost of cleaning up, say, invalid Customer data in a CRM system is less than the cost of ensuring the source of the data is valid in the first place. Allowing an invalid Trade is a different consideration. Then again, it all depends. As with anything else, you always have to make judgments about what is better or worse based on the specifics of the system. Sometimes, you just want to be able to change the First Name of a Customer, and forcing an ICustomerFirstNameChange message is just silly. Blind adherence to any pattern is problematic.

Summation

By forcing changes to objects to adhere to specific ‘batches’, you improve your chances of preventing invalid data from entering your system. As always, you have to weigh the cost of doing this against the savings you will gain elsewhere before deciding whether it is worth it.

posted @ Friday, February 27, 2009 8:32 PM | Feedback (2)

DDD-Lite: Mama don’t let your entities grow up to be invalid

Update: Troy pointed out that using Address as an example wasn't good, because in DDD, Address is considered what is called a 'value object' which has different features than what are called 'entities.' Since the point I was trying to make was about validation and why eliminating setters is an important concept, that wasn't totally relevant, but it's a valid objection (pun intended) as it could be misleading. So, I'm going to update the example. In a minute or two. Just keep that in mind till then.

One of the hardest things about ‘continuous improvement’ (which I think really amounts to ‘not sucking more at the end of the day’, but I digress….seriously, some steps one might take to become a better developer are steps backwards, which I think the ‘craftsmen’-type people don’t discuss enough…but I digress) is trying to keep up with all of the possible information out there, from blogs, whitepapers, and other things. I estimate the number of posts I get in my various readers is in the dozens a day. Even filtering out the fluff, there is a lot of great stuff out there. On top of that, I subscribe to a number of mailing lists, which just increases the number of things to read on a daily basis if I want to keep up.

Blah, blah, blah. Anyway, two notions/concepts/thoughts pop up on the Yahoo DDD mailing list that are related, but not necessarily clear on first glance. One is the idea that you should never let your entities enter an invalid state, the other is that you should get rid of setters on your entities altogether.

I won’t claim to understand all of the implications this has if you are doing full-blown DDD, but within DDD-Lite, I think I can give an example of what is involved. YMMV.

Suppose you have a (very simplistic) Address class that has a City property and a Zip Code property. You will want to be able to validate that these two properties match in your application. But there is an immediate circular dependency problem here.

Suppose you create an instance of your Address class and set those properties. So (very naive implementation):

Address a = new Address();
a.City = "Valparaiso";
a.Zip = "46383";

Ignoring the obvious difficulties, let’s assume there isn’t a validation problem here (though there is, as we’ll see). Suppose that you want to update the Address so that the City is ‘Chicago’ and the Zip is ‘60618’. One might think you can do it this way:

a.City = "Chicago";
a.Zip = "60618";

But, if you change the City property first, unless you do validation, your entity is invalid, as the City = “Chicago” does not validate against the Zip of “46383”, which is what the Zip is at the time of the change. It needs the Zip to be changed so that Zip = “60618” to be valid. But, if you try to set the Zip = “60618” first, your entity is invalid since that does not validate against the City of “Valparaiso”. So, you can’t change one without the other at the same time before you do validation.
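The circular dependency can be demonstrated with a small Java sketch in which each setter validates against the current value of the other field. The class name SetterAddress and the two-entry city/zip lookup are hypothetical stand-ins; a real system would consult a proper postal data source.

```java
final class SetterAddress {
    // Starts out valid: Valparaiso matches 46383.
    private String city = "Valparaiso";
    private String zip = "46383";

    // Hypothetical stand-in for a real city/zip lookup.
    private static boolean matches(String city, String zip) {
        return ("Valparaiso".equals(city) && "46383".equals(zip))
                || ("Chicago".equals(city) && "60618".equals(zip));
    }

    // Validates the new city against the zip as it is right now.
    void setCity(String newCity) {
        if (!matches(newCity, zip)) {
            throw new IllegalStateException(newCity + " does not match zip " + zip);
        }
        city = newCity;
    }

    // Validates the new zip against the city as it is right now.
    void setZip(String newZip) {
        if (!matches(city, newZip)) {
            throw new IllegalStateException(newZip + " does not match city " + city);
        }
        zip = newZip;
    }

    String city() { return city; }
    String zip() { return zip; }
}

final class SetterOrderDemo {
    public static void main(String[] args) {
        SetterAddress a = new SetterAddress();
        try {
            a.setCity("Chicago"); // rejected: zip is still 46383
        } catch (IllegalStateException e) {
            System.out.println("City first fails: " + e.getMessage());
        }
        try {
            a.setZip("60618"); // rejected: city is still Valparaiso
        } catch (IllegalStateException e) {
            System.out.println("Zip first fails: " + e.getMessage());
        }
    }
}
```

With per-property validation switched on, neither order works, so the object can never move from Valparaiso to Chicago; with it switched off, the object passes through an invalid state on the way.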

The first time I read something about this (long before lurking on the DDD list), I had the immediate, gut-check reaction: this is just stupid. You have to be able to make multiple changes. You will generally do this from some UI screen anyway, so you will pass all of those changes in as a bunch.

But this side-steps the issue. In your batch, as you change the properties one by one, your entity can’t do validation on each property change, instead it has to do validation on the entire batch. Immediately, you have the difficulty of managing this. If your entity has property validation, you have to make sure it doesn’t do this outside of the entire batch, so you have to write code to manage this.

More importantly, if your entity has setters for City and Zip, and these perform validation, there is nothing to prevent client code that has access to your entity from trying to set these properties individually.

Okay, so how the hell else are you going to do it? This is where eliminating setters comes in. You prevent client code from being able to set individual properties at all. Instead, to change an Address, you have to pass in the ‘full batch’. The mechanisms by which you could do this are many, so let’s examine a simple one. First, to prevent the initial problem with validation, you force the constructor for an Address to take in a batch:

Address a = new Address("Valparaiso", "46383");

Your constructor enforces validation on creation. Then, to change the Address, you pass in the batch this way (listing the constructor again):

Address a = new Address("Valparaiso", "46383");
Address newAddress = new Address("Chicago", "60618");
a.ChangeAddress(newAddress);

The ‘ChangeAddress’ method takes in the entire batch of a different address, and so can perform validation on the batch.
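For what it is worth, a minimal Java sketch of a class along these lines might look as follows. The two-entry city/zip lookup is a hypothetical placeholder for whatever real validation applies, and the method is spelled changeAddress to follow Java naming conventions.

```java
final class Address {
    private String city;
    private String zip;

    // Validation is enforced on creation, so no invalid Address can exist.
    Address(String city, String zip) {
        requireMatch(city, zip);
        this.city = city;
        this.zip = zip;
    }

    // Replaces both values at once. The replacement was validated when it
    // was constructed, so this object never holds a mismatched pair.
    void changeAddress(Address replacement) {
        if (replacement == null) {
            throw new IllegalArgumentException("replacement address is required");
        }
        this.city = replacement.city;
        this.zip = replacement.zip;
    }

    // Hypothetical stand-in for a real city/zip lookup.
    private static void requireMatch(String city, String zip) {
        boolean ok = ("Valparaiso".equals(city) && "46383".equals(zip))
                || ("Chicago".equals(city) && "60618".equals(zip));
        if (!ok) {
            throw new IllegalArgumentException(city + " does not match zip " + zip);
        }
    }

    String city() { return city; }
    String zip() { return zip; }
}

final class ChangeAddressDemo {
    public static void main(String[] args) {
        Address a = new Address("Valparaiso", "46383");
        a.changeAddress(new Address("Chicago", "60618"));
        System.out.println(a.city() + " " + a.zip()); // Chicago 60618
    }
}
```

There are no setters here at all, so client code has no way to change the city and zip independently.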

Why? Really, why do this?

If you eliminate setters and allow only methods that attempt a batch update, then you no longer need to check the validity of an entity. So what?

Well, think about what you might typically need to do if you are passing an entity between tiers. At each tier, you might need to do an if (entity.IsValid()) check before proceeding. What’s the likelihood you will forget to do this check?

If you prevent an entity from ever becoming invalid, you relieve yourself (and your codebase) from having to worry about this.

Summation

This example is incredibly simplistic. But, imagine that the class isn’t of type Address, but of type Order, or of type TradeRequest, etc.

Always force your entities to do validation on creation, so that you cannot create an invalid entity. Eliminate setters, and replace them with method calls that will perform validation on a batch update. This will also prevent you from changing a valid entity to an invalid one. By doing so, you eliminate the need to perform validation checks across your application. In this simplistic example, you pass in entire entities to your change requests, but you will probably do this by passing in messages or commands.

Regardless, the point is the same. By eliminating property setters, you eliminate the risk of having invalid entities at any point in your application.

posted @ Friday, February 20, 2009 9:27 PM | Feedback (4)

DDD: Step by Step Guide by Casey Charlton

Casey Charlton is putting together a series of posts about DDD that are downloadable in a single PDF file here.

As he updates his series, he updates the PDF. I think this is a great service. It introduces a number of the most important concepts of DDD in a clear and concise way which helps to dispel a lot of the ‘DDD is mystical’ stuff. Even if the actual experience of practicing DDD has ‘mystical’ aspects (Casey knows I’m a bit skeptical about this, though I think I do get the point…to a point), many of the main concepts of DDD can be explained succinctly and his blog series proves it.

I actually want to be able to apply some of what he explains to my ‘DDD-Lite’ ‘series’ of posts, which has been a bit meager recently (though that is because I’ve actually been applying some of what I had already mentioned). One point (about eliminating setters) I hope to post about soon.

Regardless, check out his series. Great stuff.

posted @ Friday, February 20, 2009 8:06 PM | Feedback (1)

And There You Go

Though it will make no difference, and is a move of desperation, the Pens’ pathetic effort against the Maple Leafs was the sort of game that gets a coach fired, and he was.

So Therrien is gone less than a year after coaching the team to the Cup Finals. Obviously, the team’s poor play was all his fault.

There had been a lot of rumors about whining that the players didn’t like him. Well, you stunk it up enough to get rid of him, so shut the hell up and play now.

posted @ Sunday, February 15, 2009 8:31 PM | Feedback (2)

Stick a Fork in Them

Not only will the Penguins not win the Stanley Cup, they won’t even make the playoffs.

The last time I was going to write this (a couple of weeks ago), the Pens were down 3-0 at home to the Tampa Frickin’ Bay Lightning, and Malkin decided to play Superman, and they won 4-3 in OT. Okay, fine. The Pens are back, blah blah blah.

Tonight against the not very good Maple Leafs, they went up 2-0 in the 1st. And promptly decided to stop playing for the rest of the game, to give up six unanswered and lose 6-2.

This is a gutless, weak, heartless team. And they aren’t very good either. They have a good shot at going 1-2 in the scoring race with Malkin and Crosby. And then watching in April.

Good thing I have the Canucks to fall back….oh, right. That Sundin signing has gone well.

Go Sharks.

posted @ Saturday, February 14, 2009 8:38 PM | Feedback (1)

SRP, A Problem

SRP, the Single Responsibility Principle, has its canonical statement as the following:

“THERE SHOULD NEVER BE MORE THAN ONE REASON FOR A CLASS TO CHANGE.”

This is all well and good. But consider the following:

1) A class that has more than one reason to change violates SRP.

2) A class that has more than one function has more than one reason to change.

3) A class that has more than one function violates SRP.

Or:

1a) A class that has more than one reason to change violates SRP.

2a) A class that has more than one property has more than one reason to change.

3a) A class that has more than one property violates SRP.

Now, almost anyone who accepts SRP would reject 3 or 3a (there’s always some nutjob out there, so ‘almost anyone’).

But, 1-3 and 1a-3a are valid arguments, and 1 and 1a simply restate SRP. Therefore, if 3 and 3a are to be rejected, 2 and 2a must be rejected as false.

The problem is explaining exactly why they are false.

Why are they false?

posted @ Friday, February 13, 2009 7:55 PM | Feedback (10)

The Gospel According to Uncle Bob

Update: video of the presentation is here, video of the Q&A is here.

We had Robert C. Martin, aka 'Uncle Bob', come to the Feb. Chicago Alt.Net Group meeting, where he gave a talk and then answered questions from the group. We had a good crowd, about double our normal attendance, as a number of non-.NET people were there, and we got a few new faces who look to become regular members.

He was 'dry-running' his upcoming SD West talk (which he was reading off of....you guessed it, index cards), and it was very entertaining, somewhat informative, and somewhat less convincing (to me). Sergio should have the video up...eventually. Here's my brief rundown (a very rough paraphrase of things I remember/found interesting) and some random comments.

Recap

After 10 years, why are we still talking about Extreme Programming (XP)? Well, we aren’t. What are we talking about? Scrum. How did this happen?

10 years ago, Beck’s book came out. It had two fundamental flaws in it. It had ‘Programming’ in the title, and also ‘Extreme’ in the title. All of the geeks were thrilled, but the business people were horrified.

Shortly thereafter, there was the great meeting of the minds that produced the agile manifesto. Also, something happened that was a good thing (but which had some bad consequences down the road). Someone stripped the ‘geek’ out of agile and called it Scrum. Scrum is a pure subset of XP and focuses on the ‘business’ concepts. If you are doing XP, you are doing all of Scrum, but if you are doing Scrum, you aren’t necessarily doing XP.

And this was good. The business people suddenly liked the sound of this ‘agile’ thing, and it took off. Especially because of the concept of Scrum Certification. Agile adoption within the corporate world spread like wildfire. Agile is everywhere.

But along the way, something happened. Teams that took on Scrum discovered that their velocity took off. For about a year. And then they noticed that it started to level off, and steadily decline. Why was this?

Scrum’s focus is on delivering features, features, features. But, Scrum left out the technical practices of XP that allowed it to be maintainable for the long run, things like TDD and pair programming and refactoring. When all of your work is focused on delivering the next feature (or set of them), you end up with a mess that makes it harder to introduce change down the line.

Why is TDD so important? Consider common complaints about TDD: “It takes too long…we have deadlines…we don’t need to write tests for every single piece of functionality”.

Well, imagine that you are a patient undergoing open heart surgery. In this situation, the surgeon has a definite deadline. If you stay on the external support systems too long, you will die. How do you want the surgeon to behave during this time? The last thing you want is for him to cut corners, to introduce hacks (I had to introduce that bad joke-jdn). You want a professional, a craftsman.

Software developers need to learn to be craftsmen, to stop cutting corners, to stop introducing hacks, all under the excuse of meeting deadlines.

TDD allows you to have a suite of tests that, except for the ‘red’ part of ‘red-green-refactor’, will always pass a build. At any given time, you can stop coding, and the build will pass. You can introduce change, and the build will pass (well, the tests might fail, but you know when and where). You might have to debug once in a while, but you will know when and where. Once you internalize the discipline of ‘red-green-refactor’, you will be empowered to make changes because your tests will give you the confidence to do so.

And when do you refactor? All the time. It is not ‘we will refactor during the next iteration’ but a constant process. It isn’t just ‘feature, feature, feature’, but tested and refactored feature after tested and refactored feature.

The people who criticized XP in the beginning (“It will lead to a mess”) were right but for the wrong reason(s). By bringing XP back into Scrum, all will be restored.

Too many people who aren’t craftsmen have been let into software development, but there is a surge of momentum behind the idea of craftsmanship. Lawyers have the Bar, doctors have the AMA, software developers need something similar, but perhaps the better model is karate. One gains a black belt by going through stages of qualification, learning from mentors. Only those actually qualified at whatever level can claim it. Software development needs the same. Perhaps some authority to grant those ‘qualifications’ in some fashion. This problem is being tackled and will be addressed in the next couple of years.

Random Comments

That’s off the top of my head. I think I’ve done a good job of paraphrasing it, but watch the video when it comes out (I’ll update the post when I know).

I’ve never heard Martin speak before, but it was a good presentation. The title of this post is deliberate, of course, and he was definitely ‘preaching to the choir’ on this topic. But I have some questions about it.

I’m pretty sure that as a historical tale, it was a little tall (and to be fair, I’m sure he would accept this…it’s an hour presentation with 30 minutes for questions, with a pedagogical purpose). I have no doubt there would be others, including those who were at some of the various historical events, who might think it wasn’t exactly a documentary. But let’s leave that aside.

His more sweeping pronouncements about the acceptance and proven quality of Agile within the Corporate world and software development are as much wishful thinking as fact. Both of these things are open to question.

I can give one anecdotal example. I was skimming through a lot of background material before the group meeting, and I found prominently noted a case study that involved XP. And it involved a very large, prominent client. A Fortune 10,20,whatever these days sort of client. A client I myself was familiar with. There it was, listed that this client ‘did XP’, and successfully no less. Hmm, strange, my experience with that client was that it was the antithesis of agile, in any form.

This is the sort of confirmation bias that Yegge pegged perfectly. Someone, in a very large corporation, did Agile, somewhere. Therefore, that corporation does Agile and accepts it. Uh, no.

Similarly, as he went through the various pieces of XP that he went through (TDD, Pair Programming, Continuous Integration, etc.), he painted the rosiest possible picture of how each worked to promote quality. Again, I’m not going to criticize this too much…when I was presenting philosophical papers in that previous career, you present a defense of what you are presenting, a strong case for it. Really honest and secure believers will mention the possible criticisms, but it depends on the forum. For people who are interested/knowledgeable in BDD, I would suggest you watch how he presents TDD and find how many times you will find yourself saying, “Well, yes, but…”. I’m not even sure if I like BDD, but that was my reaction (coders creating APIs without business specifications was what I was imagining).

But, again, it was a dry run for a certain sort of presentation for a certain sort of audience. You can’t really say much about that. Although I will add that the XP books were produced by people who came out of the C3 project which was a failure. XP is clearly not proven, and the people who support Agile have a financial interest in it (e.g., Ron Jeffries blue convertible, by the way, James Bach is right). And Martin’s example of something about Pair Programming that was proven by Keith Braithwaite (“without a counter-example” no less) is something I couldn’t find anywhere, the only thing I found quickly was ~~this, a mixed bag.~~ Updated: thanks to Keith for fixing my faulty memory on this one. What Martin was talking about here involved unit testing and cyclomatic complexity, more information can be found in the presentation here, and at his blog here.

The craftsmen thing gives me pause though. A *lot* of pause. The analogies used in the meeting sure sound nice. Karate was Martin’s example, another questioner talked about orchestras. You become 1st chair violin in the NY Met Philharmonic through a long process, etc. etc. etc. How could you disagree with that? Well, let me give you another analogy.

It was obviously not a pure .NET audience, since the term used was not ‘Mort’ but ‘Grunt.’ What do we do with the fact that the ratio of craftsmen to grunts is (to use the suggested ratio from the meeting) 1:10? You get rid of the grunts.

When I think of that, the analogy that I think of is the Khmer Rouge. A small group of self-appointed elites that want to ‘purge’ the unworthy. Now, obviously, I don’t think that the people who are in favor of ‘craftsmanship’ want to kill anyone, not literally. It is an analogy. But I do think they would like to be able to end the careers of the grunts within software development (if you watch the videos, when you hear someone suggesting shooting the grunts, that is me, BTW). And within the alt.NET movement, I can think of various obvious examples of people who would want to be part of the elite so that they could make their decisions on the basis of ideological reasons.

Besides the fact that I think the qualifications and motivations of some people within the ‘craftsmen’ movement are suspect (though I would have no reason to include Martin in this group from anything in the presentation), there’s a sort of hopelessly naive spirit behind it. It’s a lot easier to have limits on orchestras and karate masters because there just aren’t that many of them. But you need a lot of software developers. To use another analogy, there are, unfortunately, a *lot* of lousy teachers out there, but in large part, that is because you have to have a lot of them. There aren’t that many lousy 3rd basemen in MLB, because there aren’t that many you need.

But anyway…

Regardless of all that, it was a great night. The short planning meeting afterwards produced a number of good ideas. I hope that we will be able to do a Code Camp this year, in particular.

posted @ Thursday, February 12, 2009 7:00 PM | Feedback (9)

Brief Review of Hibernating Rhinos Episode #10: Production Quality Software

A brief description of the episode and a link to the download can be found here. It was produced by Ayende, a well-known and brilliant developer (but more than just a developer) within the .NET space, especially within the Alt.NET space. I can’t really do justice in a short description to who he is or what he has done (surf his blog, and in particular, find his post on leaving active duty from the Israeli Armed Forces (I hope I described that properly), it is a *fascinating* read). I’ve met him once, at the Alt.NET Open Spaces event in Seattle in April of 2008, and probably the best compliment that I can give him is that he is respectful. Even when he thinks you are fundamentally wrong about something, he expresses his opinion in a respectful manner that is natural and unforced. A skill I wish I had.

But enough about that. Again, read his blog.

A notion that has been discussed in many places and on many levels is one of ‘maintainability.’ I’ve discussed it in places like here, here and here, and so don’t want to rehash all of that again, but a brief note.

My professional career in IT began from a SQL architect/developer standpoint but mainly from an operations/production support standpoint. Because of this, I have always viewed maintainability as being much, much more than ‘just’ making code easier for the next developer to maintain. It has also been from the point of view of maintaining code through deployment and production support. It is no small feat to produce code that is more easily maintainable by the next developer who comes along, I don’t want to minimize that. But, truly maintainable code, in my view, involves making deployments maintainable (see some of the above links), and also makes production support of that code maintainable. If I had to coin a term that encompassed what I meant, it might be ‘sustainable code.’

With that in mind, Episode #10 is a very entertaining and enlightening overview of the sorts of things that one can encounter when dealing with code after it has been promoted to production, and the ways in which the problems you face there are different from those you face when developing code before it is promoted.

Roughly speaking, the episode is split into two parts: a list of the types of difficulties he has encountered in dealing with production support issues, and some ways that you can help to mitigate these issues.

The first part involves some of the common problems that one can run into (I will include some of my own experiences as well as what Ayende mentions). All software fails at some point, but the points of failure are often different from the reproducible bugs that occur during normal software development. There is the obvious ‘the software is slow’ issue that can come from a mis-configured router, for instance, but also often occurs because the nature of a production environment is often so incredibly different when it comes to volume (in terms of data size, client access, etc.) from what you will typically encounter before you hit production (I’ve worked with a client that had tens of TBs worth of data in production, and for many many reasons, none of the development or QA environments could be set up to have the same amount of data).

When I was working in the dot.com days, I encountered a number of problems here. During the early days of mvp.com, one of our figurehead owners was John Elway. During the Super Bowl contest between the Rams and Titans, he was interviewed in the 3rd quarter by whichever network was broadcasting the game, and the interviewer tossed off a bunch of softball questions about what he was doing with his life since he had retired. After mentioning the whole ‘spending quality time with the wife and kids’ stuff, he mentioned that he had started a company called mvp.com. We never could get exact metrics, but in the seconds after he mentioned our company, our traffic went from something like 30 concurrent users to tens of thousands. Needless to say, our infrastructure took a moment to reflect on that increase in traffic, and promptly gave up the ghost.

Similarly, when mvp.com was announced at some gala press event, our CEO at the time made the one good decision of his tenure: that surfing the site live during the announcement would be ‘boring.’ This was fortunate. Our networking setup was designed for high availability and redundancy, and so we had entirely mirrored the networking stack between the active side and the passive side. In case the active side failed, the passive side would take over. And it was ‘cutting edge’ stuff. Which meant it had bugs. In particular, the crush of traffic that resulted from the press announcement meant that the passive side lost sight of the expected heartbeat from the active side, and took over. Which meant that the now passive side lost sight of the now active side, and took control right back. Back and forth. With the site becoming unavailable each time the networking gear switched sides, for hours on end, until we killed the redundancy piece (we may have even shut down one entire side of the stack, I forget at this point).

We had, of course, done a number of simulation tests prior to the press event, but we simply couldn’t anticipate the actual course of events that occurred.

Many other production issues are not tied into items as obvious as this. If a system is programmed to handle batches of events every 5 seconds, all is well and good. Unless it turns out that in production, a single batch takes 10 seconds to process, leading to a cascade of failure.
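To make the arithmetic concrete, here is a minimal sketch (mine, not something from the episode) that assumes a single worker handling batches strictly in arrival order. The function name and parameters are hypothetical. It shows how far behind each batch finishes when processing takes longer than the interval between batches.

```python
def completion_lag(interval_s, processing_s, batches):
    """Seconds between each batch's arrival and its completion.

    Assumes one worker that handles batches one at a time, in arrival order.
    """
    lags = []
    free_at = 0  # time at which the worker finishes its current batch
    for i in range(batches):
        arrival = i * interval_s
        start = max(arrival, free_at)  # wait if the worker is still busy
        free_at = start + processing_s
        lags.append(free_at - arrival)
    return lags


# Batches arrive every 5 seconds but take 10 seconds each: the lag keeps growing.
print(completion_lag(5, 10, 4))
# Batches arrive every 5 seconds and take 4 seconds each: the lag stays flat.
print(completion_lag(5, 4, 4))
```

In the first case every new batch waits longer than the one before it, which is the cascade. In the second case the worker keeps up and the lag never moves.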

Other production problems occur when you have a schedule for jobs that need to run over an interval of time, say, once every two or three weeks. But the job requires permissions on a file system or other process, and those permissions somehow change between week one and week three.
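One modest way to catch this kind of drift earlier is to have the job check its own prerequisites on a much shorter schedule than the job itself runs. The following is only a sketch under that assumption: the function name is hypothetical, and it uses nothing beyond the standard `os.path.exists` and `os.access` calls, which report what the current account can do at the moment of the check.

```python
import os


def missing_permissions(path, need_read=True, need_write=False):
    """Return a list of human-readable problems; an empty list means the check passed."""
    if not os.path.exists(path):
        return [f"{path}: does not exist"]
    problems = []
    if need_read and not os.access(path, os.R_OK):
        problems.append(f"{path}: not readable by the current account")
    if need_write and not os.access(path, os.W_OK):
        problems.append(f"{path}: not writable by the current account")
    return problems


# Run this daily for a job that only runs every few weeks, and alert on a non-empty result.
print(missing_permissions(os.getcwd()))
```

A check like this cannot guarantee the job will succeed, since permissions can still change between the check and the run, but it narrows the window from weeks to a day.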

These are issues that always lend themselves to ‘after the fact’ remediation (now that we know this can fail, how do we test conditions so it doesn’t fail the next time), but are very difficult to test ahead of time in non-production environments, because they just don’t tend to occur in non-production environments.

The second, more prescriptive, part of Episode #10 discusses ways of working through and around these things.

A very important point that Ayende makes is that the sort of logging that one is used to using for development purposes to combat development bugs doesn’t really work in fighting production bugs. Logging is quite often way too verbose and way too low-level to be of use there.

I can vouch for this from my own experience. A typical thing that one does when setting up a production environment is to use various tools that can produce alerts when there is a problem, alerts that can be routed to, for instance, a production pager. A difficulty that arises is that these alerts often cascade. When there is a problem in the production environment, it often ends up producing such a volume of alerts that the pager is overwhelmed. Log files become so huge that it becomes very difficult to scan through them to find the root cause of an issue.
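A common way to take the edge off an alert cascade is to collapse repeats of the same alert inside a time window, so the pager gets one page plus a count instead of hundreds of pages. This is a minimal sketch of that idea, not a description of any particular monitoring tool. The class name, the alert keys, and the 60-second window are all assumptions made for illustration.

```python
class AlertThrottle:
    """Pages once per alert key per window and counts the repeats it swallowed."""

    def __init__(self, window_s):
        self.window_s = window_s
        self._last_paged = {}  # alert key -> time of the last page sent
        self.suppressed = {}   # alert key -> number of repeats not paged

    def should_page(self, key, now):
        last = self._last_paged.get(key)
        if last is not None and now - last < self.window_s:
            self.suppressed[key] = self.suppressed.get(key, 0) + 1
            return False
        self._last_paged[key] = now
        return True


throttle = AlertThrottle(window_s=60)
for second in (0, 1, 2, 3, 90):
    if throttle.should_page("orders-db-unreachable", now=second):
        print(f"page sent at t={second}")
print(throttle.suppressed)
```

The suppressed counts are still worth keeping, because the number of repeats is itself a hint about how bad the underlying problem is.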

A key concept that Ayende discusses is the use of a message bus, where events in the production environment produce messages that can be consumed by the production database, but also by an operations database. Take something as basic and fundamental as the processing of an order. If you can ‘log’ to an operations database something as simple as the start and end times of each order being processed, you can begin to develop rudimentary but vital reports that show when the processing of an order starts to take longer and longer (but without blocking), instead of waiting for the point at which order processing starts to hard block.
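As a rough illustration of the operations database half of this idea, here is a sketch that uses an in-memory SQLite database from the Python standard library. It leaves the message bus out entirely, and the table name, column names, and function names are my own assumptions rather than anything shown in the episode.

```python
import sqlite3


def open_ops_db():
    """Create a throwaway operations database with one timing table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE order_timing (order_id TEXT, started_at REAL, finished_at REAL)"
    )
    return conn


def record_order_timing(conn, order_id, started_at, finished_at):
    """Store when processing of one order started and finished (in seconds)."""
    conn.execute(
        "INSERT INTO order_timing VALUES (?, ?, ?)",
        (order_id, started_at, finished_at),
    )


def average_duration(conn, since, until):
    """Average processing time for orders started in [since, until), or None if there were none."""
    row = conn.execute(
        "SELECT AVG(finished_at - started_at) FROM order_timing "
        "WHERE started_at >= ? AND started_at < ?",
        (since, until),
    ).fetchone()
    return row[0]


conn = open_ops_db()
record_order_timing(conn, "A-1", 0, 2)
record_order_timing(conn, "A-2", 10, 14)
record_order_timing(conn, "A-3", 100, 110)
print(average_duration(conn, 0, 50), average_duration(conn, 100, 200))
```

Comparing the average for a recent window against an earlier one is about the simplest report there is, but it is enough to show order processing getting slower well before it stops altogether.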

The other key concept that Ayende discusses along these lines, and one that I fully endorse, is the idea that you develop and design your code with production support in mind. During the normal course of operations, what will the support staff be looking at? What are the events that will trigger production support elevation, and how can you give proactive indications of pending problems before they trigger elevation?

Overall, I think this is a great episode to watch, and one that all developers who believe in the notion of maintainability should take note of.

posted @ Wednesday, February 04, 2009 11:37 PM | Feedback (0)
