Adding Tests to a Legacy System through New Requirements

I’m going to try to describe the way that I would approach adding tests to a system that doesn’t have them, and wasn’t built to support them.  It will be a simple example, and isn’t intended to be anything other than a way in which I might approach this situation.

When I say that it wasn’t meant to support them, I mean it has certain characteristics:

  • Uber methods that access the file system, do some processing, make a call to a web service, send a new file through FTP, etc.
  • No interfaces
  • No ‘full’ dev environment (more on this below)
  • No CI processes

Obviously, in a situation like this, there are some challenges.  It is tempting to begin to think of ways to re-architect the entire system, but like many temptations, it is one that should be avoided, unless and until there is a directive that will support it.  Instead, I think it is better to begin to put hooks into the system in a simpler fashion: add tests as new requirements come in.  This has a number of advantages: it limits the amount of work that needs to be done, it makes it easier to define and develop the tests themselves, and it can function as a pilot program for introducing testing into an environment and/or for a team that isn’t familiar with it.

A couple of obvious caveats are in order.  You have to have the ability to introduce tests in the first place.  If you are lucky, the other members of the team will be amenable to it.  Also, as I mentioned at the start, I won’t pretend that the approach I will be describing is the best way or the only way or that I will do it perfectly.  In other words, blah blah blah.

The Requirement

Big boss man: “We need to audit the accounts we send externally against the audit file they send back.  No one will read the report, it’s just to get compliance off my back.”

Obviously, this is a somewhat vague requirement, but that is normal in most cases.  Depending on who is generating the requirements, you will get more or less technical detail, more or less justification for it, and more or less indication of how important or urgent the requirement is.  You will then need to do more or less work to get the technical details you need in order to write code.

In this case, the technical details are something like this:

  • The accounts are taken from a database and sent as XML to an FTP site, which will then process them through a web service.
  • The audit file sent back is a compressed .gz text file, comma-delimited, with a trailer record that gives the number of accounts in the file, and a timestamp.
  • There are clear rules of what counts as a discrepancy.
  • As part of the auditing process, a file that lists all accounts with discrepancies must be created and stored in a more or less common place.
  • There is a more or less standard email format for other reports.

How to start

Once you have enough detail, you then need to determine how to start.  As we’ve discussed, we will want to create specifications of the things that need to be tested, before we write the code that implements them.

If you noticed, I used the word ‘specifications’, not ‘tests’.  Unit testing is often described in terms of TDD (Test-Driven Development), but it seems to me that there are problems with this.  TDD focuses the developer on individual classes, but that might not be the right place to start.  Furthermore, it doesn’t really give any clear guidance of what to start with.  Since there are databases and files and FTP and email involved, code will obviously need to be written to manage these things, but it is a mistake to start here.

BDD (Behavior Driven Development) tells you to start with the behavior of the system, but that is also (potentially) vague.  What is least vague (since there is almost always going to be vagueness along the way) are the specifications.  What are these?

Start with the ‘domain’

I’m using scare quotes here because “Domain” is an overloaded term.  In a situation like this, you are unlikely to have a fully developed domain model, and may well not have even a partially developed one.

But in this case, there are two items that are key: the idea of an account and the idea of a discrepancy.  This is the heart of the requirement.  Given what we might call a source collection of Accounts and a target collection of Accounts, each Account will either produce a match or a discrepancy.  If an account is in the source, but not in the target, that’s a discrepancy.  If an account is in the target, but not in the source, that is a discrepancy.  If an account exists in both the source and the target, but doesn’t meet the rules, that is a discrepancy.  Otherwise, there is a match (one could argue that the idea of a match also exists in the ‘domain’ and that it should also be made explicit…whether to do so or not is often a matter of discretion).

Given this, there are a couple of obvious specifications that can be written out:

  • when_an_account_exists_in_the_source_but_not_in_the_target_there_is_a_discrepancy
  • when_an_account_exists_in_the_target_but_not_in_the_source_there_is_a_discrepancy
  • when_an_account_exists_in_both_the_source_and_target_but_fails_to_meet_the_matching_rules_there_is_a_discrepancy
  • when_an_account_exists_in_both_the_source_and_target_and_meets_the_matching_rules_there_is_no_discrepancy
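
As a rough sketch, here is how those four specifications might look once written out.  This is Python rather than MSpec, purely for illustration.  The `Account` fields, the `Discrepancy` type, the balance-based matching rule, and the `find_discrepancies` name are all assumptions of mine, since the real matching rules aren’t spelled out here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Account:
    # Hypothetical fields; the real account shape is not part of the requirement.
    number: str
    balance: int


@dataclass(frozen=True)
class Discrepancy:
    account_number: str
    reason: str


def balances_match(source: Account, target: Account) -> bool:
    # Stand-in for the real matching rules, which are assumed here.
    return source.balance == target.balance


def find_discrepancies(
    source: Iterable[Account],
    target: Iterable[Account],
    rule: Callable[[Account, Account], bool] = balances_match,
) -> List[Discrepancy]:
    """Compare two collections of accounts, keyed by account number."""
    source_by_number = {a.number: a for a in source}
    target_by_number = {a.number: a for a in target}
    found = []
    # Walk the union of account numbers so neither side is missed.
    for number in sorted(source_by_number.keys() | target_by_number.keys()):
        s = source_by_number.get(number)
        t = target_by_number.get(number)
        if t is None:
            found.append(Discrepancy(number, "missing from target"))
        elif s is None:
            found.append(Discrepancy(number, "missing from source"))
        elif not rule(s, t):
            found.append(Discrepancy(number, "fails matching rules"))
    return found


def when_an_account_exists_in_the_source_but_not_in_the_target_there_is_a_discrepancy():
    result = find_discrepancies([Account("A1", 100)], [])
    assert result == [Discrepancy("A1", "missing from target")]


def when_an_account_exists_in_the_target_but_not_in_the_source_there_is_a_discrepancy():
    result = find_discrepancies([], [Account("A1", 100)])
    assert result == [Discrepancy("A1", "missing from source")]


def when_an_account_exists_in_both_the_source_and_target_but_fails_to_meet_the_matching_rules_there_is_a_discrepancy():
    result = find_discrepancies([Account("A1", 100)], [Account("A1", 75)])
    assert result == [Discrepancy("A1", "fails matching rules")]


def when_an_account_exists_in_both_the_source_and_target_and_meets_the_matching_rules_there_is_no_discrepancy():
    result = find_discrepancies([Account("A1", 100)], [Account("A1", 100)])
    assert result == []
```

Nothing in this sketch touches a database, a file, or FTP.  Both collections are simply handed in, which is what makes it possible to write and run the specifications locally.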

A couple of things are worth noting here.  The underscore syntax porn is something that has always bothered me, but I’ve learned to live with it for two reasons.  First, when using something like MSpec, the underscores are replaced with spaces in the reports that are generated.  Second, at least for me, I’ve come to accept that it is easier to read specifications written with underscores than with camel casing.  I still notice the underscores, obviously, but I’m used to them enough by now that it is almost like reading a sentence with spaces.  Almost.

Another thing to notice is that when writing out specifications, it is usually evident that there are many of them.  TDD encourages writing one and only one test, then writing the code, then writing the next one, etc.  This has always bothered me, as I personally find that I naturally think of multiple specifications at once.  If you are lucky enough to be able to work through the specifications with a business user and/or an end user (sometimes these are the same user, sometimes not), you will naturally work out many of these at a time.  I don’t see any reason why you shouldn’t write out multiple failing specifications at a time, and then make them pass one by one.  If the specifications cross many ‘domain’ items, it would probably be best to group them, and then work with a group at a time.

Most importantly, especially since there is no full dev environment, by focusing on the ‘domain’ you can avoid dealing with the ‘tough’ things upfront, like file access, FTP work, email notifications, etc.  Even if you could run an end-to-end test in a dev environment, you don’t want to be tied down here.  You assume that, at some point, you will be able to get the source data from a database, and that you will be able to access a file, and that you will be able to send an email, etc.  But for now, to start, you should care only that, given source and target data, you can identify all of the possible discrepancies and properly find matches.

Needless to say, you can write this code easily and locally.  Given a source collection and given a target collection, produce the results you want.  Red-green-refactor or what have you, easy money.

A crack at a ‘tough’ item, file access

I’m not going to go through all the different ways one might handle database access, email notification, etc.  There are many different ways to approach these.  But I will try to give a general sense of how I might approach all of them by talking about how you might handle file access.

Since it is always good to use services, the immediate idea is to create an IFileService.  Interfaces are good.  Since one needs to get a set of accounts from an audit file, the immediate thought is to create a GetAccountsFromAuditFile method on this interface.

But is this right?  Once you create such a specific method on an interface, then every implementation of that interface has to, well, implement it.  And that doesn’t seem so great.

This is where discretion comes in.  Given the specific example that I’ve given, I think it is okay to do this.  I don’t know, as I’m adding the specifications/tests for the new requirements, exactly what will come from it.  I could end up with a half a dozen methods on this interface.  I could end up with many dozens.  Since I don’t know for sure, start with what is simplest.
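
To make that concrete, here is a minimal sketch of such an interface with its single, specific method, plus a fake that specifications can use in place of real file access.  It is Python standing in for what would be an IFileService in .NET.  The `Account` fields, the `InMemoryFileService` fake, and the `target_account_numbers` consumer are hypothetical names used only for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Account:
    # Hypothetical fields; the real account shape is an assumption.
    number: str
    balance: int


class FileService(ABC):
    """Python stand-in for the IFileService interface."""

    @abstractmethod
    def get_accounts_from_audit_file(self) -> List[Account]:
        """Return the accounts listed in the audit file."""


class InMemoryFileService(FileService):
    """Fake for use in specifications; it never touches the file system."""

    def __init__(self, accounts: Iterable[Account]) -> None:
        self._accounts = list(accounts)

    def get_accounts_from_audit_file(self) -> List[Account]:
        # Return a copy so callers cannot change the canned data.
        return list(self._accounts)


def target_account_numbers(file_service: FileService) -> Set[str]:
    # Hypothetical consumer: it depends only on the interface,
    # so it works the same with the fake or a real implementation.
    return {a.number for a in file_service.get_accounts_from_audit_file()}
```

With the fake in place, the code that consumes the audit accounts can be specified without a real audit file, and the real implementation can be written later.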

More importantly, I think it is best to avoid creating generic GetFile methods on this interface, though it is tempting to do so.  When faced with a non-specific method like GetFile, it is easy to get lost in trying to think of all of the various possible ways that you might use it in the future, ending up with multiple parameters to handle all of the possible permutations.  Instead, stick with the actual requirements.  You need to retrieve an audit file for a specific purpose.  Make the code explicit with a GetAccountsFromAuditFile method.  The bad thing about this is that you won’t be able to reuse this method for future requirements.  The good thing is that you won’t be able to reuse this method.  Reuse is good when it is well thought out reuse.  Reuse is bad when it is random and non-specific.
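
Here is a sketch of what a deliberately specific method might look like for the audit file described in the requirement (gzip-compressed, comma-delimited, with a trailer record).  It is Python, for illustration only, and the record layout is an assumption of mine: detail lines of `account_number,balance` and a final line of `TRAILER,<count>,<timestamp>`.  The actual layout isn’t given here.

```python
import csv
import gzip
from typing import Iterable, List, Tuple


def parse_audit_lines(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Parse audit records and check them against the trailer count.

    Assumed layout (not confirmed by the requirement):
      detail:  account_number,balance
      trailer: TRAILER,<count>,<timestamp>
    """
    rows = [row for row in csv.reader(lines) if row]  # skip blank lines
    if not rows or rows[-1][0] != "TRAILER":
        raise ValueError("audit file has no trailer record")
    *details, trailer = rows
    expected = int(trailer[1])
    if expected != len(details):
        raise ValueError(
            f"trailer says {expected} accounts, file has {len(details)}"
        )
    return [(number, int(balance)) for number, balance in details]


def get_accounts_from_audit_file(path: str) -> List[Tuple[str, int]]:
    # Does exactly one job: read the audit file.  There are no flags
    # or options for other file types.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return parse_audit_lines(handle)
```

Splitting out `parse_audit_lines` is a choice made for this sketch: the trailer check can be specified with plain lists of strings, and only the thin `get_accounts_from_audit_file` wrapper needs a real file.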


When you can, start with the ‘domain’ logic.  Test to the requirements.  When you need to start writing code to implementation details, try to find a way to limit the specificity of those details.

posted on Sunday, September 27, 2009 10:54 PM