I have always been a booster for testing. I recently posted about resisting the temptation to fake your tests in order to put your system into the green. This is a kind of fetishism, of course – an attempt to change something fundamental in the world by altering a trivial signifier. But what if your tests are themselves already a kind of fakery? How many of the tests we write are actually useful, and how many of them are more a comforting exercise in box tickery?
Let’s define some terms, because not all tests are created equal. As you probably know, a unit test is designed to operate at the level of a component – usually a class. The idea is to test the component in isolation from the rest of a system. That way, you focus in on the functionality at hand rather than implicitly testing all the logic the test incidentally relies on.
That makes a lot of sense. In writing a test for a class, it’s the class I want to put through its paces and not the vast tangle of the wider system. It’s quite hard to isolate a method from its context but, thanks to test doubles (mocks and stubs), it can be quite easily achieved. A mock or a stub is a sneaky component that stands in for a real counterpart. They are similar, except that a mock is a little more Machiavellian and also spies on the behaviour of the component that calls it. Either way, you can create a set of imposter objects primed to accept method calls and to return any necessary information, freeing you up to concentrate your testing on a single class or method.
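Here, for instance, is roughly what that looks like in Python with unittest.mock. The OrderService class and its gateway collaborator are invented purely for illustration; the point is only the shape of the technique.

```python
from unittest.mock import Mock

# Hypothetical class under test, invented for illustration.
class OrderService:
    def __init__(self, gateway):
        self.gateway = gateway

    def place(self, amount):
        # Delegates the real work to the gateway and wraps the result.
        charge_id = self.gateway.charge(amount)
        return {"charge_id": charge_id, "amount": amount}

def test_place_order():
    # A stub-like double: primed to return canned data...
    gateway = Mock()
    gateway.charge.return_value = "ch_123"

    result = OrderService(gateway).place(42)

    assert result == {"charge_id": "ch_123", "amount": 42}
    # ...and, used as a mock, it also lets us spy on how it was called.
    gateway.charge.assert_called_once_with(42)
```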
Our unit tests were bullshit…
So we’re all good then? The trouble is that modern OO design tends to result in nicely focused components. Many class methods don’t actually do that much. They take in data at one end, interact with the wider system via helpers, and generate a result at the other. The actual work they perform, once you have fully isolated them, can be pretty minimal.
Consider a controller method in a Web application. Of course, the details vary from framework to framework, but the problem space always suggests a similar structure: an incoming request, calls into the wider system to get work done, an outgoing response.
If we’re mocking everything apart from the controller itself then, well, there’s not much to it. It’s really a series of conditionals. So first, is the incoming data sane? If not, the method might invoke an error flow. Otherwise, it can pass the data on to some system components to get work done, then populate a Response object with the returned data. This will find its way back to the user, probably via a view. The view itself will likely be tested elsewhere, if at all. It is not, after all, part of the controller.
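Stripped of framework detail, the shape is something like this sketch. All the names and collaborators here are invented; the request, validator and service would be doubled in a unit test.

```python
from dataclasses import dataclass

@dataclass
class Response:
    # A stand-in for whatever response object the framework provides.
    status: int
    body: dict

def update_profile(request, validator, profile_service):
    """A framework-agnostic sketch of the controller shape described above.
    request, validator and profile_service are hypothetical collaborators."""
    data = request.json()

    # First conditional: is the incoming data sane?
    errors = validator.validate(data)
    if errors:
        return Response(status=422, body={"errors": errors})

    # Otherwise, hand the real work to the wider system...
    profile = profile_service.update(request.user_id, data)

    # ...and populate a response with whatever came back; the view takes it from here.
    return Response(status=200, body={"profile": profile.to_dict()})
```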
So that’s the first problem: Once you have discounted the work that surrounds a target method, there’s often not that much to test.
A second issue lies with test doubles. Not their implementation, which is ingenious and satisfying. The problem is that the data they must generate is often quite complex and the processes they mock are often involved. Consider, for example, the wall of JSON data generated by a Stripe webhook call, or all the work that goes into making a series of API calls and generating collections of objects. The question is: does the data being presented to the method under test actually model sane scenarios? The real world is complex, which is why it ends up being represented by complicated JSON files. Systems, too, are complex – if not in their individual parts then in the way those parts interact – and they can generate deep data structures. By contrast, test doubles often only manage limited or unrealistic contextual data.
So, for example, when representing a Stripe webhook payload we might copy a JSON document genuinely generated by Stripe (all several hundred lines of it) and adapt it to reflect a test scenario. We drop this document into a fixture directory and serve it via a test double. We take the win when the test passes. But, of course, there are all sorts of variations that Stripe might provide according to circumstances – error conditions, or different shades of success, tweaks according to the location or status of the customer, and so on.
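In practice that tends to look something like the following sketch, with a hypothetical fixture layout and a hypothetical client wrapper standing in for the real Stripe SDK.

```python
import json
from pathlib import Path
from unittest.mock import Mock

# Hypothetical layout: captured Stripe payloads live in a fixture directory
# alongside the tests. The real documents run to hundreds of lines.
FIXTURES = Path(__file__).parent / "fixtures"

def load_fixture(name: str) -> dict:
    return json.loads((FIXTURES / name).read_text())

def stubbed_stripe_client(fixture_name: str) -> Mock:
    """Prime a double to replay one frozen version of the world."""
    client = Mock()
    # parse_event is a hypothetical wrapper method, not the real SDK call.
    client.parse_event.return_value = load_fixture(fixture_name)
    return client

# The handler under test receives stubbed_stripe_client("checkout_completed.json")
# in place of the real client. Every other variation Stripe might send (error
# conditions, regional differences, different shades of success) needs its own
# captured fixture, or it simply goes untested.
```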
A comeback here might be that a good test would run against a truly representative range of datasets. This is fair. And it almost never happens – because, in generating realistic representations of the surrounding system and world, the tester’s focus must leave the unit to encompass involved and complex external domains. In other words, in writing a unit test, you end up having to reproduce a working version of the universe around it anyway. And not everyone has time for that.
So the temptation is to freeze an idealised version of the component’s world and write a test around that. The reason for this is that, often, unit tests are only written to provide a satisfying field of green. A nice full coverage report with passing test cases. That is satisfying even if many of those cases, on deeper inspection, reveal very little of value beyond telling us that, if the universe conforms to an arbitrary invented state, values go in at one end and values are spat out at the other.
This is an anti-pattern in any target-based organisation. The act of ticking a box is often confused with the value the box was originally designed to confirm. I recommend David Graeber’s Bullshit Jobs as a great primer for instances of this tendency in the wider worlds of academia and corporate life.
Of course, in some circumstances, a component flips this model. It takes relatively simple input and does complex work with it. In that case, a unit test can be invaluable, protecting you against the bugs that can occur when changes are made. In our recent work, however, chasing an elusive field of green across many controllers, we discovered that most of the components we tested fell into the input/output mode rather than that more interesting algorithmic scenario.
Even so, the quality of our code was greatly improved by the end of the process. Our system was more stable and less prone to bugs. Why?
… But they improved our system
Even if unit tests do not always provide a huge benefit when run, we were surprised to find that adding them nonetheless had a very positive impact on our system. That was because, in preparing our code for testing, we were forced to review and refactor it.
First of all, the controllers were carefully examined as part of this process and multiple bugs were found and fixed.
Then, we took steps to make the controllers more independent of the surrounding system. In concrete terms, that meant that we rooted out a few global variables, a few more direct object instantiations and still more static calls. This made it easier for us to use test doubles in place of the system’s concrete classes and thereby isolate the controllers for testing.
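In spirit, and in a Python sketch with invented names rather than our actual stack, the change looked like this: collaborators arrive through the constructor instead of being instantiated directly or reached through globals and static calls, so a test can hand in doubles.

```python
from unittest.mock import Mock

class ReportController:
    """After the refactor: collaborators are injected rather than hard-wired."""

    def __init__(self, repo, session):
        self.repo = repo
        self.session = session

    def show(self, request):
        user = self.session.current_user()
        return {"reports": self.repo.for_user(user)}

def test_show_lists_reports_for_current_user():
    # Because nothing is hard-wired, the whole surrounding system can be doubled.
    repo, session = Mock(), Mock()
    session.current_user.return_value = "user-1"
    repo.for_user.return_value = ["q1-report"]

    response = ReportController(repo, session).show(request=Mock())

    assert response == {"reports": ["q1-report"]}
    repo.for_user.assert_called_once_with("user-1")
```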
Without getting too deep into the weeds, we achieved this by implementing a dependency injection container and by leveraging our existing service locator.
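Without reproducing our actual container, a deliberately minimal sketch of the idea might look like this; the names and wiring are hypothetical. The container knows how to build each service, so controllers never construct concrete classes themselves.

```python
class Container:
    """A deliberately minimal dependency injection container sketch."""

    def __init__(self):
        self._factories = {}
        self._instances = {}

    def register(self, name, factory):
        # Each factory receives the container, so a service can declare its own dependencies.
        self._factories[name] = factory

    def get(self, name):
        # Lazily build each service once and reuse it thereafter.
        if name not in self._instances:
            self._instances[name] = self._factories[name](self)
        return self._instances[name]

# Production wiring registers the real classes, e.g.
#   container.register("session", lambda c: RealSession())
#   container.register("report_controller",
#                      lambda c: ReportController(c.get("report_repo"), c.get("session")))
# while a test container can register doubles under the same names.
```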
In achieving the primary objective of creating truly focused unit tests, we reaped a secondary benefit: our system ended up safer and more transparently designed. Ironically, this improvement was a by-product of preparing our code for testing rather than of the testing itself.
It’s also a little problematic that we got our process backwards. Tests should come before refactoring. We were only confident in managing the risk of carrying out the preparatory work because we already had a functional test suite in place. But that’s another story.
I often rail against fetishism when I write about code and management. But this may have been one case in which the bullshit enriched our world. Cautiously, I’ll take the win.