Why Perfect Testing is Impossible Without Mocks
Unit tests should exercise isolated pieces of code. This is difficult when code has dependencies. The conventional solution is to write what are often called mocks. When a function under test calls some complicated component, there's no need to actually run that component in the test: if the component only gets called with two values when running the test, then just replace it with something that only works on those two values.
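As a concrete illustration (all names here are invented): suppose the component under test normally queries a pricing service over the network, but the test only ever asks about two SKUs. A hand-rolled stand-in that knows only those two values is enough:

```typescript
// A hand-rolled stand-in: the real PricingService might hit the
// network, but this test only ever asks about two SKUs, so a
// two-entry lookup suffices.
interface PricingService {
  priceOf(sku: string): number;
}

const stubPricing: PricingService = {
  priceOf(sku: string): number {
    if (sku === "SKU-1") return 999;   // the only two values
    if (sku === "SKU-2") return 1499;  // this test will ask for
    throw new Error(`stub has no price for ${sku}`);
  },
};

// The function under test depends only on the interface.
function cartTotal(pricing: PricingService, skus: string[]): number {
  return skus.reduce((sum, sku) => sum + pricing.priceOf(sku), 0);
}

console.assert(cartTotal(stubPricing, ["SKU-1", "SKU-2"]) === 2498);
```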
But mocks are under attack. They make tests depend on implementation details! They make tests pass when the system fails! Critics are making a mockery of mocks. Just in the last few months, I've seen Testing Without Mocks: A Pattern Language, Why Mocks are Considered Harmful, and Why Mocking is a Bad Idea.
Few defenders are mobilizing. Hillel Wayne only wrote a semi-defense of mocking. But still, there are parts of mocking that even he considers indefensible: having tests check that a mock was called. In Java's Mockito framework, this looks like verify(marketingManager).sendMarketingEmail(user), which checks that, during test execution, the mocked MarketingManager object had its sendMarketingEmail method called on user. In JavaScript's Jest framework, it looks like expect(sendMarketingEmailMock).toBeCalledWith(user). This is a recipe for tautological tests that need to change every time you tweak a parameter. No wonder it's been called indefensible!
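For concreteness, here is a minimal Jest sketch (names invented) of the style being criticized: the test's sole assertion is that the mock was called.

```typescript
type User = { email: string };

// The function under test, with its dependency injected.
function signUp(user: User, sendMarketingEmail: (u: User) => void): void {
  // ... create the account ...
  sendMarketingEmail(user);
}

test("signing up sends a marketing e-mail", () => {
  const sendMarketingEmailMock = jest.fn();
  const user: User = { email: "alice@example.com" };

  signUp(user, sendMarketingEmailMock);

  // The assertion critics call tautological: it restates the
  // implementation ("we call this function with this argument").
  expect(sendMarketingEmailMock).toBeCalledWith(user);
});
```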
In spite of that, I shall now argue why ideal testing actually requires doing this.
The Secret Brilliance of Mocking
Before we can ask whether a specific way of testing is good, we must ask what makes a test good. And for that, there's an absolutely beautiful theory of testing that explains what tests do and what should and shouldn't be checked. I've alas never seen it written down anywhere (including by me; I've only taught it to a few dozen people), although I suspect that a decent number of practitioners of formal verification already have a pretty clear understanding of the ideas I'm about to present in brief.

In short, tests provide evidence for properties, and these properties compose into an argument that a program is overall correct. For example, a concrete test such as assertEqual(add(1, 1), 2) can be seen as evidence that the add function correctly implements addition. It can also be seen as evidence that 1+1=2, that add(x,y) always returns 2, or that the result of addition is always positive. Understanding which properties any concrete test provides evidence for is equivalent to the age-old problem of philosophical induction: how to generalize knowledge from examples. Philosophical induction is full of paradoxes, such as the claim that seeing a green emerald is evidence it will change color to blue next year, and the general problem is still debated, although constraints on program simplicity provide a partial resolution in the testing context (see my Strange Loop talk). But, however we resolve the riddle, this is what you are doing every time you come to believe a function works in general after seeing it work on examples.
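One way to make the "evidence for a property" reading concrete is a property-based test, which samples many examples of an explicitly stated property rather than checking a single one. A minimal sketch using Jest with the fast-check library (the add function here is a stand-in):

```typescript
import fc from "fast-check";

// Stand-in implementation under test.
const add = (x: number, y: number): number => x + y;

// One concrete test: one piece of evidence.
test("1 + 1 = 2", () => {
  expect(add(1, 1)).toBe(2);
});

// A property-based test makes the generalization explicit:
// state the property, then sample many examples of it.
test("add is commutative", () => {
  fc.assert(
    fc.property(fc.integer(), fc.integer(), (x, y) =>
      add(x, y) === add(y, x)
    )
  );
});
```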
Now, the properties a program satisfies can be written as logical formulas, and pieces of the logical formulas can be given names, much as we turn fragments of code into functions. So a job queue may be given a spec composed of properties such as "If a job is submitted, then it is eventually run," and the spec may be wrapped into a named predicate, JobQueueSpec. We can now translate a sentence like "The module j is a correct job queue implementation" as JobQueueSpec(j).
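In code, a named spec predicate looks something like a reusable conformance suite that any implementation can be run through. A sketch, with a drastically simplified, invented JobQueue interface:

```typescript
interface JobQueue {
  submit(job: () => void): void;
  drain(): void; // run all pending jobs
}

// JobQueueSpec(j) as a reusable suite: pass any implementation
// through the same properties.
function jobQueueSpec(makeQueue: () => JobQueue): void {
  test("a submitted job is eventually run", () => {
    const j = makeQueue();
    let ran = false;
    j.submit(() => { ran = true; });
    j.drain();
    expect(ran).toBe(true);
  });
}
```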
Now, suppose we want to write a web server that uses a job queue. I'll use WS(j) to denote a web server implementation which imports job queue implementation j. The web server's correctness should depend only on the interface of a job queue, and not its implementation: we should be able to make non-breaking changes to the job queue without impacting the correctness of the web server, or replace it with a new implementation entirely. So we should not be trying to show WebServerSpec(WS(j)), but instead ∀j. JobQueueSpec(j) ⇒ WebServerSpec(WS(j)). This formal notation is essentially a fancy way of saying: we should test the web server in isolation, independent of the job queue.
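In code, the quantifier shows up as parameterization. A sketch, reusing the invented JobQueue interface from above: the web server takes the queue as an argument, and the test may hand it any implementation satisfying the spec.

```typescript
// WS(j): the web server is parameterized over the job queue, so
// the correctness argument quantifies over implementations of j.
function makeWebServer(j: JobQueue) {
  return {
    handleUpload(file: string): void {
      j.submit(() => {
        /* process the file */
      });
    },
  };
}

// To gather evidence for ∀j. JobQueueSpec(j) ⇒ WebServerSpec(WS(j)),
// instantiate j with any spec-satisfying queue — here, a trivial
// in-memory one.
test("an upload enqueues a processing job", () => {
  const submitted: Array<() => void> = [];
  const j: JobQueue = {
    submit: (job) => { submitted.push(job); },
    drain: () => { submitted.splice(0).forEach((job) => job()); },
  };
  const server = makeWebServer(j);
  server.handleUpload("photo.png");
  expect(submitted.length).toBe(1);
});
```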
So the first big benefit of mocks is that they make it possible to test modularized properties like ∀j. JobQueueSpec(j) ⇒ WebServerSpec(WS(j)). On top of clarifying the common intuition supporting mocks, it also gives a nice framing for the commonly-criticized pitfall of mocks, which is that it's challenging to make sure your test is providing strong evidence for the "for all job queues" part and not just the "for this particular mock" part. The problem of philosophical induction strikes again.
But it's also possible to test this just by actually swapping in different full implementations of job queues. This brings us to the second big benefit, something that can't really be done without mocks. But first: how do we define "works?"
The spec for a function takes the form: given a set of inputs, the behavior of a function must satisfy some property. The behavior of a function has several components. The main parts of behavior are its output, i.e. the actual return value, and its effects. Effects include mutating a shared variable, raising an exception, and writing to disk. It's also possible to consider other aspects of behavior which are neither outputs nor effects. Off the top of my head, there's also resource consumption (time, memory, bandwidth) and things like which files the function reads from, which is an example of a coeffect.
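A toy sketch of the output/effect split (names invented):

```typescript
const auditLog: string[] = []; // shared state

// Output: the returned number.
// Effect: the mutation of the shared auditLog.
function withdraw(balance: number, amount: number): number {
  auditLog.push(`withdraw ${amount}`); // an effect
  return balance - amount;             // the output
}
```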
Let's imagine a procedure for creating a new deal on an e-commerce website. It needs to do a number of things: update the product database, tweak the front page, and send a marketing e-mail. How do you test that the marketing e-mail was sent?
The basic effects available to a program are primitive: raise exception, set variable, make a system call. The behavior of a function contains a stream of these events: send packet 0x1234, send packet 0x5678, mutate variable, send another packet, etc. Describing sending a marketing e-mail in terms of these basic effects would be very cumbersome. And if you changed a low-level detail, such as moving the actual mail-sending daemon to another process or another machine, the effect stream observed would change completely.
But we can define certain patterns of basic effects to actually be higher-level effects. A certain sequence of packets constitutes the effect of sending an e-mail, and a certain family of e-mails are marketing e-mails. Many other operations should be phrased only in terms of whether an e-mail or a marketing e-mail was sent, and changing the interpretation of what it means to "send a marketing e-mail" should not affect whether these other operations are correct. The correspondence between the idea "send a marketing e-mail" and its actual instantiation is a true abstraction, and so the ideas of "send an e-mail" and "send a marketing e-mail" are rightfully called abstract effects. And you can keep building this tower higher, perhaps grouping many possible sequences of e-mails into an even more abstract effect called a campaign.
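One way to render that in code (a sketch; every name here is invented): an interface names the abstract effect, and an implementation fixes which sequence of primitive effects counts as it.

```typescript
type User = { email: string };

// Stand-in for the primitive effect: writing bytes to a socket.
function smtpWrite(line: string): void {
  process.stdout.write(line + "\n"); // placeholder for the real write
}

// The abstract effect, named at the level callers reason about.
interface Emailer {
  sendMarketingEmail(user: User): void;
}

// One interpretation: this particular sequence of primitive
// effects *counts as* sending a marketing e-mail. Moving the mail
// daemon elsewhere changes this class, not its callers.
class SmtpEmailer implements Emailer {
  sendMarketingEmail(user: User): void {
    smtpWrite("MAIL FROM:<deals@example.com>");
    smtpWrite(`RCPT TO:<${user.email}>`);
    smtpWrite("DATA ...marketing template...");
  }
}
```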
There's a problem with these higher-level effects though: they don't exist. That is, they don't exist in the way that a tree exists; they exist in the way democracy exists. They are a pattern we impose upon the world. If you run strace on your program, you'll still just see a stream of: send packet, send packet, send packet.
So how do we test that the function sent a marketing e-mail?
By calling expect(sendMarketingEmailMock).toBeCalledWith(user), of course!
This is the punchline: mocks make it possible to test the correctness of a function in terms of abstract effects.
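Concretely, the deal-creation procedure from earlier might be tested like this (a Jest sketch; the Effects interface and all names are invented for illustration):

```typescript
type Deal = { product: string; discount: number };

// The abstract effects the procedure is specified in terms of.
interface Effects {
  updateProductDb(deal: Deal): void;
  tweakFrontPage(deal: Deal): void;
  sendMarketingEmail(deal: Deal): void;
}

// The procedure under test, phrased against the abstract effects.
function createDeal(fx: Effects, deal: Deal): void {
  fx.updateProductDb(deal);
  fx.tweakFrontPage(deal);
  fx.sendMarketingEmail(deal);
}

test("creating a deal sends a marketing e-mail", () => {
  const fx: Effects = {
    updateProductDb: jest.fn(),
    tweakFrontPage: jest.fn(),
    sendMarketingEmail: jest.fn(),
  };
  const deal: Deal = { product: "widget", discount: 0.2 };

  createDeal(fx, deal);

  // The test speaks the language of the abstract effect —
  // no packets, no SMTP, no daemon location.
  expect(fx.sendMarketingEmail).toBeCalledWith(deal);
});
```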
And that's basically it. While mocks done poorly can increase dependence on implementation details, you really do need them if you want to check that a function performs some higher-level effect without referencing the concrete definition of that effect.
Does this really require mocks? Is there any other way to test abstract effects?
For the most part, I think the answer is no. Any solution needs to be able to take code that produces a stream of low-level effects and recognize that these low-level effects constitute a higher-level effect. These higher-level effects exist only in our mental models of code. But sometimes there is a function whose only behavior is to produce that abstract effect, and the only stable way to produce that effect is to call said function. Mocking exploits this correspondence. Without a correspondence between functions, which exist in the code, and abstract effects, which exist only in our mental models, the options are limited. My best idea is to use a tool like Yogo to scan the code or trace of a program and check that some code executed which matches one of the possible implementations of the abstract "send marketing e-mail" action. Another alternative is to write down the abstract effects in some other fashion. This is the approach taken in "Bounded Abstract Effects," a paper from my old undergrad lab that details an extension to the research language Wyvern building abstract effects into the type system. I find this last option appealing. But, on the whole, it sounds like we're stuck with mocks.
Mocks may have been invented by a random London ad-tech company. But they actually map beautifully into a deep theory.