Arch-Engineer
How Software Gets Bloated
Software Design Tip: How Software Gets Bloated
Emin GΓΌn Sirer, professor of systems at Cornell, writes about how software gets bloated, drawing on an example from his time at AT&T.

Bloat is when a system becomes confounded by competing concerns, where it's impossible to work on the functionality of interest without thinking about a bunch of seemingly unrelated factors.
The first kind of system that gets called bloated is the one where there are 20,000 different variations of a function for every conceivable purpose. This code is definitely large, but not necessarily bloated. You might be able to write some docs for the "essential" parts, and then treat it like a lean machine.
The insidious kind of bloat comes from complicated conditions in your code. Sirer gives the example of a field which represents telephony information in some places and billing in others. Imagine trying to write the documentation for what the ianp_ain32
field does. It will definitely be a conditional of some sorts ("If in routing code, then it means X"). This is an invariant, and every invariant places a burden on the code: there are now more places where the invariant must be established. What was once a simple matter of calling a function in another component is now a game of save-and-restoring all the fields; what was once a matter of glancing at a data structure to get information is now a complicated process of figuring out where in the call path you are. That's making easy things hard. That's bloat.
For fun, I had a peek at how this idea plays out in the Windows API, perhaps the most famously bloated API of all. The very first result I found was the docs for the PlaySound function. It says of the second parameter hmod
"This parameter must be NULL unless SND_RESOURCE is specified in fdwSound," while it says for the first "A string that specifies the sound to play. [...] If this parameter is NULL, any currently playing waveform sound is stopped."
ianp_ain32
is either switch info or billing info. hmod
is either meaningless or not, depending on a different parameter. And the PlaySound
function is actually sometimes the "stop playing sound" function.
It's just a three parameter function, but the description of what it does is quite complex, and ties many things together. It takes a clever programmer to deal with it, and that's why cleverness is dangerous.
Research Corner: The Programmer's Apprentice
This time, we're going to visit a real programming classic. The Programmer's Apprentice project of the 70s and 80s is something of a granddaddy to modern program synthesis work. Their vision was grand: to create a system that could act as a "junior programmer" on your team, and do low-level programming tasks from requirements all the way to optimization. Their motivation was strong, but, when you look at the programs they were trying to synthesize, it was often just a "sort" function specialized to a datatype. This 1987 paper gives a nice overview of the project.
Through a modern lens, the main idea of the Programmer's Apprentice is to represent program constructs as dataflow patterns, which they call "cliches." They look like this:

On the right, they have a high-level representation of pushing onto a stack, which corresponds to the low-level representation on the right. These representations consist of values and the operations done on them, wired together like a pure functional program. The high-level representation of the push operation, in turn, can be used as a building block for yet a higher-level component. Using this, they can reverse-engineer some pretty interesting components automatically:

How big can this scale? I know a company which is using this technique to automatically reverse-engineer the control software for many chemical plants.
Here's how dataflow patterns show up in your everyday work. If you wrote functions to compute the max and sum of an array, you'd probably write something like this:
int computeMax(int[] arr) { int max = Integer.MIN_VALUE; for (int i = 0; i < arr.length; i++) { if (arr[i] > max) max = arr[i]; } return max; } int computeSum(int[] arr) { int sum = 0; for (int i = 0; i < arr.length; i++) { sum += arr[i]; } return sum; }
But if you've ever had to compute both a max and a sum, and didn't feel like finding a library function, you may have just banged them out together in one loop, like this:
void foo() { ... int[] arr = ...; int sum = 0, max = Integer.MIN_VALUE; for (int i = 0; i < arr.length; i++) { int x = arr[i]; sum += x; if (x > max) max = x; } ... }
It's not so straightforward to extract out the computeSum
from that code. The loops are fused together, and there's an extra variable to throw a wrench in the works.
Yet if you were to compare the dataflow graphs of both functions, you'd see the embedding of computeSum
in foo
, plain as day. There's still a value being read from the array and summed onto a variable. Here's the dataflow graph of computeSum
:

And here's foo
:

There are several lessons to draw from this example. One is that units of code smaller than a function can be semantically meaningful, and are often better building blocks. Code like the loop in foo
often arises when a programmer gets dissatisfied with the performance of two separate functions doing their own loops. Yet, if the computation was exported in pieces, as the loop body and the initialization, then they could be combined modularly. This is the insight behind the foldl library for Haskell.
But the biggest lesson is a reminder of the primacy of data structures over code. Even as we scramble and fuse the code, the basic relationship of reading from one source and doing a computation on it remains stable, and so we can recognize a program that's computing a sum even in the midst of spaghetti. In the words of Linus Torvalds: Bad programmers worry about the code. Good programmers worry about data structures and their relationships. So think in terms of data.