Sign Up Today
☀️ 🌙

Arch-Engineer

It's not magic. It's obvious.

Software Design Tip: Fancy Simplification is Obvious

I was recently recommended Raymond Hettinger's PyCon 2015 talk on code quality. If you're like me and you don't want to spend 50 minutes watching this talk, then I recommend turning on closed captions and turning the speed up to 4x — or just read my summary below.

The talk is commentary on a refactoring he did. He starts with this 20-line program, and explains small tweaks one might make to make it conformant with Python coding guidelines.

"How many people would check this code in?" he asks. Hands go up.

His trap is sprung, as he shows a 3-line program equivalent to the 20-line example.

with NetworkElement('171.0.2.45') as ne:
    for route in ne.routing_table:
        print "%15s -> %s" % (route.name, route.ipaddr)

He spends the rest of the talk explaining how this example works.

But it's not enough to know how to do this refactoring. You need to become one of the people whose hand wouldn't go up.

This second refactoring is a bit different from the first one, as he also had to change the API used. There are a handful of simple heuristics that can guide you to the changes he made: prefer short names, and iterate over collections instead of indices. The latter is an example of a more general operation called "fusion," and also a small example of information hiding: hiding that elements of the routing table have a meaningful index.

But the largest change by far is from the try-except-else-finally control structure to the simple with statement. Plenty of programmers might see code like sqrt(x*x+y*y) pop up and think "Maybe I should break this into its own function?" This is factoring out common subexpressions. But equally important is seeing code like getRoutingTable(); <.... do stuff...> releaseRoutingTable() and thinking to break the outside code into its own higher-order function. This is factoring out common super-expressions, and it's an alarm bell that should trigger when presented with Hettinger's example.

Now suppose you knew that getRoutingTable() would only be called once? Thing is, the try-except-else-finally control structure is itself an oft-repeated pattern, the one that Python's with construct itself is meant to abstract. Separating the logic to acquire and release a resource from the code that uses the resource is a pattern seen in many languages. It's a pattern seen with the AutoCloseable interface in Java 7, the RAII pattern in C++ and Rust, methods like File.open in Ruby, and the bracket function in Haskell. It's a pattern that goes at least back to the 80's, when Common Lisp implementations added the with-open-file macro.

Look for common control structures and abstract them out. There really aren't that many such patterns in common use. Internalize these principles, and you'll have an internal force that makes you unable to write bad code, and you'll shine in code reviews. Unless you work for me — I'd insist you fix the hard-coded IP address first.

Research Corner: Query by Synthesis

Sometime in the last few weeks, I accidentally morphed into a databases researcher. Right now I have a 1707-page document describing SQL sitting on my desk, and today I'd like to share some work automatically refactoring database code.

Query by Synthesis diagram

Programmers are pretty used to programming with lists, happily iterating over them just to find one element. Carry that habit over to an ORM like Hibernate, where "getUsers()" is actually an expensive database call, and you're in for a lot of wasted CPU cycles.

Enter QBS. Alvin Cheung's "Query By Synthesis" system takes code that iterates over database-mapped collections, and turns them into complex SQL queries that do the same thing. It does this by inferring a logical formula describing the state of the program at every point in the code to be transformed. Once it's found a logical formula describing the output, it spits it back out as SQL. Now all that computation can be done inside the database with its fancy query optimizer, and your network can be spared communicating unnecessary results.

QBS transformation example

This automated refactoring leaves us with a couple lessons about manual refactoring. To many programmers, refactoring is a series of small steps. Break up that method, merge those two lines, use a list of pairs instead of a pair of lists. This approach can perform miraculous reductions to your code, but, like in the example in Raymond Hettinger's talk, when one of those small steps is actually a great leap, it can also leave you stuck in a corner.

QBS works differently: it tosses the existing code out entirely, and creates something entirely new based on the semantics. Try and see if you can break this transformation into a series of steps. It shows why many of the best refactorings require taking a step back from the current code, asking "what is this actually trying to do," and going from there. (The QBS transformation actually does decompose into smaller steps, but those steps are transformations between code and logic, not between code and code.)

But there's another, more insidious, lesson to be had from this example.

Take a close look at the output. There's something there that maybe shouldn't be there: ORDER BY u.roleid, r.roleid.

The original ORM code iterated through the lists by order of ID, and the output kept that order. Is this a coincidence, or intended behavior? QBS doesn't know, so it has to be conservative and keep it.

And, as those who have attended my workshop on Code Knowledge already know, it's literally impossible for QBS to tell just by looking at the source code. Suppose this list of users was going to be displayed on a webpage. Is it a bug if they suddenly start appearing in a different order, or is it simply different behavior?

This is how code can become entangled with distant parts of the codebase despite the apparent lack of a connection. This is why you need to make it clear which behaviors of your code may be relied upon and which are mere happenstance, either through documentation, or by using data structures with more or less specific invariants. For instance, if you wanted to make clear in this example that the returned order doesn't matter, you could return the users as a Set, or even as a Collection. Modularity comes from the gap between an implementation's behavior and its spec's guarantees. Be specific in your implementation, but vague in the spec.