
Arch-Engineer

By Mirdin

3 Unusual Code Quality Ideas

Next-Generation Semantic Code Search

Ever need to refactor something in a large codebase, and have trouble finding all the places that need to be changed? Today, I'm proud to unveil an exciting project we've been working on for a few years.

This morning at PLDI, my master's student Pond Premtoon and I presented our paper "Semantic Code Search via Equational Reasoning," featuring our new Yogo tool (You Only Grep Once). From a query written in its high-level language, Yogo can find nearly any equivalent code in a large codebase.

Yogo builds on an old insight I've previously written about: even when there are many variations on how to write something, the dataflow graphs may stay the same. Combined with a powerful equational reasoning engine, Yogo can match code even if it uses a different API, or different but algebraically equivalent expressions. Combined with the ability to identify abstractions, one can write queries that match any way to, e.g., iterate through a sequence or read from a file, to the point where a single query can match code in multiple programming languages.

The 1-minute teaser video is above, the full talk is here, and the paper is here. A Docker container running this tool is available here. This is still a research tool, but we plan to invest more into it, and are excited to talk to anyone interested in using it.

Three Small Code Quality Ideas That Make a Big Difference

In my courses and writings, I've explained some very big ideas about how to design systems for better maintainability and extensibility. But I've also developed several smaller ideas, which I present here for the first time. In order, from lowest-level to highest-level:

Grouping in Files and Functions

You have a 30-line method. There are chunks that do smaller things. Do you break it up into smaller methods? If you do, then you've properly labeled the subpieces, but added some indirection. If you don't, the reverse holds. Also, breaking it into pieces may result in methods with a lot of parameters.

But you can eat your cake and have it too. I've become a fan of this style of programming, available in any curly-braces language, where you use labeled blocks to group code at a sub-function level.

void initDriver() {
  void *dmaAddress;

  findAddress: {
    // ...
  }

  sendStartupMessage: {
    // ...
  }

  sanityChecks: {
    // ...
  }
}

If you wrote this function by thinking "First find the address, then send startup messages," then labeling these blocks preserves this part of the design process, an instance of the Embedded Design Principle. If you ever do decide to break out sub-functions, then there's no need to figure out how; the thinking has already been done. As an added benefit, this style makes it clear which variables are long-lived throughout the function body, and which are only used locally.
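To make the scoping benefit concrete, here is a minimal sketch in C. The function summarizeMessage and everything inside it are hypothetical names invented for illustration; the only point is that sum and summary are visibly long-lived, while i and truncatedLen cannot leak out of their blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sums the bytes of a message and packs that sum
 * together with the (truncated) length into one 32-bit word. */
uint32_t summarizeMessage(const uint8_t *data, size_t len) {
  assert(data != NULL || len == 0);

  uint32_t sum = 0;   /* long-lived: written in one block, read in the next */
  uint32_t summary;   /* long-lived: the function's result */

  computeChecksum: {
    for (size_t i = 0; i < len; i++) {  /* i exists only inside this block */
      sum += data[i];
    }
  }

  packSummary: {
    uint32_t truncatedLen = (uint32_t)(len & 0xFFFFu);  /* block-local */
    summary = (sum << 16) | truncatedLen;
  }

  return summary;
}
```

One practical caveat: since nothing jumps to these labels, GCC and Clang report them under -Wunused-label (which -Wall enables), so a project adopting this style would probably want to turn that warning off.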

A similar idea applies when placing functions in a file. I like to break them into subcategories, with a nice block comment separating them. Here's how it looks in a file implementing an interpreter for a DSL:

/*********** Term definitions *********/

class Expression { .... }

class Statement { .... }

/***** Helpers for creating terms *****/

Expression exprFromString(.....) { ... }

/*************** Execution ************/

Value lookupVariable(.....) { .... }

Address resolveJumpTarget(...) { ... }

void evalStatement(...) { .... }

/************** Type checking ********/

void checkExpression(...) { ... }

void checkStatement(...) { ... }

Unlike the previous idea, each labeled cluster may not correspond to a distinct design intention. Nonetheless, this helps greatly in quickly digesting what is in a file. And if you ever want to split this file into several, the thinking has been done for you.

Name Catalogues

A few weeks ago, on my Advanced Software Design Slack, we had a long discussion about abbreviations in code. Many were against them, and some were for, but eventually we reached the consensus that a name's length should be proportional to the size of its scope, and that the real issue, rather than abbreviations vs. not, is how readily someone can understand the name, which depends on the number of potential readers.

But, since then, I realized something deeper about names.

I've spent a good chunk of the last few weeks ramping up on the codebases of several large preexisting static analyzers. I've long advised that people understand codebases by understanding the data structures, which I followed myself. And while this worked well on one of the codebases, for some of the others, I still found myself lost even after doing that.

And then I realized: even though I read every single data structure definition, I didn't actually understand what they meant. I can read that "a cluster link has a creation time, send buffer, receive buffer, and connection," and have a pretty good idea of what a cluster link is and what you can do with one, but this codebase made heavy use of terms like "leaf" (and not in the tree sense), as well as vague terms; it did me no good to learn that a "SharedText" has an ID, contents, and kind.

There are several pieces of documentation I recommend for newcomers navigating a codebase. The first one is a guide to key files and data structures. But this experience sparked the idea of introducing a name map: an explanation of commonly used terms in the codebase.

I have already put this into practice in a minor fashion by cataloguing the naming scheme in my Cubix framework. I have yet to try this idea in anger and see its effects, but it has high promise.

Make The Debug and Real Behavior Differ

APIs have a way of lasting decades. The more widely used, and the more interoperability required, the more true this is. If you've ever had a bad experience with an airline, chances are at least part of it was due to the airline's SABRE system, with APIs that have been around for 60 years.

When someone's program depends on behavior of your API that's not guaranteed, you may never be able to change that behavior.

I recently did some consulting in API-design, and used this idea heavily. But there's one instance of it I came up with which I find particularly clever.

Consider the lowly C structure:

struct Foo {
  int a;
  int* b;
  int c;
};

What promises does this make? That there is a structure called Foo, with fields a, b, and c. What behaviors does it have? Well, commonly, on a 32-bit system, it will also have the behavior that Foo is 12 bytes, and a, b, and c are at offsets 0, 4, and 8. If anyone's system relies on this behavior — say, by allocating a buffer of exactly 12 bytes — then it will break when you try to remove or reorder a field, or compile for 64-bit.
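As a rough sketch of the difference between the promise and the incidental behavior, the C below asks the compiler for sizes and offsets instead of hard-coding 12, 0, 4, and 8. The helper names fooSize, fooOffsetOfC, and fooCopyToBuffer are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct Foo {
  int a;
  int *b;
  int c;
};

/* Ask the compiler rather than assuming "12 bytes" and "offset 8". */
size_t fooSize(void) { return sizeof(struct Foo); }
size_t fooOffsetOfC(void) { return offsetof(struct Foo, c); }

/* Copies a Foo into a caller-provided buffer. Returns 0 on success and
 * -1 if the buffer is too small for this build's layout. A caller that
 * always passes a 12-byte buffer works on a typical 32-bit build and
 * fails on a typical 64-bit one, where the pointer is 8 bytes wide. */
int fooCopyToBuffer(const struct Foo *foo, unsigned char *buf, size_t bufLen) {
  assert(foo != NULL && buf != NULL);
  if (bufLen < sizeof *foo) {
    return -1;
  }
  memcpy(buf, foo, sizeof *foo);
  return 0;
}
```

Code written this way survives a layout change, but nothing forces callers to write it this way, and a hard-coded 12 keeps working until the day the layout moves.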

Here's my trick for preventing this:

struct Foo STRUCT(
  int a,
  int* b,
  int c
);

Here, STRUCT is a macro which behaves differently for debug and production builds. For production builds, it is the same as the previous snippet. But for debug builds, it reverses the order of the fields, and prepends some padding.

If a user tests their code against both the debug and production builds, then anything which depends on the structure size will break, as will code which depends on field offsets.
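One way such a macro might be written is sketched below. This is an assumption-laden illustration, not necessarily the real definition: it handles exactly three fields, the flag SCRAMBLE_LAYOUT and the names STRUCT_RELEASE, STRUCT_DEBUG, debugPadding_, FooRelease, FooDebug, and fooSum are all invented here, and the 16 bytes of padding are an arbitrary choice. A general version would need variadic-macro machinery.

```c
#include <assert.h>
#include <stddef.h>

/* Production layout: fields in declaration order. */
#define STRUCT_RELEASE(f1, f2, f3) { f1; f2; f3; }

/* Debug layout: leading padding, then the fields in reverse order. */
#define STRUCT_DEBUG(f1, f2, f3) { char debugPadding_[16]; f3; f2; f1; }

/* SCRAMBLE_LAYOUT is a hypothetical build flag for debug builds. */
#ifdef SCRAMBLE_LAYOUT
#define STRUCT STRUCT_DEBUG
#else
#define STRUCT STRUCT_RELEASE
#endif

struct Foo STRUCT(
  int a,
  int *b,
  int c
);

/* Both layouts side by side, only so they can be compared in one build. */
struct FooRelease STRUCT_RELEASE(int a, int *b, int c);
struct FooDebug STRUCT_DEBUG(int a, int *b, int c);

/* Client code that names its fields behaves the same under either layout. */
int fooSum(const struct Foo *foo) {
  assert(foo != NULL && foo->b != NULL);
  return foo->a + *foo->b + foo->c;
}
```

Under this sketch, positional initializers such as {1, &x, 3} also stop compiling cleanly or initialize the wrong fields in the debug layout, which flags one more layout dependency; assigning fields by name is unaffected.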

Causality in Programming

It turns out I do things other than programming tools. A few months ago, after giving a seminar at the University of Chicago, Matt Teichman ambushed me to appear on his philosophy podcast, Elucidations. I sat down with him to explain some of my work with Zenna Tavares and Xin Zhang on automated counterfactual reasoning, and more generally to discuss the theory of causal and counterfactual inference and how it can be used in programming and programming tools: https://elucidations.now.sh/posts/episode-125/

Somehow, different research interests have a way of coming together. Together with Daniel Jackson, I've written a new paper arguing that "dependence" in software engineering is a form of causality (specifically "actual causation"), and using this insight to present the first formal definition of dependence, which can hopefully one day be used to write tools that, e.g., tell you if your code depends on any special features of MySQL, or if it can be easily switched to PostgreSQL. Stay tuned!