Arch-Engineer
A Serialization Format is not a Programming Model
A student approached me last month with an architectural problem. His company, a large and successful consumer website, was sharing information wantonly between microservices. Servers running marketing software were getting packets containing users' credit card and social security information, and the company was getting nervous about accidental leaks.
As he progressed in describing the problem, I started to explain that he had a general concept of user information, that it would help to make it explicit in the code, and that there was a clean way of doing so...
...and I found he couldn't use it. For this technique required changing their data structures to use a few interesting features of their programming language. But their internal data structures weren't written in a general-purpose programming language. They were written in an interface-description language, the one used to describe messages between servers.
Within the space of a few days, I saw multiple others make a similar mistake in different contexts. And so, I need to say it loud and bold:
Data defined in an IDL is a serialization format. And a serialization format is not a programming model.
What you lose with serialization formats
Obviously, general-purpose languages and IDLs are designed for different things, but there are two major things you lose when using a serialization format to represent data.
The first is pointers. Pointers mean locality. If you serialize a user with an account, you write down their account ID. If you use this "user has account ID" structure in application code, all of a sudden it's impossible to write code that looks up a user's account without also having access to a global store that maps every account ID to its account. Not good.
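Here is a minimal sketch of that difference in Python. Every name in it (WireUser, User, Account, and the two balance functions) is hypothetical and invented for illustration; none of it comes from the student's codebase.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Account:
    account_id: int
    balance_cents: int


@dataclass
class WireUser:
    """Serialization-shaped (hypothetical): the user only carries an ID."""
    name: str
    account_id: int


@dataclass
class User:
    """Programming-model-shaped (hypothetical): the user points at its account."""
    name: str
    account: Account


def balance_of_wire_user(user: WireUser, store: Dict[int, Account]) -> int:
    # The ID means nothing on its own, so every caller must also be
    # handed a store that maps account IDs to accounts.
    return store[user.account_id].balance_cents


def balance_of_user(user: User) -> int:
    # The reference is local: no store, no lookup, no extra parameter.
    return user.account.balance_cents
```

The extra store parameter on the first function is the loss of locality: anything that touches a WireUser's account needs the whole store threaded through to it, or reaches for a global.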
The second is types. Serialization formats have types too, but the interesting uses of types give distinctions between things that have the same runtime representation. A user's ID and age must never be confused, and so it helps to give them different types. But, in serialization, they're reduced to their binary representation: an integer.
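Python's typing.NewType gives a small illustration of this, assuming a static type checker such as mypy is part of the workflow. UserId, Age, and can_vote are hypothetical names chosen for the sketch.

```python
from typing import NewType

# Two distinct static types that share one runtime representation: int.
UserId = NewType("UserId", int)
Age = NewType("Age", int)


def can_vote(age: Age) -> bool:
    # Hypothetical rule, just to have a function that wants an Age.
    return age >= 18


user_id = UserId(42)
age = Age(30)

# At runtime both values are plain ints, which is all a
# serialization format would record about them.
both_are_ints = type(user_id) is int and type(age) is int

# A static type checker rejects this call, because a UserId is not an Age.
# At runtime it executes without complaint, since the distinction is gone.
confused = can_vote(user_id)  # flagged by a type checker, runs anyway
```

The distinction between UserId and Age exists only in the type annotations. Once the data is reduced to what goes over the wire, nothing is left to stop the two from being swapped.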
Real-world examples
What were the instances of this mistake? I took to Twitter:
Today, in "serialization formats are not programming models": Do not base your code's design around a REST API. REST is just a way of sending messages across HTTP. Programming with REST is programming with global variables. #softwaredesign
— Jimmy Koppel (@jimmykoppel)
Several students did this in design exercises I assigned. They always had bugs like: starting to create a new item and then hitting "cancel" would create it anyway. REST API means IDs, IDs mean a global data store, and a global data store means nothing exists in isolation.
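One plausible way the cancel bug arises can be sketched as follows. This is an assumption about the shape of the mistake, not the students' actual code; ItemStore and both edit functions are hypothetical, with ItemStore standing in for whatever sits behind the REST API.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Item:
    title: str


class ItemStore:
    """Hypothetical stand-in for the global store behind a REST API."""

    def __init__(self) -> None:
        self._items: Dict[int, Item] = {}
        self._next_id = 1

    def create(self, item: Item) -> int:
        item_id = self._next_id
        self._next_id += 1
        self._items[item_id] = item
        return item_id

    def set_title(self, item_id: int, title: str) -> None:
        self._items[item_id].title = title

    def count(self) -> int:
        return len(self._items)


def edit_id_first(store: ItemStore, title: str, cancelled: bool) -> Optional[int]:
    # REST-shaped: editing needs an ID, so the item is created up front.
    item_id = store.create(Item(title=""))
    if cancelled:
        return None  # nothing deletes the item, so it exists anyway
    store.set_title(item_id, title)
    return item_id


def edit_draft_first(store: ItemStore, title: str, cancelled: bool) -> Optional[int]:
    # Value-shaped: the draft is a local object that exists in isolation.
    draft = Item(title=title)
    if cancelled:
        return None  # the draft is simply dropped
    return store.create(draft)
```

In the first version, cancelling correctly requires remembering a compensating delete. In the second, cancel needs no code at all, because the draft never touched the global store.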
There is one good reason to program with a serialization format: it makes serialization faster. But it can also make computation slower, thanks to the loss of pointers. Adding in the code quality costs, the tradeoff is rarely worth it.
So what can we do?
So, back to the original guy, whose company built their codebases around serialization formats. What can he do?
Lipstick on a pig is all I can offer. There's a lot they can do to mitigate the damage, but they're living with a flawed design. Either everything they do is harder and they start accidentally breaking privacy laws, or they refactor the whole codebase to undo this mistake. It's a choice between heart problems and heart surgery.
There's only so much I can do for this old codebase, and so I look to the new. I'm trying to build a world where software doesn't need heart surgery. I hope this E-mail helps keep your codebase away from the surgeon's knife.
Happy coding,
Jimmy Koppel