Bad Machinery Paper
Wed 19 May 2021
Thanks to my accommodating erstwhile colleagues, I'm able to publish the original paper (with minor redactions so it makes sense for an external audience). A version of this paper later became Chapter 29 of the SRE Book.
To add a little extra colour: the document came out of lessons learned on Storage SRE (specifically Bigtable/Colossus), a team I ran directly back in the early 2010s -- the date I have on this document is 2016, but I know it was around long before that.
Storage SRE back then was something of an anomaly -- there were a number of teams managing storage services (the above, plus various other services, some of which were purely internal and never saw the light of day). Amalgamating these services meant we potentially won a bunch of OMG synergy, because the base design assumptions were unique to services that kept data. Google started out as a search company with a throwaway, re-scrape-able corpus (built on GFS, which wasn't designed to persist data), so it was natural for us to have a couple of goes at stateful services and see what stuck.
So, while you might imagine that lumping services of a similar shape together is a good way to reduce duplication of effort (and you'd be mostly right), other factors matter too: service maturity, generalised attitude toward production, and even the vagaries of the dev/SRE relationship, or of engineering leadership as a whole.
So around then, we had a Bigtable and Colossus service that generally worked remarkably well. We had challenges in observability, and in developing the tech around load-sharing of users on a shared compute/storage pool. This later became known as running a multi-single-tenant architecture, although at the time we saw it as fairly chaotic -- not all the tooling had been built, and some of the assumptions and prior art we have the benefit of today weren't a thing yet.
While others looked at this as a purely technical problem, I and a couple of key leads saw it as a problem of focus. Google in its earlier days had faced additional problems of scale like this -- where essentially we could keep digging, or we could put in place a set of attitudes and approaches that enabled us to get out of the hole we were in as far as interrupts go. Some of these problems fell into the "We will need 5000 people to edit config files" or "We will need more hard drives than humanity is making" variety, which really do need creative solutions.
One of the anti-patterns we observed at the time was a tendency among folks on both Dev and SRE teams to concentrate on speedy detection and resolution of issues (the 'Rocks with Eyes' method, as it was succinctly described at the time). Being able to know when things are busted is important -- but the follow-up is key. Oncall and interrupts are fundamentally a waste of time and should be minimised, so signing up for more of them is not a great approach.
I had a separate paper related to this one, called something like 'email alerts are from the past' -- and my real-life experience of that was coaching some key team members through some of the above approaches. We had folks who felt they needed to keep getting email alerts: they had trained themselves to spot patterns, and believed that if they stopped we'd have more outages. You might step back from this a wee bit and recognise that this was a set of people who were inadvertently being profoundly disrespectful of their own time. It's not because they were bad people, of course -- it's because they were busy people, and firefighting had found its way into their muscle memory.
In the end (in this example), the senior TL on the team and I ended up unilaterally turning off email alerts. She sent me the code review, I approved it, done. It was clear we weren't going to reach consensus on whether it was a good idea; and while I'm not a fan of this approach in general, sometimes you have to act anyway (see also the 'consensus is nice' section). In this case, nothing happened. A few folks were briefly sad, but ended up with more time on the clock and more spoons.
That, to my mind, is the outcome of a successful strategy around interrupt management. The interrupts need to be addressed, and in a timely way -- but you, as the team responsible, hold the ability to do so in a way that doesn't treat people as machines. In addition, interrupts are not a closed system: the interrupts you're getting a month from now need to be different from the ones you're getting today -- by finding a better set of first world problems.
But more than that, the way you give yourself the ability to step back and really address the root causes in a noisy system is to respect your time, reduce context switches, and give people both the wall-clock time and the spoons to be able to do so.