Plus Ça Change Management: 20 Years of SRE
Fri 19 July 2024
On July 19th 2004, I spent my first day at Google. I showed up at the Datacenter in Dublin, since that's apparently where SREs and production folks were going to sit. There wasn't a corp network connection, because this was Google in 2004. A couple of days later, I moved my stuff to Barrow Street and commandeered a desk. A few weeks after that, someone from facilities asked "Hey, isn't your team based in the DC?". "Nope", I said. They shrugged, and updated some spreadsheets, and then SRE was based in Barrow Street. Again, this was Google in 2004. We were making it up as we went along, in the best possible way.
I could claim I was doing SRE-adjacent things before this, as I suspect could many people, but I'm going with that day as when I became an SRE. The function was still figuring itself out; in many ways it still is. My work with Busy Teams as a freelancer isn't "SRE work" on paper, but in practice, it very much is. Resilience, defense in depth, common sense, backup plans for backup plans. Across industry, SRE/Reliability/Devops/ProdEng/Whatevsies is many things to many people. So, in the spirit of the core of the function, I've been thinking a little about uncomfortable gaps in capability. What have we not figured out yet?
Here's my short list of boiling hot takes:
-
Most companies haven't figured out how much they steady-state care about reliability. I expand on this in "6 Reasons you Don't need an SRE Team".
-
Assuming we've figured out that we care, we still can't agree how hard a problem this is. SRE (and traditional ops) begat DevOps, which at its core has the premise that busy developers could side-gig the production bits, and you can just wave your hand and say "You build it, you run it" and that'll cover you. This isn't true today, and was even less true when 'DevOps' started being a thing. If I squint my eyes, I can see a course correction happening with "Production Engineering", which has at its premise a lot more sensible of an acknowledgment that this whole area is hard (as in, requires smartness, innovation, and Real Engineering(tm)) as opposed to difficult (as in, it's boring and I don't want to do it). These are glacial shifts; it's taken more than 20 years for us to go in this circle back to acknowledgment that this is a real and specialised set of problems.
-
We actually run less and less of our own infra, and practices need updating. For a brief period before AI was suddenly what all startups are about, there were (and still are!) great startups producing big parts of your tool-chain as SaaS/PaaS. This is great for not having to build in in-house expertise in that particular area, but can leave you dead in the water if you don't have a good strategy around vendor management. I spoke about this a bit at SRECon EMEA 2023, and I do feel like a lot of our practices involve declaring an outsourced part of our tool-chain to be an opaque cuboid, and then not having defense in depth for when it goes away (temporarily or permanently). Many of the SRE practices as set out in various books/articles kind if assume you own your whole stack. This is becoming mostly untrue, and we are currently a Frog of Moderately Troubling Temperature here.
Anyway. It's Friday, it's 22 degrees outside in Dublin, and it's time to go enjoy the next 20. No shortage of things to do.