A few years back, someone (badly) explained Google’s operations team structure to me, telling me that Google called them “SREs,” that it meant developers were on call, that nobody actually did operations work, and that their whole world was just developers on developers, and then developers all the way down. As a multi-decade, traditional “ops” guy, I dismissed it as fanciful at the time, though I did start to notice companies all over the place calling their “ops” folk “SREs” regardless of whether that changed their job description. I gradually corrected my misperception of what site reliability engineering is, but only recently made time to read the O’Reilly Site Reliability Engineering book, thanks in part to the Audible version that I could listen to on long bike rides. I wanted to share my thoughts on how this relates to infrastructure for portly and natively cloud-unfriendly CMS implementations like Adobe Experience Manager.
What is Site Reliability Engineering?
SRE is a term (and an associated job role) for the engineers whose job it is to enable the business they work for by making a desired service run smoothly and reliably. This may also be what operations folks like myself thought we’d been doing all along (mom, can I be called an SRE too?), but there are a few key differences that define the SRE approach (mostly, in this case, shamelessly lifted from The Site Reliability Engineering Workbook