What is Site Reliability Engineering?
SRE is a term (and an associated job role) for the engineers whose job is to enable the business they work for by making a service run smoothly and reliably. This may be what operations folks like myself thought we’d been doing all along (mom, can I be called an SRE too?), but a few key differences define the SRE approach (mostly, in this case, shamelessly lifted from The Site Reliability Engineering Workbook):
Operations is a Software Problem: The basic tenet of SRE is that doing operations well means treating it like a software problem. SRE should use software engineering approaches to solve problems in operations.
Manage by Service Level Objectives (SLOs): SRE does not attempt to give everything 100% availability. Instead, the product team and the SRE team select an appropriate availability target for the service and its user base, and the service is managed to that SLO.
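To make the SLO idea a bit more concrete, here is a minimal sketch of the arithmetic behind managing to an availability target. The "error budget" framing, the function names, the 99.9% target, and the 30-day window are my own illustrative assumptions, not figures taken from the Workbook or from any particular AEM deployment.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO allows over a window.

    slo is a fraction, e.g. 0.999 for "three nines" (an assumed target).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent.

    A negative result means the service has missed its SLO for the window.
    """
    budget = error_budget_minutes(slo, window_days)
    if budget == 0:
        # A 100% target leaves no budget: any downtime at all is a breach.
        return 0.0 if downtime_minutes == 0 else float("-inf")
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% target and 20 minutes of downtime so far.
    print(f"budget: {error_budget_minutes(0.999):.1f} min per 30 days")
    print(f"remaining: {budget_remaining(0.999, 20.0):.0%}")
```

The point of the exercise is that a target below 100% leaves a known, finite amount of allowed unreliability, which the product and SRE teams can choose to spend on releases, maintenance, or risk.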
Running an on-premises or self-hosted Adobe Experience Manager (AEM) environment comes with some intrinsic factors that, in my experience, pull teams away from an SRE-style culture and toward traditional, old-school “Ops”. That means leaning more toward reactive operations and less toward continually automating yourself out of a job and working to minimize “toil”.
The AEM installations I’m working on at this very minute tend to be very manual and run by traditional operations methods: the application, web servers, and load balancers are all installed and configured by hand, with little-to-no configuration-as-code. In this day and age that might seem shocking, but there are several reasons why companies use such a dated and problematic operating model for such an important and expensive system, one that almost invariably costs millions to deploy and maintain.
It’s unfortunate, but in my experience the AEM licensing model is a big root cause of why companies haven’t moved to a more DevOps-ish approach to their AEM installations. A large majority of the installations I’ve run have been 1-Author / 2-Publisher systems, mostly because that was the most the company was willing to shell out for. If you’ve only got two Publishers (the application server in an AEM environment) and never will have more, there’s diminished benefit to automating things like instance provisioning and auto-scaling logic, as you’ll always have the same two persistent Publishers.