
Site Reliability Engineering Handbook

The document discusses site reliability engineering (SRE) philosophies and principles. It explains that SRE is a cross-functional role that assumes responsibilities traditionally held by development, operations, and other IT groups. At Google, SRE teams focus on hiring software engineers and aim to balance system reliability with rapid innovation through automation. The goals of SRE are to make systems reliable enough without over-engineering reliability at the cost of new features or technical debt. SRE teams track key metrics like availability, latency, throughput and other "Golden Signals" that define system health rather than just focusing on uptime.

Uploaded by

shailesh.dyade

Site Reliability Engineering
Philosophies, habits, and tools for SRE success

Table of Contents
Introduction
Chapter 1: SRE Philosophy and Principles
Chapter 2: What Makes an SRE Successful?
Chapter 3: SRE Tools and Processes
Chapter 4: The Evolving SRE Role at New Relic
Execution

Introduction

The day-to-day responsibilities of developers and operations engineers are increasingly evolving as high-growth companies look for new ways of improving stability, reliability, and automation-first practices. Because of the need to reduce downtime (with less manual intervention) as systems scale, many organizations are adopting the site reliability engineer (SRE) role.

The phrase “site reliability engineering” is credited to Benjamin Treynor Sloss, vice president of engineering at Google. Sloss joined Google in 2003 and was tasked with building a team to help ensure the health of Google’s production systems at scale. According to Sloss, site reliability engineering is “what happens when you ask a software engineer to design an operations function.” Site reliability engineering is a cross-functional role, assuming responsibilities traditionally siloed off to development, operations, and other IT groups.

Sloss’s team wrote the original book on site reliability engineering, so if you’re wondering what a great modern SRE practice should look like in a DevOps world, the Google Site Reliability Engineering book is a fantastic point of reference.

In it, Sloss writes, “It is a truth universally acknowledged that systems do not run themselves. How, then, should a system, particularly a complex computing system that operates at a large scale, be run?”

Google’s answer has been to hire software engineers to do the work usually handled in traditional organizations by IT operations teams. “Our site reliability engineering teams focus on hiring software engineers to run our products and to create systems to accomplish the work that would otherwise be performed, often manually, by sysadmins,” explains Sloss.

From Google to the rest of the world

After the book was first published, the role was rapidly adopted in a wide range of companies, prompting technology news and analysis site TechCrunch to wonder, back in 2016, “Are site reliability engineers the next data scientists?” The following year, LinkedIn named SRE “one of the most promising jobs” in tech. Speaking in 2018, Beth Long, a software engineer at Jeli, told us, “My impression is that there’s a slow trickle-down to smaller companies. Google, and Netflix, and Amazon, and Heroku: these companies have had SREs for a long time, because they have the resources and the scale that demand it. You’re starting to see that role appear in smaller companies where they realize ‘Oh, we need someone to play this role.’”

Three years on, this remains true. As more organizations are building distributed microservice-style systems that run at scale, the demand for SREs remains higher than ever.

Starting your SRE journey

It is important to note that the SRE role will vary considerably from one organization to another. While job descriptions and day-to-day tasks for SREs vary, the role’s utility is quickly becoming apparent to those software organizations that have adopted it. So, where does that leave you?

Whether you’re still figuring out how to create a site reliability practice at your company or trying to improve the processes and habits of an existing SRE team, the more you know about the subject, the better, especially since what works for a massive company such as Google might not work for a small or midsize outfit. To that end, this ebook shares the philosophies, habits, and tools of successful SREs, along with New Relic’s definition, guidelines, and expectations for the role.

Chapter 1: SRE Philosophy and Principles

Google defines an SRE as an operationally minded software engineer, but what does that mean? At Google, SRE teams are responsible for both capacity planning and provisioning. The teams are different from purely operational teams in that they seek software engineering solutions to problems. To enforce this, Google caps the amount of time SREs spend on purely operational work at 50%. This means that, at a minimum, 50% of a Google SRE’s time should be allocated to engineering tasks, such as automation and improvements to the service.
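
If a team logs how it spends its time, the 50% cap is easy to check mechanically. The following is a minimal sketch rather than anything Google publishes: the function names, the hour-based inputs, and the `cap` parameter are all assumptions made for this illustration.

```python
def operational_share(ops_hours: float, engineering_hours: float) -> float:
    """Return the fraction of logged time spent on operational work."""
    total = ops_hours + engineering_hours
    if total <= 0:
        raise ValueError("no hours logged")
    return ops_hours / total


def exceeds_ops_cap(ops_hours: float, engineering_hours: float, cap: float = 0.5) -> bool:
    """Return True when operational work takes more than the cap (50% by default)."""
    return operational_share(ops_hours, engineering_hours) > cap
```

A team that logged 30 hours of operational work against 10 hours of engineering work in a week would be over the cap, which is the signal to push work back toward automation.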

The goals, risks, and trade-offs of Site Reliability Engineering

When first thinking about an SRE team’s role, you might assume that increasing reliability, generally measured by monitoring system uptime, would be the primary goal, but beyond a certain point that turns out not to be the case. This is because factors outside of the SRE teams’ control come into play, such as network reliability. There is also a trade-off between reliability and development team velocity.


Because of this, site reliability engineering will generally seek to balance the risk of unavailability with the goal of rapid innovation and efficient service operations. In the Google SRE book, Marc Alvidrez writes, “We strive to make a service reliable enough, but no more reliable than it needs to be. That is, when we set an availability target of 99.99%, we want to exceed it, but not by much: that would waste opportunities to add features to the system, clean up technical debt, or reduce its operational costs.”

One way to think about and manage this trade-off is to consider where a given product or service is in its life cycle.

For a relatively young product, setting a stringent goal for the uptime of a service will likely be counterproductive, because it will reduce the pace of innovation and experimentation in an undesirable way. Conversely, as a product reaches maturity and has a base of customers that depends on it, downtime becomes more problematic and can potentially have a direct impact on the service provider’s bottom line. At this point, increasing the target for uptime makes sense.
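
To get a feel for what an availability target means in practice, it helps to convert it into allowed downtime. The snippet below is a small sketch of that arithmetic; the function name and the choice of a 30-day window are assumptions for the example, not something the SRE book prescribes.

```python
from datetime import timedelta


def allowed_downtime(target: float, period: timedelta) -> timedelta:
    """Return how much downtime an availability target permits over a period.

    target is a fraction, for example 0.9999 for "four nines".
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be a fraction in (0, 1]")
    return period * (1.0 - target)


thirty_days = timedelta(days=30)
four_nines = allowed_downtime(0.9999, thirty_days)
three_nines = allowed_downtime(0.999, thirty_days)
```

Over a 30-day window, a 99.99% target leaves roughly 4.3 minutes of downtime, while 99.9% leaves roughly 43 minutes. That tenfold difference is the room a younger product gains for experimentation by choosing the looser target.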


Golden Signals

While measuring service availability is a good starting point, particularly for user-facing services, SRE teams will typically have several other business-oriented key metrics that they also track. These metrics, which often include the four Golden Signals, are best thought of as defining what it means for a given system to be “healthy.”

Different types of applications will have distinct metrics. For example, user-facing services might care about availability, latency, and throughput, while big data systems tend to focus on throughput and end-to-end latency. It is worth noting that the measurement isn't an end in and of itself. What is important is how it indicates the quality of user experience and system effectiveness.
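
The Google SRE book names the four Golden Signals as latency, traffic, errors, and saturation. As a rough sketch of how they might be summarized for one measurement window, consider the following. The `Request` record, the field names, and the idea of passing saturation in as a utilization figure measured elsewhere are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, Sequence


@dataclass(frozen=True)
class Request:
    latency_ms: float  # time taken to serve the request
    ok: bool           # False if the request ended in an error


def golden_signals(requests: Sequence[Request], window_s: float,
                   utilization: float) -> Dict[str, float]:
    """Summarize latency, traffic, errors, and saturation for one window."""
    if not requests or window_s <= 0:
        raise ValueError("need at least one request and a positive window")
    return {
        "latency_ms_median": median(r.latency_ms for r in requests),
        "traffic_rps": len(requests) / window_s,
        "error_rate": sum(not r.ok for r in requests) / len(requests),
        # Saturation depends on the resource that runs out first (CPU, memory,
        # queue depth), so this sketch takes it as an input.
        "saturation": utilization,
    }
```

A real monitoring system would compute these continuously and would usually track tail latency as well as the median, but the shape of the summary is the same.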


Using SLOs and SLIs to measure reliability

Service level objectives (SLOs) are a common way to measure a service provider’s performance and can be equally important to site reliability engineering success. Clearly defined and measured SLO metrics at the product and service level help organizations to:

• Tune investment and overall prioritization to meet reliability goals and meaningfully adjust those high-level reliability goals to fit company strategy
• Maintain and build customers’ confidence
• Enable teams to decide when and how to focus efforts on reliability
• Allow engineers to make better assumptions about risk tolerance and how fast they can go, and reason better about dependencies and reduce unnecessary toil

As an example, Stephen Weber, a Senior SRE at New Relic, told us that the New Relic core data platform has three key metrics: The first is correctness (Were the correct results returned?), and the second is latency (Did it respond in an acceptable amount of time?). “And then the third is, to ensure that they are getting good latency, a safety valve technique to stop processing and provide partial results (also known as graceful degradation). And so they have a third SLI of keeping that to a minimum.” The three metrics together form the SLO for the core data platform.

If teams consistently exceed their SLOs (for example, 99.9% availability for all services), they may be able to move faster, take on more risk, and deliver more features. If a team is in danger of missing its SLOs, or is already missing them, it’s a signal to back off and pause to focus on reliability so that the team can start moving faster again.

SRE teams may also have other service level indicators (SLIs) that they use to measure reliability that are not necessarily part of their SLO. These performance metrics track some facet of the business; for example, an SLI for a database service could be something like, “The fraction of user queries that are successfully completed within 200 milliseconds without error.”

To measure reliability, teams turn to metrics like mean time between failures (MTBF), mean time to detect (MTTD), and mean time to resolution (MTTR), all of which help organizations define their “risk matrices.” These become powerful tools for prioritizing issues and risks that will have a quantifiable impact on SLOs, and they also allow organizations to downshift on issues that may not be especially urgent.

The hallmark of a good SLI/SLO is the metric’s relevance to the business outcomes, often the user experience. For example, a high error rate or slow response time has a negative impact on the user experience. High CPU utilization might have a negative impact on the user experience, but the relationship between high CPU and a bad user experience is harder to establish.
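
The database SLI quoted in this section, the fraction of user queries completed within 200 milliseconds without error, translates almost directly into code. This is a minimal sketch under stated assumptions: each query is represented as a hypothetical `(latency_ms, succeeded)` pair, and “within 200 milliseconds” is read as less than or equal to the threshold.

```python
from typing import Iterable, Tuple


def latency_sli(queries: Iterable[Tuple[float, bool]],
                threshold_ms: float = 200.0) -> float:
    """Return the fraction of queries that succeeded within the threshold."""
    total = 0
    good = 0
    for latency_ms, succeeded in queries:
        total += 1
        if succeeded and latency_ms <= threshold_ms:
            good += 1
    if total == 0:
        raise ValueError("no queries observed")
    return good / total


def meets_slo(sli: float, slo: float) -> bool:
    """Return True when the measured SLI is at or above the objective."""
    return sli >= slo
```

Note that a slow success and a fast failure both count against the indicator, which is what ties it to user experience rather than to raw uptime.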


Error budgets

Finally, although perhaps not essential, it can be helpful to define a quarterly error budget based on a service’s SLO. The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter.

Teams can have burn-down charts that show how quickly they are going through their error budget, and adjust work accordingly. Interestingly, at Google, if a service is providing 100% uptime, they will take the service down so dependent services are forced to know how to react.

Of course, if the error budget is too tight, it can slow the pace of development. Having an error budget in place allows you to reason about this and make a decision to perhaps relax it in order to be able to increase development team velocity. In that situation, the product and SLA engineers might decide to increase the allowable error count to enable faster development. Some organizations sort their apps into “high reliability” and “high velocity” and set stricter/looser error budgets accordingly.

Whatever the setting, an error budget is important because it aligns incentives and emphasizes joint ownership between software engineering and product development.
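
An error budget follows directly from the SLO: whatever the SLO does not promise is the budget. The sketch below assumes a request-based budget (a count of failed requests over the quarter); a time-based budget works the same way. The function names are illustrative and not taken from any particular tool.

```python
def error_budget(slo: float, total_requests: int) -> float:
    """Return the number of failed requests the SLO allows over the window."""
    if not 0.0 <= slo <= 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1.0 - slo) * total_requests


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    The result is negative once the budget has been overspent.
    """
    budget = error_budget(slo, total_requests)
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return 1.0 - failed_requests / budget
```

With a 99.9% SLO and one million requests in the quarter, the budget is about 1,000 failed requests. A team that has seen 250 failures has about three quarters of its budget left; plotting that value over time gives the burn-down chart described above.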


Chapter 2: What Makes an SRE Successful?

When choosing an SRE, a candidate’s technical contributions will depend on how a particular organization defines or approaches the role: One company might require more software engineering and coding experience, whereas another organization might place a higher value on operations or QA skills. Whatever the balance, what sets the “great” apart from the “good enough” is often a combination of habits and traits that complement technical expertise.

Here’s how you’ll know you’ve found a fantastic SRE.

SREs see the (much) bigger picture

Successful software developers understand how their code helps drive the overall business, and great SREs have their own version of this trait. “You’re looking for someone who is thinking about the bigger picture outside of the day-to-day,” said Jason Qualman, a Senior Software Engineer at New Relic. “A successful SRE is someone who can understand and interpret things at a higher level.” Changes can create risks or impacts down the road, not just in that current moment, and a good SRE is sure to perform a thorough analysis before making any changes.

The ability to consider how their work will affect the rest of a particular system, team, or the larger infrastructure is the kind of extreme pragmatism that SREs need. There’s little long-term upside in a siloed approach that throws a change over the wall with no concern for how it might affect the person sitting on the other side.

“We are making decisions very low in the stack,” Qualman said of the SRE. Those decisions will affect people much further up the stack. Good decisions enable seamless transitions.

SREs are curious and empathetic

Kat Dober and Stephen Weber, both Senior SREs at New Relic, cite curiosity as a key trait
they look for in an SRE.

“You’re looking for people who have that engineering mind-set,” according to Dober. “You
want to know how it works. You want to know the ways that it might fail. You want to be
thinking about those from the beginning.”

Weber agrees: “Oftentimes, the improvements that have been most beneficial started
with ‘Oh, that’s funny.’ And then you keep digging into that.”

A related trait is customer empathy, according to Weber. “Maybe your page-load average is
pretty good, but if some subset of customers are experiencing really long load times, you
need to see that bad experience,” said Weber.
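
Weber’s point about averages can be made concrete with a few lines of Python. This is a minimal sketch, not a description of how New Relic computes anything; the function name and the sample data are invented for illustration.

```python
from statistics import mean, quantiles
from typing import Dict, Sequence


def load_time_summary(load_times_s: Sequence[float]) -> Dict[str, float]:
    """Return the mean alongside the 50th and 99th percentile load times."""
    # quantiles(..., n=100) returns the 99 cut points between percentiles,
    # so index 49 is the 50th percentile and index 98 is the 99th.
    cuts = quantiles(load_times_s, n=100, method="inclusive")
    return {"mean": mean(load_times_s), "p50": cuts[49], "p99": cuts[98]}


# Hypothetical sample: 95 fast page loads and 5 very slow ones.
sample = [1.0] * 95 + [12.0] * 5
summary = load_time_summary(sample)
```

For this sample the mean is 1.55 seconds and the median is 1 second, both of which look healthy, while the 99th percentile is 12 seconds. The customers behind that tail are the bad experience an average hides.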


SREs automate at every opportunity

While there will always be some manual exploration involved in the role, SREs look to reduce work “toil.” Toil has a specific meaning at Google, given by Vivek Rau in the SRE book as “the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”

In a follow-up article to his chapter published via Google Research, Rau et al. provide a case study for reducing toil for the SRE team supporting Google’s Bigtable service. “Bigtable SRE was able to create a snowball of work reduction: each incremental reduction of toil created more engineering time to work on future toil reduction…. by 2014, the team was in a much-improved place operationally: they reduced user requests from a peak of more than 2200 requests per quarter in early 2013 to fewer than 400 requests per quarter.”

The key to achieving this was to gradually increase the amount of automation for the various common types of support requests. The more general lesson is that SREs will typically focus on automation as a key technique for reducing painful manual tasks and toil.

“Automation really comes in once you understand your problem space or once you understand your infrastructure, and there are things that you know are going to have to be done continually,” Dober said. “For example, think about how you’re going to configure all your hosts, or how you’re going to get a piece of code from the repo that it’s in, packaged up into an artifact or container, and deployed across your infrastructure. Automating those tasks reduces toil but it also makes sure the tasks get done consistently and correctly every time.”

Qualman agrees. “A lot of this role is thinking about inefficient and time-consuming things people are doing and putting a stop to them as soon as possible,” he said. “Instead of kicking a can down the road on manual work, you’re saying, ‘I’m going to take the time to automate this right now and stop anyone else from having to do this painful thing.’”

This obsessive focus on automation is a key tenet of both SRE and DevOps philosophy; in fact, “The DevOps Handbook” has a chapter that discusses the counterintuitive effects of manual acceptance processes. And “automation” and its variants seem to appear more often than any other word in SRE job descriptions. It’s not that unexpected to see “Automate, automate, automate, and then…automate!” as a key responsibility in an SRE job listing.

SREs are change agents

The confidence to advocate for SRE initiatives is another skill that distinguishes the best
SREs. Part of the job, simply put, involves convincing other people to do things they initially
might not want to do; for example, convincing a software engineer focused on quickly
shipping a product feature to think about ways to scale that feature over the next several
years.

This is something that can be more easily accomplished if the SREs are directly embedded in the product teams. Speaking at QCon Plus, Johnny Boursiquot, a Site Reliability Engineer at Salesforce’s Heroku, talked about SRE adoption in a presentation called “The SRE as a Diplomat,” during which he recommended the practice of embedding SREs in existing product teams as a way of driving change. “No two organizations implement the practices of site reliability engineering the same way,” Boursiquot observed, “a fact that is seldom recognized when rolling out an SRE function for the first time.”

Expanding on this theme, he said:

“While there exists a set of best practices for its adoption, those that take on the task of championing SRE within their organization know that those prescriptive approaches do not provide all the pieces necessary for that adoption to be a smooth and immediately impactful one.

“Nowhere is this challenge of adoption more prevalent than in organizations where teams have complete ownership of a service from its development to its ongoing operational needs. In these organizations, it is common, even necessary, for team-specific practices to develop. This total ownership model works well to move business objectives forward in the early part of a system’s life cycle but eventually and insidiously morphs to become unaddressed technical debt when maturing teams need to adopt shared reliability practices and tooling.

“Bridging this gap between the intent of leadership and the practical implication within teams requires change agents, in the form of SREs, to be embedded within these teams. Teams that see themselves as self-sufficient are not always incentivized to work with a traditional and external SRE function requiring changes on how they operate, even if those changes will markedly improve things. Regardless of the reasons, building bridges across these teams requires that we first establish trust. Of course, one way to facilitate this trust building is to embed SRE directly within those teams.”

In other words, great SREs have to be effective salespeople; they have to be able to sell their colleagues on processes and projects that might appear to involve some near-term pain or go against legacy norms. “You need to be able to dig in and say, ‘stop’ and ‘no,’ which can be difficult to do in some engineering organizations,” according to Beth Long.

SREs embrace new tools and approaches (when necessary)

Because site reliability engineering is still fairly new, many engineers who currently hold the title worked in other jobs before assuming the role. Some SREs might have a developer background, while others may come from traditional operations or sysadmin backgrounds, so hiring managers are best served by not pigeonholing the SRE role to one particular background. A traditional QA engineer might have a good skill-set for the SRE position, for example.

No matter your background, the SRE role requires you to be pragmatic and willing to adapt. It challenges you to move out of your comfort zone and develop new skills. “I interact with many different systems, different programming languages, different styles of YAML that I never really thought I would ever do, versus when I was a developer,” said Weber. “Writing five different programming languages in a day is not necessarily unusual, so you just need to be willing to be flexible and jump in.”

Chapter 3: SRE Tools and Processes

For an SRE, part of being pragmatic means being willing to dump processes, procedures, and tools that may have been well-intentioned but are no longer productive.

Just as there’s no universal job description for SREs, there’s no standard toolset for the role either. However, great SREs always seek to optimize reliability tools and processes and evangelize them throughout the organization.

It makes absolute sense: optimization is key to a successful SRE practice and for proper implementation of DevOps principles. But what tools should SREs standardize on? Each team needs to decide what’s best for them. The good news is, there are plenty of choices.

Fig 1: The DevOps toolchain

Stages of the DevOps (and SRE) toolchain

If you drew up the stages of an SRE toolchain, it probably wouldn’t surprise you if it looked a lot like the DevOps toolchain (Fig. 1).

The increasing use of the public cloud and the corresponding rise in the use of infrastructure as code tooling mean that this is an area that sees particularly rapid change and churn. We can, however, outline some of the current widely used tools and practices.


PLAN:
This comprises agile project management and tracking tools such as Avaza, Jira, YouTrack, Trello, and Pivotal Tracker, or other task management tools.

BUILD:
Here you’ll find infrastructure as code tools, such as Ansible, Chef, Docker, Puppet, and Terraform, which make re-provisioning environments faster, more consistent, and more reliable. Containers and orchestrators, such as Kubernetes and Docker, also play a role, allowing developers and SREs to work against disposable, virtual replicas of production.

Source control and collaborative coding tools such as Bitbucket, GitHub, and GitLab as well as IDEs, such as IntelliJ IDEA and Visual Studio Code, are also widely used.

CONTINUOUS INTEGRATION AND DELIVERY:
It is increasingly common for developers to check code into a shared repository several times a day, running it through a suite of automated tests, and then automatically releasing the updated code to production if the test suite passes. The approach combines CI/CD tools such as AWS CodePipeline, Bitbucket Pipelines, CircleCI, and Jenkins with testing tools such as JUnit, Mabl, Sauce Labs, and Selenium. A critical point regarding continuous delivery is that while teams have software that is ready to deploy, they don’t necessarily deploy it immediately. (See Deployment below.)


Many New Relic customers also build pipeline dashboards to help track this stage of the process.

DEPLOY:
Deployment is a separate step if you are doing continuous integration and delivery but not yet continuous deployment. You use the same tools as the continuous integration step above, but the key difference is whether the default is to deploy the code as soon as it is ready. There are business reasons for not doing continuous deployment, but creating frequent, small, incremental updates that ship automatically is an SRE best practice.

OPERATE:
This typically involves monitoring tools, such as New Relic, alongside incident, change, and problem tracking tools, such as Jira Service Desk and Statuspage, PagerDuty, or Zendesk. At New Relic, our SREs and engineers use the log management capabilities and custom instrumentation.

AIOps tools, such as New Relic’s Applied Intelligence, can proactively monitor your services for anomalies and notify you with real-time failure warnings and actionable details so you can investigate faster. Incidents can be delivered directly into tools such as PagerDuty.

CONTINUOUS FEEDBACK:
This covers both the culture and the processes for collecting regular customer feedback, aided by tools such as GetFeedback, Slack, and Pendo. The feedback part of the loop also includes metrics on performance and processes, for example, Jira tickets for DRI (Don't Repeat Incidents) work. A release dashboard is also an example of continuous feedback.
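
The difference between continuous delivery and continuous deployment comes down to what happens by default once a build has passed its tests. The sketch below illustrates that decision only. It is not the behavior of any particular CI/CD product, and the `ReleasePolicy` enum, the `should_deploy` function, and the `approved` flag are hypothetical names chosen for the example.

```python
from enum import Enum


class ReleasePolicy(Enum):
    CONTINUOUS_DELIVERY = "delivery"      # build is ready; a person decides when to ship
    CONTINUOUS_DEPLOYMENT = "deployment"  # build ships as soon as it is ready


def should_deploy(tests_passed: bool, policy: ReleasePolicy,
                  approved: bool = False) -> bool:
    """Decide whether a finished build goes to production now."""
    if not tests_passed:
        # A failing test suite blocks the release under either policy.
        return False
    if policy is ReleasePolicy.CONTINUOUS_DEPLOYMENT:
        return True
    # Under continuous delivery the build is deployable but waits for approval.
    return approved
```

Under continuous delivery, a passing build still waits until someone approves the release; under continuous deployment, the same build ships immediately.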


Nothing is written in stone

The tools SREs use at any given time will depend on where an organization is on its SRE journey, and the shift we’ve seen to the public cloud has also changed the role considerably. New trends, including the ability to automate through AIOps tooling, will continue to redefine the role.

While less mature organizations will tend to use more specialized operations tools, more mature organizations will see more convergence between SRE and software engineering toolchains. So, while it’s certain that there’s no one-size-fits-all set of tools, SREs should experiment with and adopt the right tools as they seek new, more efficient ways to bring greater reliability to everything they do.

Chapter 4: The Evolving SRE Role at New Relic

Google’s Site Reliability Engineering book does a great job of outlining what a great modern SRE practice can look like in a DevOps world. But what about SRE practices at companies that aren’t as large as Google? For all that’s been written about reliability practices, it’s surprisingly hard to find specific, detailed descriptions of the day-to-day role that SREs play in other engineering organizations. Most descriptions on the internet contain rather vague phrases like, “SREs combine software engineering and operational skill sets” and “SREs automate all the things.”

Defining the role

Creating the New Relic SRE description took time and involved input from individual SREs and executive leadership.

SREs at New Relic are engineers who focus on, and are recognized primarily for, improving the reliability of our systems. From a business perspective, the goal of the work that SREs do is to build and maintain customers’ trust, and allow the business to scale by steadily decreasing the per-service and per-host operational overhead of New Relic’s platform.

At a high level, SREs make this happen by:

• Championing reliability best practices

“We’re
• Guiding designs and processes with an eye toward resilience
and low toil
• Reducing technical complexity and sprawl

building the
• Driving the usage of tooling and common components
• Implementing software and tooling to improve resilience and
automate operations

Evolving the role

When New Relic first created its SRE function, it was based
reliability
practices
around a centralized team, very much as described by Google,
but New Relic now has SREs permanently embedded into the
various product teams. This latter approach is similar to the one

into the tools


that Boursiquot described earlier.

Gus Shaffer, a Senior Director of Engineering in the Telemetry


Data Platform group, which has a high concentration of embed-
ded SREs, told us that having a centralized function for reliability
worked against the DevOps goal of having one team responsible that people
for coding and release. “We found that there was an abdication of responsibility for reliability, where people are like, ‘Oh, well, there’s a reliability organization, they’re responsible for reliability,’” Shaffer explained. “When, in fact, the reliability organization was actually responsible for measuring and reporting and helping people figure out what the trends are in their reliability, and putting together processes and policies to help people do the right thing.”

Stephen Weber, Senior SRE at New Relic, echoed this view: “I think the biggest advantage of going from that central team to embedding on the platform teams is that we’re taking on the idea of building the reliability practices into the tools that people are using.”

The new structure makes it easier for the New Relic SREs to stay current with the overall product architecture. The structure change also reduces the amount of auditing work and performing the role of “bad-cop” that SREs are often required to do. Moreover, it made it easier for SREs to spend more time on development, a different way of achieving the same goal that Google’s 50% cap aims for. In other words, changing the structure eliminated several problems, effectively executing an Inverse Conway Maneuver.

The change does come with its own set of challenges, however. One is that the lack of a centralized SRE function makes it harder to deal with cross-cutting concerns. For New Relic, an example is Apache Kafka, which is used for all New Relic’s data pipelines. Extensive use means that it is vitally important that the platform’s various clients use it as efficiently as possible. To help ensure this, New Relic is looking at introducing quotas and has spun up a production engineering team with a rotating roster of engineering staff. “We’ve brought in people from all these different teams so that we have subject matter expertise in all the different systems within the data platform,” Shaffer explained. “It means that we have instant buy-in on making these changes, because people that are on the teams that are being impacted are part of this ‘centralized’ SRE team.”
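
To make the idea of per-client quotas on a shared data pipeline concrete, here is a minimal Python sketch. It is not Kafka's API and not New Relic's implementation: the `ClientUsage` type, the client names, and the 8 MB/s limit are all hypothetical, chosen only to show how a team might spot which clients exceed a shared byte-rate quota.

```python
from dataclasses import dataclass


@dataclass
class ClientUsage:
    client_id: str        # hypothetical client identifier
    bytes_per_sec: int    # observed produce rate for this client


def over_quota(usages, quota_bytes_per_sec):
    """Return (client_id, overage) pairs for clients above the quota, largest overage first."""
    offenders = [
        (u.client_id, u.bytes_per_sec - quota_bytes_per_sec)
        for u in usages
        if u.bytes_per_sec > quota_bytes_per_sec
    ]
    return sorted(offenders, key=lambda pair: pair[1], reverse=True)


# Illustrative numbers only.
usages = [
    ClientUsage("ingest-a", 4_000_000),
    ClientUsage("ingest-b", 12_000_000),
    ClientUsage("ingest-c", 9_500_000),
]
print(over_quota(usages, quota_bytes_per_sec=8_000_000))
```

A report like this gives a cross-team group a neutral starting point for the conversation with the teams whose clients are over the limit, before any quota is actually enforced.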


The SRE role at New Relic has also evolved in response to other
factors. New Relic is increasingly moving toward using public
cloud infrastructure rather than its own data centers. That shift
has resulted in a corresponding change in how New Relic’s teams
work with software-defined infrastructure.

The change to the public cloud also means that using cloud resources efficiently has become an increasingly important part of the role. “The SREs are not necessarily the ones who are watching the AWS bill,” Shaffer told us, “but they are responding to signals from leadership, like ‘This system seems really expensive, more so than it probably should be, can you look into that?’ It is also a part of the capacity management function that you don’t over-provision.”
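
As a rough illustration of what checking for over-provisioning can look like, the Python sketch below flags services whose peak usage is a small fraction of what is provisioned. The service names, the capacity units, and the 40% threshold are assumptions made for the example; they do not come from New Relic.

```python
def overprovisioned(services, min_utilization=0.4):
    """Return names of services whose peak utilization falls below the threshold.

    Each entry is (name, provisioned_units, peak_used_units).
    """
    flagged = []
    for name, provisioned, peak_used in services:
        if provisioned <= 0:
            continue  # nothing provisioned, so nothing to evaluate
        if peak_used / provisioned < min_utilization:
            flagged.append(name)
    return flagged


# Hypothetical services and capacity figures.
services = [
    ("query-api", 100, 72),
    ("batch-export", 200, 30),
    ("alert-eval", 50, 19),
]
print(overprovisioned(services))
```

The threshold is a judgment call: set it too high and the check flags headroom kept deliberately for demand spikes, so a real review would weigh it against the team's capacity plan.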


What SREs do at New Relic

To summarize, the following table provides a high-level overview of the current SRE role at New Relic.

TYPE OF WORK: Learn and enhance New Relic operational and reliability best practices (e.g., capacity planning, SLOs, incident response), and work with teams to adopt those practices.
EXAMPLES:
• Update your team’s risk matrices.
• Manage capacity in advance of customer demand.
• Think about costs and the way we use cloud resources effectively.
• Influence the team to prioritize the most important reliability work.
NOTES:
• This is a particular focus for new SREs and SREs working with new teams.
• All SREs stay current on platform tooling and SRE community best practices.

TYPE OF WORK: Build or help teams adopt core shared internal components.
EXAMPLES:
• Work with teams to migrate systems into a new version of our shared deployment pipeline.
• Contribute code or tools to our container runtime platform.
• Limit technical sprawl by guiding teams to select appropriate existing tools rather than building new ones.
NOTES:
• SREs are expected to use existing tools rather than introducing new tools or systems.

TYPE OF WORK: Improve the monitoring and observability of the New Relic platform.
EXAMPLES:
• Work with teams to clean up noisy unused alerts and ensure that important problems are alerted on.
• Build integrations to create new visibility into our platform.
NOTES:
• SREs actively use and extend existing New Relic products whenever it’s possible and effective to do so and to influence product management to implement necessary features when it’s not.
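
The alert cleanup mentioned above can be sketched in a few lines of Python. This is an illustrative triage under stated assumptions, not a New Relic feature: the alert names, the (fired, acted_on) counts, and the 20% action-rate cutoff for “noisy” are all hypothetical.

```python
def triage_alerts(alerts, noisy_ratio=0.2):
    """Group alerts into 'unused', 'noisy', and 'keep'.

    Each entry is (name, times_fired, times_acted_on) over some review window.
    """
    result = {"unused": [], "noisy": [], "keep": []}
    for name, fired, acted_on in alerts:
        if fired == 0:
            result["unused"].append(name)   # never fired in the window
        elif acted_on / fired < noisy_ratio:
            result["noisy"].append(name)    # fires often, rarely needs action
        else:
            result["keep"].append(name)
    return result


# Hypothetical alert history.
alerts = [
    ("disk-full", 12, 10),
    ("cpu-spike", 300, 6),
    ("legacy-queue-depth", 0, 0),
]
print(triage_alerts(alerts))
```

An alert that never fired is not automatically safe to delete; it may guard against a rare but important failure, so the “unused” list is a prompt for review with the owning team rather than a deletion list.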


Set up your SREs for success

Although this SRE role description and approach works well at New Relic, it may not be right for other organizations. Regardless, it provides a useful example and helps clarify the tremendous value a great SRE practice can bring. By developing your own guidelines, you can set up SREs for success and advance the collective understanding of the vital role the SRE practice will play as it matures to support the ever-increasing complexity of computing platforms.

Finally, it’s critical to create a community of practice and mentor/mentee relationships for SREs and others who care deeply about reliability and sharing best practices. That’s what creates a culture of reliability.


Execution

Once you define the SRE role and have the right organizational
structure and incentives in place, it all comes down to execution.
A successful SRE team depends on a variety of skills and traits.
You can always teach technical skills, but you can’t necessarily
impart equally essential qualities such as empathy and curiosity.

Some engineering cultures, such as New Relic’s, prize autonomy, but that doesn’t mean teams should have to tackle reliability independently. Teams (and individual SREs) need organizational support, communication, and, above all, trust to thrive.

A guiding philosophy for successful SREs might be expressed this way: Don’t chase a holy grail; you can’t prevent things from ever breaking. Instead, work tirelessly to see the big picture, incorporate automation, encourage healthy patterns, learn new skills and tools, and improve reliability in everything that you do. Perfection may be unattainable, but continually striving to do things better is the way to get as close as possible.

