
Catchpoint 2021 SRE Report

The 2021 SRE Report highlights the importance of site reliability engineering (SRE) in managing digital business services while addressing challenges like cost and complexity. Key findings include a notable reduction in toil levels, the need for platform operations due to multi-provider strategies, and the slow adoption of AI in monitoring. The report emphasizes the necessity of establishing baselines for SRE activities to measure improvements effectively and encourages SREs to align their practices with business outcomes for greater impact.

2021 SRE Report

FREE TO BE AN SRE

Contributing Partners: DevOps Institute
Foreword
How can leaders grow digital business services without incurring the problems associated with rapidly increasing scale, such as increased cost, risk, and complexity? The obvious part of a larger answer (especially given the name of this report) is to implement site reliability engineering (SRE).

SRE is a discipline which fundamentally asks software engineers to design operations teams. It applies aspects of software engineering to infrastructure and operations problems. The result: ultra-scalable and exceptionally reliable distributed software systems.

SRE complements DevOps by measuring and achieving the reliability of applications and services working in production and DevOps infrastructures in a prescribed manner. It combines the use of error budgets, team relationships brokered by collaboration, ops-as-code, and reliability control practices to ensure deployments meet Service Level Objectives (SLOs).

In analyzing the data from almost 300 practitioners and experts who participated in our survey on the state of SRE, we found some familiar trends and some provocative anti-patterns. As you go through the results, you will also discover interesting changes from the previous year’s survey, not to mention some noteworthy correlations.

Why should you read this report?
• Find out about the expected patterns and surprising anti-patterns of SRE.
• Gain insight into the industry’s most comprehensive SRE tenet baselining data.
• Understand usage of the five major monitoring components (digital experience, application, infrastructure, AIOps, and network).
• Learn from an example value stream to bridge the IT-to-business conversation gap.

Before continuing, you should relax your baselining efforts when trying to match the Google definition of SRE. There are too many dimensions for a one-size-fits-all approach.

You are hereby free to be an SRE.

Mehdi Daoudi
CEO, Catchpoint

1 2 3 4 2021 SRE Report / 2


Contents
Foreword
Introduction
SRE Report Highlights
Key Finding 1: Levels Of Toil Lower Around the World; Budget Usage Less Than Expected
Key Finding 2: Multiple Providers Highlight the Need For Platform Ops
Key Finding 3: The Shift Toward AIOps Is Slow
Key Finding 4: Observability Must Include Digital Experience Metrics and Business KPIs
Conclusion
Demographics and Firmographics
Bios


Introduction
We are grateful to all the SREs worldwide who responded to this year’s SRE Survey. Almost 300 site reliability engineers responded, working in tech, the financial sector, media and entertainment, telecom, manufacturing, and other industries. Thank you so much for sharing your precious time and making this year’s SRE Report the largest of its type in the field.

As we consider the past year’s global events, current emerging trends, and future thoughts on how SRE
implementations will change the world, we hereby declare:

SREs are beautiful.


SREs build reliable services.
SREs build resiliency in themselves.
The SRE community is strong.

ACKNOWLEDGEMENT AND CONTRIBUTIONS

SRE Report Passioneers
Thank you to Leo Vasiliou, Eveline Oehrlich, and Anna Jones. This report would not have been written without your passion and inspiration.

SRE Report Contributors
Thank you to Holly Chessman, Laura Meyer, Neil Myers, Gordana Neskovic, and Stela Udovicic. Your contributions shaped this entire report.

Supporting Partners
Thank you from all of us at Catchpoint to DevOps Institute and VMware Tanzu. The sum of our partners is greater than the whole.

Special Thanks
Thank you to our industry experts: Kurt Andersen, Simon Aronsson, Helen Beal, Johnny Boursiquot, J. Bobby Dorlus, Tamara Miner, Sanjeev Sharma, Gaurav Shukla, and Jaime Woo. Your wisdom is an invaluable part of this year’s report.


SRE Report Highlights
This year’s report reveals a wide range of fascinating insights. Although you will definitely want to read the report in its entirety,
here is a sneak peek into a few of the insights we’ll dig into:

Many companies don’t establish comparative baselines for SRE conditions
If you don’t have a baseline, it’s hard to tell what improves. For example, our survey showed that the average drop in toil from last year was 15%. Was this due to COVID-19, people doing less busy work at home, and the work feeling more meaningful? Will self-reported toil rise next year as offices open and the pain, problems, and challenges of the year before are reintroduced?

Companies are using a multi-provider strategy across the board
From cloud to DNS to API to CDN, it’s all about the numbers. But managing a variety of providers is complicated, making the need for Platform Ops rise to the fore. While 53% of SREs rank “providing trainings on third-party platform capabilities” as a minor or not applicable activity, maybe it’s time to re-think that stance.

AI is exciting but it has not yet been fully adopted or embraced
53% of SREs say the number one cloud application monitoring challenge is unified visibility across the stack. When monitoring and management strategies get to the point that they can depend on artificial intelligence and machine learning to improve decision making and automation, that problem will be vastly reduced. But for the most part, we’re not there yet.

SREs should look to expand the boundaries of observability so they can become more customer-focused
Blending business, performance, and operations insight with monitoring is key to growing the SRE role. To make SRE successful, SREs need to tie their success back to the business and get business leaders onboard with SRE practices. That will be of huge benefit to SREs, businesses, and customers alike.


BASELINE
KEY FINDING 1

Levels Of Toil Lower Around the World; Budget Usage Less Than Expected
Are key SRE tenets of toil reduction, time on call, or the dev versus ops split increasing or decreasing?
Without an established comparative baseline, SREs and businesses will have a difficult time knowing
whether conditions are getting better or worse.

We recommend companies ensure they baseline the amounts of time being spent by SREs on any
single category of activity. Then, compare internal baselines with industry data, such as this SRE Report.
If the baseline is leaning too far from the “50/50 split,” with ops predominating, the nature of SRE
work needs to be rethought. This starts with the measurement of how SRE time is spent (since what is
measured is what will improve).
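The baselining recommendation above can be illustrated with a minimal sketch. It assumes SRE time is logged as (category, hours) entries; the category names and the dev/ops mapping below are hypothetical and are not taken from the survey.

```python
from collections import defaultdict

# Hypothetical mapping of activity categories to "dev" or "ops" work.
# Categories missing from this mapping count toward neither side.
CATEGORY_KIND = {
    "incident_response": "ops",
    "on_call": "ops",
    "manual_tasks": "ops",
    "feature_development": "dev",
    "automation": "dev",
}


def baseline_split(time_log):
    """Return (share of hours per category, overall ops share).

    time_log is a list of (category, hours) tuples.
    """
    hours = defaultdict(float)
    for category, spent in time_log:
        hours[category] += spent
    total = sum(hours.values())
    if total == 0:
        raise ValueError("time log is empty")
    shares = {category: spent / total for category, spent in hours.items()}
    ops_share = sum(
        share
        for category, share in shares.items()
        if CATEGORY_KIND.get(category) == "ops"
    )
    return shares, ops_share


def ops_predominates(ops_share, cap=0.5):
    """True when ops work exceeds the cap (the 50/50 guideline by default)."""
    return ops_share > cap


# Example with made-up hours for one SRE over one week.
week = [
    ("incident_response", 10),
    ("on_call", 8),
    ("manual_tasks", 6),
    ("feature_development", 10),
    ("automation", 6),
]
shares, ops_share = baseline_split(week)
print(shares, ops_share, ops_predominates(ops_share))
```

Running the same calculation each quarter gives an internal baseline that can be compared with industry data such as the figures in this report.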



Prioritizing Value-Based SRE Activities
To inspire innovation and solve complex problems, SREs perform all types of activities. This includes low-level operational tasks, tactical implementations, and high-level strategic initiatives. The derived value of some of these activities is quickly realized. Others may take time to show value or the value may be more difficult to measure.

This is why the desire to “automate ALL the things!” is a core SRE tenet. People want to give themselves the best chance of focusing on value-based activities. In other words, for SREs, the focus of automation is less about deploying code or releases as fast as possible. Instead, it is aimed at delivering value to the customer with a risk-mitigating, scalable approach that balances the need for agility against the need for stability.

For the first key finding, let’s explore the core tenets of SRE, including the amount of time SREs are spending on toil, the Dev versus Ops split, and the use of error budgets. After all, how can you measure improvements without a baseline?

SRE Activity Breakdown N/A Minor Moderate Major

Responding to incidents or outages 6% 14% 36% 44%

Post-mortem analysis and/or write-ups 9% 17% 39% 35%

Participating in on-call rotation 13% 26% 30% 31%

Developing applications or capabilities 7% 24% 42% 27%

Experimenting or receiving training to expand knowledge or skills 10% 23% 40% 27%

Authoring business processes, rules, or best practices 9% 20% 45% 26%

Performing audits of usage/cost allocation 13% 35% 31% 21%

Spinning up new hosts/instances 17% 32% 30% 21%

Planning release roadmaps 22% 30% 28% 20%

Performing chaos engineering exercises 24% 33% 26% 17%

Providing trainings on third-party platform capabilities 18% 34% 33% 15%

Load testing or other capacity management activities 14% 39% 32% 15%


Toil Drops In 2021… But Is This Temporary?

Median toil value, 2021: 25%
Median toil value, 2020: 40%

“Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”
Site Reliability Engineering, O’Reilly Media, 2016


How Is Toil Measured?
The average drop in toil year-over-year is 15%. This marked drop is important to note because toil numbers across the entire distribution (not just the respondent median) are down from 2020 to 2021. That said, it is worth noting these are self-reported figures. Additionally, while around 45% of SREs said they measure toil, only 22% said it is measured scientifically. The other 54% said they either do not measure or track toil or that they are trying to do so, but are “finding it challenging.”

How Is Toil Measured

I/We do not measure or track toil today. 34%

Toil is measured anecdotally (e.g., vent sessions in the hallway). 24%

Toil is measured scientifically (e.g., detailed tracking via a system). 22%

I/We are trying to measure toil, but are finding it challenging. 20%

Causes Of Toil Minor Moderate Major N/A

Too much technical debt 17% 36% 42% 5%

Priorities or goals are not aligned 21% 40% 32% 7%

The business value to fix is hard to realize 23% 40% 28% 9%

Lack of training or support 28% 43% 20% 9%

Lack of collaboration 35% 35% 17% 13%

COVID 49% 15% 9% 27%

Why Did Toil Numbers Drop?
Only 9% of SREs said COVID was a major factor for high amounts of toil. This leads us to theorize that the universal reported drop in toil is related to the global shift to working from home over the last year. Did the shift in related working conditions and priorities mean that work felt more meaningful and therefore less like toil? That could be the case, but we’re not sure.

So, will toil remain lower? Probably not. We expect the amount to rise again next year when a hybrid office is more of a reality. Note that the “hybrid office” could involve employees continuing to work from home, returning to the office full- or part-time, or some combination of the two.


Does Google’s Definition Of SRE Matter?
Google famously places a 50% cap “on the aggregate ‘ops’ work for all SREs: tickets, on-call, manual tasks, etc… Google’s rule of thumb is that an SRE team must spend the remaining 50% of its time actually doing development.”

Ultimately, the Google foundational definition of SRE is a guideline, not a rule. This is not to say that businesses should not baseline core SRE tenets against the Google definition. After all, entities performing both development and operational activities will be driven by the passionate desire to solve problems. However, businesses should focus primarily on inspiring and incentivizing SREs to achieve goals and objectives instead of trying to match the elusive 50/50 dev versus ops split. The drive to solve complex problems should always outweigh semi-arbitrary decisions, such as how an organization is structured or whether a business should mandate the use of open source for all tool selection.

SRE is fundamentally more about people and processes than technology. It’s a good sign that SREs across the globe are spending time on development activities. The purpose of SRE is to bridge the gap between platform design, development, and operational execution. This means SREs have to shift-left (which they are doing) into development to share the wisdom of production to those teams delivering value-adding products and services to that production environment.

[Figure: Time Spent On Dev Work (Versus Ops Work) and On Call. Axes: % of time against % of SREs. Series: % of work on dev, % of work on call. The median value of time spent exclusively on development is 40%. The median value of time spent on call is 20%.]


SRE Budgets: Hardly Outweighs Highly In All Categories
When asked how budgets are used (whether performance, toil, training, or error budgets), “hardly” outweighed “highly” in all categories. In fact, in terms of error budgets, only 20% of SREs said they were “highly” using them.

Using error budgets demands that service level objectives (SLOs) and user journeys are designed and captured within service level agreements (SLAs). The next step in the journey toward wider usage of error budgets by SREs is to connect service level indicators (SLIs) and set SLOs.

While an error budget is calculated based on specific SLOs, you also need to establish specific error budget policies to determine what actions need to be taken if the error budget is depleted. When looking at the work done around SLOs, we find that half of SREs are continually refining their SLOs. This is a good indication that, even though error budgets are not used as highly as we anticipated, businesses are working to find the right balance between innovation and reliability.

Think of error budgets as something that forces your SRE team to consider whether the right metrics are in place to meet company expectations. If you are not using them appropriately, the speed at which software is delivered and changes are implemented is likely impacting your company’s customers.

The goal is to shift left toward being able to address reliability problems sooner. The earlier you catch issues, the cheaper they are to fix.

Budget Utilization Hardly Moderately Highly N/A

Performance budgets 30% 35% 21% 14%

Error budgets 32% 35% 20% 13%

Training budgets 33% 37% 17% 13%

Toil budgets 40% 33% 9% 18%



“ The business or the product must establish the system’s availability target.
Once that target is established, the error budget is one minus the availability
target. A service that’s 99.99% available is 0.01% unavailable. That permitted
0.01% unavailability is the service’s error budget. We can spend the budget on
anything we want, as long as we don’t overspend it.”
Site Reliability Engineering,
O’Reilly Media, 2016
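The arithmetic in the quote above can be written out as a short sketch. The function names, the 30-day window, and the request-based accounting are illustrative assumptions, not something the report or the quoted book prescribes.

```python
def error_budget(availability_target):
    """Error budget as a fraction: one minus the availability target."""
    if not 0 < availability_target < 1:
        raise ValueError("availability target must be between 0 and 1")
    return 1 - availability_target


def allowed_downtime_minutes(availability_target, window_days=30):
    """Minutes of unavailability the budget permits over the window.

    The 30-day default window is an assumption for illustration.
    """
    return error_budget(availability_target) * window_days * 24 * 60


def budget_remaining(availability_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent; negative means overspent.

    Assumes availability is measured as the share of successful requests.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = error_budget(availability_target) * total_requests
    return 1 - failed_requests / allowed_failures


# A 99.99% target leaves a 0.01% budget.
print(error_budget(0.9999))
print(allowed_downtime_minutes(0.9999))
print(budget_remaining(0.999, total_requests=1_000_000, failed_requests=250))
```

For a 99.99% target, the budget works out to about 4.32 minutes of downtime in a 30-day window. An error budget policy would then define what happens when the remaining fraction approaches zero.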

Service Level Attributes

We continually refine our SLOs. 50%

We publish our SLOs to our users or customers to set expectations. 30%

We look at SLOs in relation to system boundaries and then define tiers. 29%

We know how each of our SLOs are treated as part of business SLAs. 28%

We work to avoid triggering the consequences of missed SLOs. 29%

We regularly compare SLIs and SLOs to decide what actions to take. 29%

It is easy to find the right data to support our SLOs. 23%

It is easy to choose SLOs. 27%


SRE Spotlight


Sanjeev Sharma
SVP, Head of Automation and Platform Engineering, Truist Financial
@sd_architect
SRE from: Ashburn, VA
Favorite: All things SciFi and Superhero, from James T. Kirk to Loki

“Most organizations and enterprises, no matter how small or large, are not like Google. Small and medium organizations don’t have Google’s scale. And large enterprises are not homogenous in their technology stacks or architecture like Google. So, yes, they all need SRE, just not in the same manner as Google does. They need to automate repetitive, typical tasks in operations in order to improve resilience and reliability and can learn from Google’s take on SRE. They just cannot, and should not, emulate how Google implemented SRE.”

Jaime Woo
Co-editor, 97 Things Every SRE Should Know
@jaimewoo
SRE from: Toronto, Canada
Favorite: Haruki Murakami’s What I Talk About When I Talk About Running

“If Google is correct that too much toil leads to frustration, boredom, and burnout, could it be that during a pandemic, with its enduring waves of stress, SREs chose to spend their time elsewhere? Either way, did anyone even notice, and what does that say about toil?”


SCALE
KEY FINDING 2

Multiple Providers Highlight the Need For Platform Ops
Businesses that fail to implement Platform Operations teams will eventually hit the scalability
ceiling. The rising use of multiple providers for same-service platforms (e.g., multiple cloud
providers), will drive the need for SREs to develop and offer normalized capabilities even though
underlying platform providers may have different interfaces and instrumentation.

We recommend businesses tactfully consider the assignment of SREs, or other development resources, to shared platform teams. The key result should be a higher level of scale. At that point, a single set of capabilities can then be treated as a product and re-used by many different teams.
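As a rough sketch of what a “normalized capability” over differing provider interfaces might look like, the example below defines one shared DNS interface and two adapters. Everything here is hypothetical: `DnsProvider`, `ProviderAAdapter`, `ProviderBAdapter`, and `publish_everywhere` are illustrative names, and the adapters are in-memory stand-ins that do not call any real vendor API.

```python
from abc import ABC, abstractmethod
from typing import Optional


class DnsProvider(ABC):
    """Normalized capability a platform team could offer to other teams."""

    @abstractmethod
    def upsert_record(self, name: str, ip: str) -> None:
        """Create or update the address record for a host name."""

    @abstractmethod
    def resolve(self, name: str) -> Optional[str]:
        """Return the stored address for a host name, or None."""


class ProviderAAdapter(DnsProvider):
    """Hypothetical provider that models a zone as a name-to-address map."""

    def __init__(self) -> None:
        self._zone = {}

    def upsert_record(self, name: str, ip: str) -> None:
        self._zone[name] = ip

    def resolve(self, name: str) -> Optional[str]:
        return self._zone.get(name)


class ProviderBAdapter(DnsProvider):
    """Hypothetical provider that models a zone as a list of entries."""

    def __init__(self) -> None:
        self._entries = []

    def upsert_record(self, name: str, ip: str) -> None:
        # Drop any existing entry for the host, then append the new one.
        self._entries = [e for e in self._entries if e["host"] != name]
        self._entries.append({"host": name, "address": ip})

    def resolve(self, name: str) -> Optional[str]:
        for entry in self._entries:
            if entry["host"] == name:
                return entry["address"]
        return None


def publish_everywhere(providers, name: str, ip: str) -> None:
    """Apply the same change to every provider through the shared interface."""
    for provider in providers:
        provider.upsert_record(name, ip)


providers = [ProviderAAdapter(), ProviderBAdapter()]
publish_everywhere(providers, "api.example.com", "192.0.2.10")
print([p.resolve("api.example.com") for p in providers])
```

Consuming teams depend only on the shared interface, so the platform team can add or swap providers without changing the callers.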



Multi-Provider Use Increases and Brings With It Greater Complexity
Let’s explore the number of multi-providers and levels of automation in use, the topology of SRE teams within the organization, and the nascent field of Platform Ops.

There is a vicious cycle of rising customer expectations causing businesses to manage ever-increasing complexities to meet those expectations. Migration to multiple clouds and using multiple third-party providers are two such examples of this trend. If the cloud is your new data center and the Internet is the new network, then third-party services like DNS and APIs are your new racks and cabinets.

In response to the question, “Do you have a multi-provider strategy?”, cloud led the way in all geographies. At the same time, it is clear that people are using a multi-provider approach across the board for DNS, API, and CDN. The reasons for this are numerous, from improved resilience of systems to the ability to leverage different providers’ strengths. However, managing a variety of providers brings with it greater complexity, from challenges posed by a reduction in visibility over the total infrastructure to confusion among teams over how best to take advantage of the different toolsets and rules of each vendor.

[Figure: Multi-Provider Usage. Share of respondents answering “Yes, or plan to,” “No, one provider,” or “N/A or maintain our own” for DNS, API, CDN, Cloud, and Data Center.]

Use Case Automation Levels Low Moderate High N/A

Release management 11% 33% 52% 4%

Infrastructure management 10% 36% 52% 2%

Application management 12% 42% 41% 5%

Network management 22% 39% 34% 5%

Incident management 33% 38% 25% 4%

Service level management 29% 41% 25% 5%


There Is No One Right Systems Reliability Engineering Topology
There are several different SRE models, each of which is organized and leveraged differently depending on the organization’s leadership, needs, challenges, and goals.
The following models are in use today:

CENTRALIZED SRE BUT SUPPORTING VARIOUS PRODUCTS OR PLATFORMS.
An SRE is typically associated with various products, applications, or platforms. These types of teams have SRE responsibility across a specific set of services and workflows with key boundaries. Here the SREs are functioning as an enabler to help the product or platform teams within the technical or product domain. The work of these SRE team members could include a variety of projects, such as the onboarding of specific technologies like Kubernetes, building out a mature infrastructure service mesh, working on continuous integration activities, or providing 24x7 support for specific applications and services.

DECENTRALIZED SRE ASSIGNED BY PRODUCT OR ANOTHER ATTRIBUTE.
These SREs are typically embedded into application/service teams to support the application team and are responsible for automating operations (runbooks) and deployment work, including monitoring and metrics population, as well as owning incident management around application issues. The advantage of this decentralized model is a solid integration within the business, with strong business value alignment and a new career path evolution for the embedded SRE.

THE SRE AUTOMATION TOOLS TEAM ESTABLISHES AND GUIDES IT AUTOMATION BEST PRACTICES.
Another model is that of the SRE tools team, which supports its software development counterparts in improving system reliability through automation work, leveraging existing tools, or creating scripts. Possible work tasks could include focusing on monitoring, observability, and analysis across a particular application stack or the instrumentation of key infrastructure ecosystems as a best practice for other teams to leverage. SREs can establish best practices, benchmarks, and automation architectures across the entire enterprise.

[Figure: SRE structure. Decentralized: 38% (by business product/service 23%, by platform (e.g., cloud) 8%, by stack component (e.g., infrastructure) 7%). Centralized and Hybrid account for the remaining 42% and 20%.]


Greater Size Means Greater Decentralization
Related to the previous point is the finding that as the number of SREs and employees within an organization increases, SREs
become more decentralized. Perhaps this path is unavoidable, given the different expertise of SREs. Since this pattern will likely
increase, businesses must ensure they are avoiding a return to silos.

One way of preventing silos from occurring is by creating a Platform Operations team. Such a plan allows decentralized teams the autonomy and flexibility to do their work, while the centralized Platform Ops team provides capabilities to its decentralized colleagues. These normalized capabilities can then be integrated across workflows.

[Figure: SRE Topology/Organization Size and SRE Topology/Number Of Employees. Percentage of respondents reporting a Centralized, Decentralized, or Hybrid structure, from fewer to more SREs and from fewer to more employees.]


SREs, Meet Platform Ops
Platform Ops is fast gathering momentum among organizations that want to scale and need greater consistency (in tooling, security, training, etc.) across different pipelines and products. A Platform Ops team can focus on creating self-service capabilities by applying automation for application developers. This process can encompass everything needed to build an automated DevOps value stream.

The goal is to produce a system where developers do not have to worry about infrastructure and operations tooling. Instead, they can put their efforts towards developing, while the Platform Ops team implements guardrails around cost, monitoring, and compliance (and more), without being a bottleneck. Platform Ops must collaborate with the development team, of course, to ensure that the designed workflows are improving the development team’s experience. At the same time, Platform Ops needs to ensure operational boundaries and rules are established and followed.

To ensure a successful Platform Ops implementation, focus on three critical checkpoints:

1. A shared, self-service platform that offers an array of capabilities.
2. The ability for those capabilities to be enhanced within the context of the overall business, through the use of development/engineering resources.
3. The ability for those capabilities to then be productized and internally marketed for consumption, able to fulfill the needs of many different internal consumers.

In our question about most performed activities by SREs, “providing trainings on third-party platform capabilities” ranked the lowest. With 52% of respondents selecting this activity as minor or not applicable, this represents a key opportunity for scaling SRE implementations through a Platform Ops program.


SRE Spotlight


Kurt Andersen
SRE Architect, Blameless
@drkurt
SRE from: North Idaho, ID
Favorite quote: “There is only one root cause: change.” Dave Zwieback, Beyond Blame, O’Reilly, October 2015

“Spanning the gaps between the interfaces and the data that each provider offers increases the difficulty for SRE teams to automate across those multiple providers. These integrations are rarely simple except for the most superficial aspects. Effectively mapping disparate data models together may be the next frontier for SRE in a multi-vendor environment.”

Johnny Boursiquot
SRE, Heroku
@jboursiquot
SRE from: Baltimore, MD
Favorite philosophy: Amor Fati

“Chances are high you’ll need a tailored SRE model for your organization. Be aware that the effort you put into building this custom SRE practice will only bear fruit when all parties affected have a shared understanding of existing shortcomings and how outcomes will be measured and reflected while on the journey. Nothing kills the effectiveness of SRE adoption faster than a misalignment on its aims and a lack of transparency on the value it provides specifically to your organization.”


INCREMENT
KEY FINDING 3

The Shift Toward AIOps Is Slow


AIOps promises to turn mountains of data and inert action into molehills of information and
precise actionability. But this year’s SRE Report showed adoption is slow (when compared to other
monitoring tools in use) and that “the value received from AIOps” covers a wide range of sentiment.

We recommend SREs break down the broad AIOps category into smaller components, such
as event correlation, topology analytics, or smart alerting. Once these components have been
identified, incrementally develop AIOps capabilities through an investment in training and the right
tooling to help manage the continually increasing volume, velocity, and plethora of data sources.



The Value of AIOps
It’s time to explore the value of AIOps and why its uptake is slow. Given the amount of data coming in from monitoring of all different data source types (from infrastructure to application to experience), the challenge of analysis or correlation by a human is enormous.

The promise of specific AIOps components (event correlation, for instance) is huge in helping to parse through big data to arrive quickly at the important information or (even more critical) the prescribed action. However, while you will need AIOps to make sense of all this data and to reduce toil, not all components are equally helpful. For instance, vendor promises of self-healing and automatic remediation should be heavily scrutinized if you are considering them to be part of AIOps’ received value.

AIOps Is Still An Underused Field
SREs are always performing two types of activities: development and operations. This is a form of bi-modal IT, where mode I involves maintaining existing activities and mode II involves exploring and researching the unknown. The result is that SREs are double struck with challenges as they are required to handle constant context switching. This constant switching can make it difficult to accomplish goals in either domain.

One key way to help mitigate the challenges of being multi-mode for mode II (the new and unexplored mode) is AIOps. However, it is clear from this year’s survey results that AIOps is still very much an underused capability.

Monitoring Tool Usage Never Rarely Sometimes Always

Infrastructure monitoring (e.g., disk or CPU) 6% 9% 23% 62%

Network performance monitoring or diagnostics (e.g., latency or saturation) 5% 17% 36% 42%

Application performance monitoring (e.g., tracing or events) 4% 16% 32% 49%

Digital experience monitoring (e.g., Synthetics or RUM) 21% 24% 30% 26%

Artificial intelligence for ITOps (e.g., anomaly detection or self-healing) 39% 27% 23% 12%

Public/social sentiment monitoring 46% 30% 17% 8%

Competitive benchmarking intelligence 54% 22% 15% 9%

[Figure: Received AIOps Value, based on a 1-9 value scale: Low (1-3), Moderate (4-6), High (7-9). Reported shares: 27%, 41%, and 32%.]


AIOps Is At An Inflection Point Toward Becoming Valuable
Availability and reliability are key aspects for ensuring At the same time, clearly not every company is able Today’s monitoring and management strategies need
customer satisfaction and avoiding a negative to get what they need from the monitoring tools they to leverage artificial intelligence (AI) and machine
impact on customer confidence and your company’s have. After all, 53% of SREs said the number one cloud learning (ML) to improve decision making and
brand. There are a number of monitoring tool types application monitoring challenge they face is a lack of automation. Observational and engagement data
that can ensure both availability and reliability. unified visibility across the stack. What is causing this must be analyzed to allow businesses to react in real-
Digital experience monitoring (DEM) solutions are disconnect? time. When combined with automation either during or
an essential part of most companies’ efforts to after analysis, this data will enable continuous ongoing
reduce downtime and ensure ongoing services. Web Large volumes of data are draining the value of improvement and help shift IT operations to working
analytics and other IT monitoring technologies (APM, traditional APM and other monitoring tools and in a proactive and predictive way. This is exactly what
ITIM, NPMD, and AIOps) also play a role in a broader making it impossible for SREs to become proactive. AIOps automation solutions are promising.
monitoring strategy, helping to offer a complete view The demands of the digital business require a modern
of the end user experience. way of managing incidents and service health, but However, its adoption from our survey respondents is
many companies aren’t there yet. not looking optimistic. Why is this?

Cloud Application Monitoring Challenges

53%   Lack of unified visibility across the stack.
33%   Huge effort in maintaining existing tool(s).
27%   Lack of analytics.
22%   Monitoring tools do not scale.
21%   I/We face no cloud application monitoring challenges.
3%    Other



Break AIOps Into Individual Components To Understand Its Value
Unrealistic expectations of AIOps components, such as self-healing and remediation, have led to
disappointing results in production environments and an undervaluing of AIOps as a whole. To develop a
clearer view of what AIOps can offer, we recommend breaking down its individual capabilities, then training
SRE teams in AI and ML. This will make it easier to realize AIOps’ value. By looking at the individual
components and incrementally developing capabilities, you can realize the potential of those that are most
relevant to your business.

Important AIOps use cases for monitoring include:
• Event correlation
• Anomaly detection
• Proactive analysis
• Root cause analysis

When it comes to monitoring, a good deal of value can be derived from deploying AIOps and leveraging these
use cases in combination, to determine unexpected patterns in large volumes of data. You can achieve this
value using a real, automated inference capability.

AIOps is broader than a single monitoring tool. With that in mind, it’s worth taking the time and investing
in training in AI and ML for SRE teams to understand the different domains and how they might work best
together. Note, however, that while there may be no immediate perceived value, the real worth comes in
the medium- to long-term gains.
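
Two of the use cases named above, anomaly detection and event correlation, can be illustrated with a minimal Python sketch. It is not how any particular AIOps product works. It assumes a rolling z-score over a trailing window for anomalies and a simple time-gap rule for grouping events; the function names, window size, threshold, and 60-second gap are all hypothetical choices made for illustration.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indexes of samples that deviate from the trailing window's mean
    by more than `threshold` standard deviations (assumed rule, for illustration)."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as a deviation.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


def correlate_events(events, max_gap_seconds=60):
    """Group (timestamp, source) events: an event joins the current group when it
    occurs within `max_gap_seconds` of the previous event, else it starts a new group."""
    groups = []
    for ts, source in sorted(events):
        if groups and ts - groups[-1][-1][0] <= max_gap_seconds:
            groups[-1].append((ts, source))
        else:
            groups.append([(ts, source)])
    return groups


# Hypothetical response times in milliseconds with one obvious spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 450, 101, 100]
print(find_anomalies(latencies_ms))  # [6]

# Hypothetical alerts as (seconds since start, emitting component).
alerts = [(0, "db"), (20, "api"), (50, "cdn"), (500, "api")]
print(len(correlate_events(alerts)))  # 2
```

Even this toy version shows why the use cases work better in combination: the anomaly check flags the spike, and the correlation step collapses three near-simultaneous alerts into one group to investigate.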

SRE Spotlight


“During the last year, most of us have had to adjust to spending a lot more time in our homes.
We believe this transition to be one of the primary explanations for the significant decrease in how
much of our work we consider to be toil.

As we slowly start recovering from these trying times, some of the work we’ve considered as good
alternatives to being isolated will likely begin to feel like toil again.

We expect site reliability to stay mission-critical moving forward, justifying continued efforts in
eliminating toil. Organizations that properly manage to identify activities suitable for automation and
AIOps will have an excellent opportunity to keep the perceived level of toil low even as we readjust.”

Simon Aronsson
Head of Developer Relations, k6
@0x12b
SRE from: Norrköping, Sweden
Favorite movies: Hackers (1995) and WarGames



SRE Spotlight

“ Today, AIOps solutions elevate SRE skills by automating incident responses. They can proactively
monitor the SRE’s golden signals and measure what really matters for customer experience, but if
not introduced correctly, can backfire, ending up in a more complex system to be managed! So how
can this adoption be simplified? To start with, we need to have a closer look at our current toolset and
evaluate where we have gaps, taking into account common features that current AIOps products
offer, such as baselining, RCA, anomaly detection, event correlation, simulation, etc. It’s all about
getting better at understanding and managing the complexity of your setup and then
integrating automated helpers and actions.”

Gaurav Shukla
SRE, Catchpoint
SRE from: Delhi, India
Favorite: The Pursuit of Happiness, Seven Pounds

“AIOps should be part of any SRE toolkit or any infrastructure managing large scale systems.
Greater adoption is coming. It aligns with some of the challenges I’ve experienced at Twitter with
managing large scale systems and in particular, the vast amounts of data, metrics, and logs we need
to work through. How can we parse through all that as humans and be as effective as our customers
need us to be? Most SREs working at scale are already leveraging machine learning, especially when
it comes to efficiencies around data centers (locations, cooling, and all the things that happen inside it),
for networks and building out infrastructure … Evolving that into AIOps is the next logical step.”

J. Bobby Dorlus
SRE, Twitter
@BobbyD_FL
SRE from: West Palm Beach, FL
Favorite: Independence Day, I, Robot, I am Legend



EXPAND
KEY FINDING 4

Observability Must Include Digital Experience Metrics and Business KPIs
SREs that fail to deliver customer value run the risk of being stuck in an operational toil rut.
Conversely, businesses failing to recognize the importance of SRE activities run the risk of losing
talented employees and a competitive edge.

IT-to-business conversations should start around capabilities — the gateway to positive business
outcomes and a middle point in the value stream. Talking about necessary capabilities allows the
conversation to shift as needed, helping shrink the IT-to-business gap.



Monitoring Data Needs To Pivot Towards Details To Ensure Customer Outcomes
For the final key finding, let’s explore what is driving the use of
monitoring data, and how SREs can expand the boundaries of
observability to include the digital experience and business KPIs.
With this new angle, SREs can become more customer focused.

The responses to the question, “What drives the use of
monitoring data?” resulted in few surprises this year. Augmenting
troubleshooting and root cause analysis was the resounding leader.
Ongoing changes continuously require better root cause analysis
and troubleshooting, so this came as no shock. “Ensuring service
level objectives (SLOs) are met” came in second. While only 49% of
SREs said, “enhance the customer experience” was a major driver
for the use of monitoring data, ensuring SLOs indirectly enhances
customer experience. Thus, these two responses can be viewed
together. A quarter of SREs chose “accelerate our ability to innovate
automation” or “ability to support collaboration efforts” as major
drivers for the use of monitoring data.

As can be seen from the survey results, SREs are pivoting towards
focusing on customer experience, blending business, performance,
and operations insights with their monitoring approach. For SREs to
be successful, they must work with developers, customer intelligence
(CI) pros, CX experts, marketers, and ecommerce professionals. This
type of collaboration will enable them to understand what critical
data needs to be observed and leveraged to predict and populate
business-oriented KPIs.

Monitoring Data Usage Drivers

Driver                                             Minor   Moderate   Major   N/A
Augment troubleshooting and root cause analysis     8%      23%        66%     3%
Ensure service level objectives are met             14%     27%        51%     8%
Enhance the customer experience                     10%     36%        49%     5%
Improve employee productivity                       23%     38%        31%     8%
Provide analytics to our business teams             27%     35%        31%     7%
Accelerate our ability to innovate automation       29%     40%        25%     6%
Ability to support collaboration efforts            31%     34%        25%     10%
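
To make the “ensure service level objectives are met” driver concrete, here is a minimal Python sketch of checking monitoring data against an SLO. It is an illustration under stated assumptions, not a method described in the survey: the request counts, the 99.9% target, and the function names are hypothetical, and the service level indicator is assumed to be the simple ratio of good requests to total requests.

```python
def availability_sli(good_events, total_events):
    """Service level indicator as the fraction of good events.
    Assumption: a period with no traffic counts as fully available."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def slo_report(good_events, total_events, slo_target=0.999):
    """Compare the measured SLI with the SLO target and report how many
    additional failures the target would still tolerate in this period."""
    sli = availability_sli(good_events, total_events)
    allowed_failures = (1 - slo_target) * total_events
    actual_failures = total_events - good_events
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "failures_remaining": allowed_failures - actual_failures,
    }


# Hypothetical month: one million requests, 500 of them failed.
report = slo_report(good_events=999_500, total_events=1_000_000)
print(report["slo_met"])  # True
```

A report like this turns raw monitoring counts into a statement a business counterpart can read directly: the objective was met, and roughly how much room was left.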



Expand Observability Boundaries To Enhance Customer Experience
In a world where businesses are rapidly implementing digital transformation initiatives, SREs
must expand their observability boundaries to include digital experience and business KPI data.
Since SREs are innovation providers, they are not bound by traditional definitions. Therefore, this is a
business opportunity to take evolutionary concepts, in this case observability, and apply them to more
than just an IT domain.

When asked what data sources feed SRE frameworks, again, there were few surprises. Application
monitoring came out on top, with infrastructure monitoring a close second. What was interesting was
the rare use of benchmarking intelligence and public sentiment/social media monitoring. Forward-looking
SREs will include digital experience data (both human and machine), such as social sentiment data,
benchmarking data, and that which directly applies to business KPIs. Why?

IT needs data from the end-user point of view, to fully measure quality and consistency of experience
delivery. Outside-in monitoring via digital experience monitoring (DEM) allows businesses to look at
what a customer or end user is seeing. In this way, organizations can determine if the application is
delivering what the customer wants and expects. A poor customer experience directly impacts a
company’s reputation and bottom line. That’s why ensuring a good customer experience aligns directly
with overall business goals.

By including a wider range of monitoring tools, such as those explicitly focused on business perception
and ranking, SREs can better realize their role within the organization. This is true, not just in terms of
maintaining reliable systems, but also in terms of helping realize greater value for the business.

On a similar note, more SREs than not are choosing to maintain traditional monitoring habits by focusing
on inside-out monitoring versus outside-in. Measuring whether the servers are up and the lights in your
NOC are green is a narrowly focused surface area. On the other hand, monitoring to ensure customer
experience across the entire delivery chain is more broadly focused and will ensure a greater alignment
with business goals.
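
The difference between inside-out and outside-in monitoring described above can be sketched in a few lines of Python. This is a simplified illustration, not a description of any DEM product: the `ProbeResult` type, the probe locations, the 3,000 ms load-time limit, and the 95% ratio are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """One measurement taken from an end-user vantage point (hypothetical shape)."""
    location: str
    succeeded: bool
    load_time_ms: float


def inside_out_healthy(server_up_flags):
    """Inside-out view: healthy when every server reports that it is up."""
    return all(server_up_flags)


def outside_in_healthy(probes, max_load_time_ms=3000.0, min_good_ratio=0.95):
    """Outside-in view: healthy when enough probes both succeed and load fast enough."""
    if not probes:
        return False  # assumption: no end-user data means health is unknown, not good
    good = sum(1 for p in probes if p.succeeded and p.load_time_ms <= max_load_time_ms)
    return good / len(probes) >= min_good_ratio


servers = [True, True, True]
probes = [
    ProbeResult("New York", True, 1200.0),
    ProbeResult("London", True, 1500.0),
    ProbeResult("Delhi", True, 8000.0),   # page loads, but far too slowly
    ProbeResult("Sydney", True, 1400.0),
]

print(inside_out_healthy(servers))   # True
print(outside_in_healthy(probes))    # False
```

In this hypothetical case every server is up, yet a quarter of the sampled user locations see an unacceptably slow page, which is exactly the kind of gap an inside-out view alone would miss.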

Observability Data Sources

84%   Application monitoring
81%   Infrastructure monitoring
69%   Network monitoring
48%   Front-end user experience monitoring
38%   Client-side device or endpoint monitoring
13%   Collective benchmarking intelligence
7%    Public sentiment/social media monitoring



SREs Must Connect Monitoring Practices To Customer Value Realization
SREs are often inwardly focused on operational conversations
and making sure services are reliable. SREs place a great deal
of importance on the left three elements of the table on this
page: inside-in monitoring of service level indicators, helping
to extend reach (using third-party delivery chain components
like CDN), and working toward reliable services (through use of
high-availability, failover architectures, or chaos exercises).

What they are not doing, however, is sufficiently aligning
themselves with business outcomes or realizing customer value.
This is the direction in which business leaders typically take the
conversation. They make more outward-facing, value-based
decisions like how to land new logos or retain existing revenue.
The result is an often large and frustrating gap or the less-than-
optimal use of resources to accomplish goals or objectives.

To bridge this gap, SRE conversations should start with
capabilities. For example, when asking for budget, instead
of saying, “Can we have $$$ to buy Google Docs?”, a more
effective way to reframe the question might be, “Can we have
$$$ to increase our ability to quickly collaborate with external
partners on large projects?”

In addition to reducing frustration, using a value chain path
increases the ability to reproduce results. By having a path that
leads to realizing value, SREs and businesses alike can scale
their activities and avoid “blind luck” scenarios where they are
not quite sure how value was realized (even though they must
reproduce that value over and over).

Successful SRE Implementation Drivers

60%   How quickly we resolve incidents
43%   The amount of time between failures
41%   How quickly we do root cause incident analysis
40%   How quickly we push product updates
33%   How quickly our business can expand to new markets
22%   How quickly we can understand the cause of social media sentiment

Value chain, from IT-Focused (left) to Business Value-Focused (right):

1. Uptime and Other Service Level Indicators
2. Customer Experiences and Reach
3. Reliability and Resiliency
4. Capabilities
5. Business Outcomes
6. Value Realization

Capabilities are the gateway to positive business outcomes.



SRE Spotlight

“ In addition to understanding how business value ties back to SRE SLIs, it is important to close the
feedback loop by communicating actual customer sentiment and felt value back to the engineers working
on improvements in these areas.

Recently at Improbable, we received some feedback from a customer around how happy they’d been with
our team addressing key SRE metrics (e.g., shorter deployment times and reduced number of incidents). Our
proactiveness in soliciting feedback via customer satisfaction surveys and ad hoc meetings was called out
as a huge plus, allowing them to focus on improving their customer experience and increasing their scale
significantly. Instead of that feedback stopping at the Account Manager, we ensured we had a closed loop
process, so our engineers could see the impact of their work and motivate them to anticipate customer
needs in future development.”

Tamara Miner
Engineering Manager, Improbable
@tammasaurusrex
SRE from: London, UK
Favorite: Erin Meyer’s The Culture Map

“People don’t have time to analyze data all the time; they’re busy dealing with incidents and trying to
make improvements. And all those different monitoring tools feeding the observability frameworks… It seems
there’s a gap: we’ve got the data, we’ve got the performance model, but we haven’t connected the two. The
observability and AIOps framework vendors need to up their game for the SRE personas and create
out-of-the-box patterns for the management of these all-important SRE metrics.”

Helen Beal
Chief Ambassador, DevOps Institute
@BealHelen
SRE from: Chichester, UK
Favorite: Yann Martel’s Life of Pi



Conclusion
With the exception of Google’s research, we believe this to be the most data-backed report of its kind. We
looked at the data for each individual question and then performed cross-question correlations, some of
which did not make it into the report for the sake of brevity. Examples include:

• As people use more multi-provider third-party strategies, they buy more than they build.
• The more decentralized, the more time on call.
• Xbox users have a greater preference for Macs over their PlayStation counterparts.

In our desire to evolve the SRE field, we’ll publish additional findings from the report over time. To stay up
on our findings, please subscribe to our blog and follow us on LinkedIn or Twitter.

One of the underlying themes for this year’s Report was to look at the five major monitoring components
(digital experience, application, infrastructure, AIOps, and network) and understand how the promise of
AIOps will help them fit together. In order to run, you must first walk through drafting an established line in
the sand, i.e., baseline at the outset so you know where your journeys begin.

As we’ve determined, SREs must attach to business value conversations. It’s also important to consider the
different perspectives of, and empathize with, different teams. Your IT counterparts, for example, may not
have the development and engineering resources that a well-honed, well-tuned SRE may have. Consider
offering talent sharing programs, because your development experience may exponentially increase their
ability to improve the efficiency of their programs. In other words, an ounce of development to you is worth
a pound of development to others.

This brings us full circle to one of our core SRE philosophies, which is the common, passionate desire to
solve complex problems.

We hereby declare by the power vested in us (by no one at all), you are free to be an SRE.



Demographics and Firmographics
Respondents to the SRE Survey came from all over the world with almost half representing North America, a quarter
representing Asia, and a fifth representing Europe. Other locations represented included Oceania and South America.
Respondents came from a wide range of industries, with technology leading the pack, followed by financial services,
then media or entertainment. Over half the respondents cited IT operations as their primary area of expertise, closely
followed by application or software development/engineering and IT infrastructure. Sixty percent of respondents said
they had more than one area of expertise.

What is your role?
SRE individual practitioner/subject matter expert    47%
Team leader/supervisor                               16%
Manager                                              14%
Senior management (Director, Vice President)         9%
External consultant/contractor/coach                 5%
C-Suite executive                                    3%
Other                                                6%

What is your primary area of expertise?
IT Operations                                        52%
Application or Software Development/Engineering      50%
IT Infrastructure                                    47%
Architect                                            27%
Network Operations or Engineering                    20%
Security                                             16%
Database Engineer                                    11%
Service Desk or Support                              9%
CIO, CTO, CXO, or other C-Suite Executive            4%
Other                                                4%

In what industry is your organization?
Technology or Technology Provider                    41%
Financial Services                                   13%
Media or Entertainment                               8%
Telecom                                              6%
Manufacturing                                        5%
Consumer Packaged Goods or Retail                    5%
Healthcare or Chemicals                              4%
Professional Services or Consulting                  4%
Energy                                               3%
Transportation                                       3%
Government or Non-profit                             3%
Travel or Accommodation                              <1%
Other                                                4%

How many employees does your company have?
One to 100                                           18%
101 - 1,000                                          29%
1,001 - 10,000                                       23%
10,001 - 100,000                                     21%
More than 100,000                                    9%

How many SREs are in your organization?
One to ten                                           52%
11 to 100                                            34%
101 to 1,000                                         11%
More than 1,000                                      4%

Where are you located?
North America                                        48%
Asia                                                 24%
Europe                                               21%
Oceania                                              3%
South America                                        3%
Africa                                               1%



Bios
Kurt Andersen is currently the SRE Architect at Blameless. His writings appear in various O’Reilly books
including What is SRE? and he serves on the Board of Directors for USENIX and as part of the steering
committee for the worldwide SREcon conferences.

Simon Aronsson is the head of DevRel at K6. He’s a long-time reliability and DevOps nut helping software
teams build reliable, performant software.

Helen Beal is a DevOps and Ways of Working coach, Chief Ambassador at DevOps Institute, and an ambassador
for the Continuous Delivery Foundation. She is the Chair of the Value Stream Management Consortium and
provides strategic advisory services to DevOps industry leaders such as Plutora and Moogsoft.

Johnny Boursiquot is a multi-disciplined engineer at Heroku with a love for teaching and community-building.

Mehdi Daoudi is the driving force behind Catchpoint’s promise to help the Internet deliver on its potential
for the human race.

J. Bobby Dorlus is a seasoned Systems Engineer with over 18 years of experience working for companies like
Twitter and Citrix. In his eight-year journey at Twitter, he has worked on teams that are responsible for the
foundational platform that most services @Twitter rely on. He is known for his historical knowledge of
Twitter’s technical journey.

Anna Jones is a fanatic about conveying technologists’ pains and problems through the written word.

Tamara Miner has worked on infrastructure and developer tools for over 15 years in the US and Europe.
She is currently the Engineering Manager of Improbable’s Partner Engineering team in London. She is a
recipient of the Forbes 30 Under 30 and the Microsoft XBox Women in Gaming Rising Star awards.

Eveline Oehrlich is Chief Research and Content Officer at the DevOps Institute responsible for research and
content strategy. Previously, she held the position of VP, Research Director and Principal Analyst at
Forrester Research delivering research, advisory and consulting for software vendors, system integrators,
service providers, and IT enterprise organizations.

Sanjeev Sharma is an internationally renowned digital transformation leader with a track record in the
areas of DevOps, DataOps and Cloud Adoption, and Application and Data Modernization. Sanjeev is a former
IBM Distinguished Engineer, and the former Global Field CTO of a VC backed startup. He is currently Head
of Platform Engineering at Truist Financial.

Gaurav Shukla is an experienced and accomplished Linux Systems, Cloud and Site Reliability Engineer with a
deep passion towards observability, automation, and designing of fault tolerant systems.

Leo Vasiliou is a former ITOps practitioner and current chart and graph whisperer.

Jaime Woo is a writer, SRE educator, and former molecular biologist and has studied the impact of stress and
burnout in SRE since 2017. He continues to focus on mental health and well-being, working toward a
Certificate of MBSR (mindfulness-based stress reduction) Facilitation from one of the field’s foremost programs.



About Catchpoint
Catchpoint is the enterprise-proven Digital Experience Observability industry leader, empowering teams to confidently
own the end-user experience. We provide unparalleled visibility and insight into every critical system that collectively
produces and delivers digital experiences to customers and employees. Business leaders like Google, L’Oréal, Verizon,
Oracle, Equinix, Honeywell, and Priceline trust Catchpoint to proactively and rapidly detect and repair problems before
they impact users. With the largest observability network, broadest capabilities, and highest data quality in the industry,
Catchpoint is the ally you need to deliver on the unrelenting user experience expectations of today and tomorrow.
Learn more at www.catchpoint.com.

Methodology
For our fourth annual SRE Survey, we received 278 responses. All responses were organic and non-paid.
The survey was open for the month of April 2021. It presented a list of 40 questions that covered a wide range of topics, including the split between development and operational activities, the approach taken to tool usage, and the
type of monitoring tools and practices used. Qualifying questions such as, “What activities do you perform?” had both operational and development choices as answers, to build confidence in the quality of the received answers.
When analyzing the results of the survey, correlations between questions were actively made. Correlation does not necessarily mean causation, but it can yield fascinating insight into the nature of SRE approaches and activities.
To ensure efficacy and integrity of correlated questions, the same mathematical base was applied.
For any questions about this report or the data, please contact [email protected].

© 2021 Catchpoint Systems, Inc. All rights reserved. 210004-v3
