Catchpoint 2021 SRE Report
FREE TO BE AN SRE
Contributing Partners: DevOps Institute
Foreword
How can leaders grow digital business services without incurring the problems associated with rapidly increasing scale, such as increased cost, risk, and complexity? The obvious part of a larger answer (especially given the name of this report) is to implement site reliability engineering (SRE).

SRE is a discipline which fundamentally asks software engineers to design operations teams. It applies aspects of software engineering to infrastructure and operations problems. The result: ultra-scalable and exceptionally reliable distributed software systems.

SRE complements DevOps by measuring and achieving the reliability of applications and services working in production and DevOps infrastructures in a prescribed manner. It combines the use of error budgets, team relationships brokered by collaboration, ops-as-code, and reliability control practices to ensure deployments meet Service Level Objectives (SLOs).

In analyzing the data from almost 300 practitioners and experts who participated in our survey on the state of SRE, we found some familiar trends and some provocative anti-patterns. As you go through the results, you will also discover interesting changes from the previous year’s survey, not to mention some noteworthy correlations.

Why should you read this report?
• Find out about the expected patterns and surprising anti-patterns of SRE.
• Gain insight into the industry’s most comprehensive SRE tenet baselining data.
• Understand usage of the five major monitoring components (digital experience, application, infrastructure, AIOps, and network).
• Learn from an example value stream to bridge the IT-to-business conversation gap.

Before continuing, you should relax your baselining efforts when trying to match the Google definition of SRE. There are too many dimensions for a one-size-fits-all approach.

You are hereby free to be an SRE.

Mehdi Daoudi
CEO, Catchpoint
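
The foreword's reference to error budgets and SLOs can be illustrated with a little arithmetic. The following is a minimal sketch, not a method taken from the survey: it assumes an availability SLO measured over requests, and the function names and the sample numbers are hypothetical.

```python
def error_budget(slo_target, total_requests):
    """Number of failed requests an availability SLO allows (assumed request-based SLO)."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target leaves no budget at all.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / budget


# Illustrative numbers only: a 99.9% SLO over one million requests, 250 failures.
remaining = budget_remaining(0.999, 1_000_000, 250)
print(f"error budget remaining: {remaining:.0%}")
```

A team might use a figure like this to decide whether a risky deployment can go ahead or should wait until the budget recovers.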
As we consider the past year’s global events, current emerging trends, and future thoughts on how SRE
implementations will change the world, we hereby declare:
SRE Report Passioneers
Thank you to Leo Vasiliou, Eveline Oehrlich, and Anna Jones. This report would not have been written without your passion and inspiration.

SRE Report Contributors
Thank you to Holly Chessman, Laura Meyer, Neil Myers, Gordana Neskovic, and Stela Udovicic. Your contributions shaped this entire report.

Supporting Partners
Thank you from all of us at Catchpoint to DevOps Institute and VMware Tanzu. The sum of our partners is greater than the whole.

Special Thanks
Thank you to our industry experts: Kurt Andersen, Simon Aronsson, Helen Beal, Johnny Boursiquot, J. Bobby Dorlus, Tamara Miner, Sanjeev Sharma, Gaurav Shukla, and Jaime Woo. Your wisdom is an invaluable part of this year’s report.
Many companies don’t establish comparative baselines for SRE conditions
If you don’t have a baseline, it’s hard to tell what improves. For example, our survey showed that the average drop in toil from last year was 15%. Was this due to COVID-19, people doing less busy work at home, and the work feeling more meaningful? Will self-reported toil rise next year as offices open and the pain, problems, and challenges of the year before are reintroduced?

Companies are using a multi-provider strategy across the board
From cloud to DNS to API to CDN, it’s all about the numbers. But managing a variety of providers is complicated, making the need for Platform Ops rise to the fore. While 53% of SREs rank “providing trainings on third-party platform capabilities” as a minor or not applicable activity, maybe it’s time to re-think that stance.

AI is exciting but it has not yet been fully adopted or embraced
53% of SREs say the number one cloud application monitoring challenge is unified visibility across the stack. When monitoring and management strategies get to the point that they can depend on artificial intelligence and machine learning to improve decision making and automation, that problem will be vastly reduced. But for the most part, we’re not there yet.

SREs should look to expand the boundaries of observability so they can become more customer-focused
Blending business, performance, and operations insight with monitoring is key to growing the SRE role. To make SRE successful, SREs need to tie their success back to the business and get business leaders onboard with SRE practices. That will be of huge benefit to SREs, businesses, and customers alike.
We recommend companies ensure they baseline the amounts of time being spent by SREs on any
single category of activity. Then, compare internal baselines with industry data, such as this SRE Report.
If the baseline is leaning too far from the “50/50 split,” with ops predominating, the nature of SRE
work needs to be rethought. This starts with the measurement of how SRE time is spent (since what is
measured is what will improve).
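
One way to start such a baseline is to tag logged hours by activity category and compute the resulting dev/ops split. The sketch below is illustrative only: the category names, the 10-point tolerance around the 50/50 split, and the sample week are all assumptions, not figures from the survey.

```python
from collections import defaultdict

# Hypothetical activity categories, each tagged as "dev" or "ops" work.
CATEGORY_KIND = {
    "feature_development": "dev",
    "automation_tooling": "dev",
    "incident_response": "ops",
    "manual_deploys": "ops",
    "on_call": "ops",
}


def time_split(entries):
    """Return the share of logged hours per kind ("dev"/"ops") as fractions."""
    hours = defaultdict(float)
    for category, logged_hours in entries:
        hours[CATEGORY_KIND[category]] += logged_hours
    total = sum(hours.values())
    if total == 0:
        raise ValueError("no hours logged")
    return {kind: h / total for kind, h in hours.items()}


def leans_too_far_to_ops(split, tolerance=0.10):
    """Flag a split whose ops share exceeds 50% by more than `tolerance` (assumed threshold)."""
    return split.get("ops", 0.0) > 0.5 + tolerance


# Made-up week for one SRE, for illustration only.
week = [
    ("feature_development", 8),
    ("automation_tooling", 6),
    ("incident_response", 12),
    ("manual_deploys", 6),
    ("on_call", 8),
]
split = time_split(week)
print(split, leans_too_far_to_ops(split))
```

Tracking the same calculation per category over time gives an internal baseline that can then be compared against industry data.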
This is why the desire to “automate ALL the things!” is a core SRE tenet. People want to give themselves the best chance of focusing on value-based activities. In other words, for SREs, the focus of automation is less about deploying code or releases as fast as possible. Instead, it is aimed at delivering value to the customer with a risk-mitigating, scalable approach that balances the need for agility against the need for stability.

For the first key finding, let’s explore the core tenets of SRE, including the amount of time SREs are spending on toil, the Dev versus Ops split, and the use of error budgets. After all, how can you measure improvements without a baseline?

Experimenting or receiving training to expand knowledge or skills    10%  23%  40%  27%
Authoring business processes, rules, or best practices               9%   20%  45%  26%
Performing audits of usage/cost allocation                           13%  35%  31%  21%
Spinning up new hosts/instances                                      17%  32%  30%  21%
Planning release roadmaps                                            22%  30%  28%  20%
Performing chaos engineering exercises                               24%  33%  26%  17%
Causes Of Toil                                  Minor  Moderate  Major  N/A
Too much technical debt                         17%    36%       42%    5%
Priorities or goals are not aligned             21%    40%       32%    7%
The business value to fix is hard to realize    23%    40%       28%    9%
Lack of training or support                     28%    43%       20%    9%
Lack of collaboration                           35%    35%       17%    13%
COVID                                           49%    15%       9%     27%

Why Did Toil Numbers Drop?
Only 9% of SREs said COVID was a major factor for high amounts of toil. This leads us to theorize that the universal reported drop in toil is related to the global shift to working from home over the last year. Did the shift in related working conditions and priorities mean that work felt more meaningful and therefore less like toil? That could be the case, but we’re not sure.

So, will toil remain lower? Probably not. We expect the amount to rise again next year when a hybrid office is more of a reality. Note that the “hybrid office” could involve employees continuing to work from home, returning to the office full- or part-time, or some combination of the two.
[Figure: “time spent on call”; axis labels: “% of time” and “% of SREs”]

the elusive 50/50 dev versus ops split. The drive to solve complex problems should always outweigh semi-arbitrary decisions, such as how an organization is structured or whether a business should mandate the use of open source for all tool selection.

SRE is fundamentally more about people and processes than technology. It’s a good sign that SREs across the globe are spending time on development activities. The purpose of SRE is to bridge the gap between platform design, development, and operational execution. This means SREs have to shift-left (which they are doing) into development to share the wisdom of production to those teams delivering value-adding products and services to that production environment.
“Most organizations and enterprises, no matter how small or large, are not like Google. Small and medium organizations don’t have Google’s scale. And large enterprises are not homogenous in their technology stacks or architecture like Google. So, yes, they all need SRE, just not in the same manner as Google does. They need to automate repetitive, typical tasks in operations in order to improve resilience and reliability and can learn from Google’s take on SRE. They just cannot, and should not, emulate how Google implemented SRE.”
Sanjeev Sharma
SVP, Head of Automation and Platform Engineering, Truist Financial
@sd_architect
SRE from: Ashburn, VA
Favorite: All things SciFi and Superhero, from James T. Kirk to Loki

“If Google is correct that too much toil leads to frustration, boredom, and burnout, could it be that during a pandemic, with its enduring waves of stress, SREs chose to spend their time elsewhere? Either way, did anyone even notice, and what does that say about toil?”
Jaime Woo
Co-editor, 97 Things Every SRE Should Know
@jaimewoo
SRE from: Toronto, Canada
Favorite: Haruki Murakami’s What I Talk About When I Talk About Running
[Figure legend: DNS, API, CDN, Cloud, Data Center; Low, Moderate, High, N/A]
CENTRALIZED SRE BUT SUPPORTING VARIOUS PRODUCTS OR PLATFORMS.
An SRE is typically associated with various products, applications, or platforms. These types of teams have SRE responsibility across a specific set of services and workflows with key boundaries. Here the SREs are functioning as an enabler to help the product or platform teams within the technical or product domain. The work of these SRE team members could include a variety of projects, such as the onboarding of specific technologies like Kubernetes, building out a mature infrastructure service mesh, working on continuous integration activities, or providing 24x7 support for specific applications and services.

DECENTRALIZED SRE ASSIGNED BY PRODUCT OR ANOTHER ATTRIBUTE.
These SREs are typically embedded into application/service teams to support the application team and are responsible for automating operations (runbooks) and deployment work, including monitoring and metrics population, as well as owning incident management around application issues. The advantage of this decentralized model is a solid integration within the business, with strong business value alignment and a new career path evolution for the embedded SRE.

THE SRE AUTOMATION TOOLS TEAM ESTABLISHES AND GUIDES IT AUTOMATION BEST PRACTICES.
Another model is that of the SRE tools team, which supports its software development counterparts in improving system reliability through automation work, leveraging existing tools, or creating scripts. Possible work tasks could include focusing on monitoring, observability, and analysis across a particular application stack or the instrumentation of key infrastructure ecosystems as a best practice for other teams to leverage. SREs can establish best practices, benchmarks, and automation architectures across the entire enterprise.
[Figure: SRE Topology/Organization Size (% of respondents); legend includes “Decentralized by business product/service”]
[Figure: SRE Topology/Number of Employees (% of respondents)]

One way of preventing silos from occurring is by creating a Platform Operations team. Such a plan allows decentralized teams the autonomy and flexibility to do their work, while the centralized Platform Ops team provides capabilities to its decentralized colleagues. These normalized capabilities can then be integrated across workflows.
different pipelines and products. A Platform Ops
team can focus on creating self-service capabilities by
applying automation for application developers. This
process can encompass everything needed to build an
automated DevOps value stream.
“Spanning the gaps between the interfaces and the data that each provider offers increases the difficulty for SRE teams to automate across those multiple providers. These integrations are rarely simple except for the most superficial aspects. Effectively mapping disparate data models together may be the next frontier for SRE in a multi-vendor environment.”
Kurt Andersen
SRE Architect, Blameless
@drkurt
SRE from: North Idaho, ID
Favorite quote: “There is only one root cause: change.” Dave Zwieback, Beyond Blame, O’Reilly, October 2015

“Chances are high you’ll need a tailored SRE model for your organization. Be aware that the effort you put into building this custom SRE practice will only bear fruit when all parties affected have a shared understanding of existing shortcomings and how outcomes will be measured and reflected while on the journey. Nothing kills the effectiveness of SRE adoption faster than a misalignment on its aims and a lack of transparency on the value it provides specifically to your organization.”
Johnny Boursiquot
SRE, Heroku
@jboursiquot
SRE from: Baltimore, MD
Favorite philosophy: Amor Fati
We recommend SREs break down the broad AIOps category into smaller components, such
as event correlation, topology analytics, or smart alerting. Once these components have been
identified, incrementally develop AIOps capabilities through an investment in training and the right
tooling to help manage the continually increasing volume, velocity, and plethora of data sources.
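
To make one of these components concrete, event correlation can be sketched as grouping alerts that hit the same resource within a short time window. This is a deliberately simplified illustration, not a description of any vendor's AIOps product: the `Event` fields, the 60-second window, and the sample alerts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float  # seconds since some reference point
    resource: str     # e.g. a host or service name
    message: str


def correlate(events, window_seconds=60.0):
    """Group events on the same resource whose gaps are within the window."""
    groups = []
    latest_group = {}  # resource -> most recent group for that resource
    for event in sorted(events, key=lambda e: e.timestamp):
        group = latest_group.get(event.resource)
        if group and event.timestamp - group[-1].timestamp <= window_seconds:
            group.append(event)  # close enough in time: same candidate incident
        else:
            group = [event]      # otherwise start a new candidate incident
            groups.append(group)
            latest_group[event.resource] = group
    return groups


# Made-up alerts, for illustration only.
events = [
    Event(0, "db-1", "high latency"),
    Event(20, "db-1", "connection errors"),
    Event(30, "web-1", "5xx spike"),
    Event(500, "db-1", "disk nearly full"),
]
incidents = correlate(events)
for incident in incidents:
    print(incident[0].resource, [e.message for e in incident])
```

Real correlation engines typically also use topology and causal relationships between resources; a time-and-resource rule like this one is only a starting point for judging where tooling or training is needed.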
The promise of specific AIOps components, event correlation for instance, is huge in helping to parse through big data to arrive quickly at the important information or (even more critical) the prescribed action. However, while you will need AIOps to make sense of all this data and to reduce toil, not all components are equally helpful. For instance, vendor promises of self-healing and automatic remediation should be heavily scrutinized if you are considering them to be part of AIOps’ received value.

Received AIOps Value
Application performance monitoring (e.g., tracing or events)                 4%   16%  32%  49%
Digital experience monitoring (e.g., Synthetics or RUM)                      21%  24%  30%  26%
Artificial intelligence for ITOps (e.g., anomaly detection or self-healing)  39%  27%  23%  12%
Public/social sentiment monitoring
• Lack of unified visibility across the stack.
• Huge effort in maintaining existing tool(s).
• Lack of analytics.
• Monitoring tools do not scale.
• I/We face no cloud application monitoring challenges.
• Other
SRE Spotlight
“During the last year, most of us have had to adjust to spending a lot more time in our homes. We believe this transition to be one of the primary explanations for the significant decrease in how much of our work we consider to be toil.

As we slowly start recovering from these trying times, some of the work we’ve considered as good alternatives to being isolated will likely begin to feel like toil again.

We expect site reliability to stay mission-critical moving forward, justifying continued efforts in eliminating toil. Organizations that properly manage to identify activities suitable for automation and AIOps will have an excellent opportunity to keep the perceived level of toil low even as we readjust.”
Simon Aronsson
Head of Developer Relations, k6
@0x12b
SRE from: Norrköping, Sweden
Favorite movies: Hackers (1995) and WarGames
“ Today, AIOps solutions elevate SRE skills by automating incident responses. They can proactively
monitor the SRE’s golden signals and measure what really matters for customer experience, but if
not introduced correctly, can backfire, ending up in a more complex system to be managed! So how
can this adoption be simplified? To start with, we need to have a closer look at our current toolset and
evaluate where we have gaps, taking into account common features that current AIOps products
offer, such as baselining, RCA, anomaly detection, event correlation, simulation, etc. It’s all about
getting better at understanding and managing the complexity of your setup and then
integrating automated helpers and actions.”
Gaurav Shukla
SRE, Catchpoint
SRE from: Delhi, India
Favorite: The Pursuit of Happiness, Seven Pounds
“AIOps should be part of any SRE toolkit or any infrastructure managing large scale systems. Greater adoption is coming. It aligns with some of the challenges I’ve experienced at Twitter with managing large scale systems and in particular, the vast amounts of data, metrics, and logs we need to work through. How can we parse through all that as humans and be as effective as our customers need us to be? Most SREs working at scale are already leveraging machine learning, especially when it comes to efficiencies around data centers (locations, cooling, and all the things that happen inside it), for networks and building out infrastructure … Evolving that into AIOps is the next logical step.”
J. Bobby Dorlus
SRE, Twitter
@BobbyD_FL
SRE from: West Palm Beach, FL
Favorite: Independence Day, I, Robot, I am Legend
IT-to-business conversations should start around capabilities — the gateway to positive business
outcomes and a middle point in the value stream. Talking about necessary capabilities allows the
conversation to shift as needed, helping shrink the IT-to-business gap.
This is the direction in which business leaders typically take the conversation. They make more outward-facing, value-based decisions like how to land new logos or retain existing revenue. The result is an often large and frustrating gap or the less-than-optimal use of resources to accomplish goals or objectives.

their activities and avoid “blind luck” scenarios where they are not quite sure how value was realized (even though they must reproduce that value over and over).

41%  How quickly we do root cause incident analysis
40%  How quickly we push product updates
33%  How quickly our business can expand to new markets
22%  How quickly we can understand the cause of social media sentiment

Value stream stages, listed from IT-Focused to Business Value-Focused:
• Uptime and Other Service Level Indicators
• Customer Experiences and Reach
• Reliability and Resiliency
• Capabilities
• Business Outcomes
• Value Realization
“In addition to understanding how business value ties back to SRE SLIs, it is important to close the feedback loop by communicating actual customer sentiment and felt value back to the engineers working on improvements in these areas.

Recently at Improbable, we received some feedback from a customer around how happy they’d been with our team addressing key SRE metrics (e.g., shorter deployment times and reduced number of incidents). Our proactiveness in soliciting feedback via customer satisfaction surveys and ad hoc meetings was called out as a huge plus, allowing them to focus on improving their customer experience and increasing their scale significantly. Instead of that feedback stopping at the Account Manager, we ensured we had a closed loop process, so our engineers could see the impact of their work and motivate them to anticipate customer needs in future development.”
Tamara Miner
Engineering Manager, Improbable
@tammasaurusrex
SRE from: London, UK
Favorite: Erin Meyer’s The Culture Map
“People don’t have time to analyze data all the time; they’re busy dealing with incidents and trying to make improvements. And all those different monitoring tools feeding the observability frameworks… It seems there’s a gap: we’ve got the data, we’ve got the performance model, but we haven’t connected the two. The observability and AIOps framework vendors need to up their game for the SRE personas and create out-of-the-box patterns for the management of these all-important SRE metrics.”
Helen Beal
Chief Ambassador, DevOps Institute
@BealHelen
SRE from: Chichester, UK
Favorite: Yann Martel’s Life of Pi
• As people use more multi-provider third-party strategies, they buy more than they build.
• The more decentralized, the more time on call.
• Xbox users have a greater preference for Macs over their PlayStation counterparts.

In our desire to evolve the SRE field, we’ll publish additional findings from the report over time. To stay up on our findings, please subscribe to our blog and follow us on LinkedIn or Twitter.

One of the underlying themes for this year’s Report was to look at the five major monitoring components (digital experience, application, infrastructure, AIOps, and network) and understand how the promise of AIOps will help them fit together. In order to run, you must first walk through drafting an established line in the sand, i.e., baseline at the outset so you know where your journeys begin.

As we’ve determined, SREs must attach to business value conversations. It’s also important to consider the different perspectives of, and empathize with, different teams. Your IT counterparts, for example, may not have the development and engineering resources that a well-honed, well-tuned SRE may have. Consider offering talent sharing programs, because your development experience may exponentially increase their ability to improve the efficiency of their programs. In other words, an ounce of development to you is worth a pound of development to others.

This brings us full circle to one of our core SRE philosophies, which is the common, passionate desire to solve complex problems.

We hereby declare by the power vested in us (by no one at all), you are free to be an SRE.
then media or entertainment. Over half the respondents cited IT operations as their primary area of expertise, closely followed by application or software development/engineering and IT infrastructure. Sixty percent of respondents said they had more than one area of expertise.

Industry
Financial Services                                  13%
Media or Entertainment                              8%
Telecom                                             6%
Manufacturing                                       5%
Consumer Packaged Goods or Retail                   5%
Healthcare or Chemicals                             4%
Professional Services or Consulting                 4%
Energy                                              3%
Transportation                                      3%
Government or Non-profit                            3%
Travel or Accommodation                             <1%
Other                                               4%

What is your role?
SRE individual practitioner/subject matter expert   47%
Team leader/supervisor                              16%
Manager                                             14%
Senior management (Director, Vice President)        9%
External consultant/contractor/coach                5%
C-Suite executive                                   3%
Other                                               6%

What is your primary area of expertise?
IT Operations                                       52%
Application or Software Development/Engineering     50%
IT Infrastructure                                   47%
Architect                                           27%
Network Operations or Engineering                   20%
Security                                            16%
Database Engineer                                   11%
Service Desk or Support                             9%
CIO, CTO, CXO, or other C-Suite Executive           4%
Other                                               4%

How many employees does your company have?
One to 100                                          18%
101 - 1,000                                         29%
1,001 - 10,000                                      23%
10,001 - 100,000                                    21%
More than 100,000                                   9%

How many SREs are in your organization?
One to ten                                          52%
11 to 100                                           34%
101 to 1,000                                        11%
More than 1,000                                     4%

Where are you located?
North America                                       48%
Asia                                                24%
Europe                                              21%
Oceania                                             3%
South America                                       3%
Africa                                              1%
Methodology
For our fourth annual SRE Survey, we received 278 responses. All responses were organic and non-paid.
The survey was open for the month of April 2021. It presented a list of 40 questions that covered a wide range of topics, including the split between development and operational activities, the approach taken to tool usage, and the
type of monitoring tools and practices used. Qualifying questions such as, “What activities do you perform?” had both operational and development choices as answers, to build confidence in the quality of the received answers.
When analyzing the results of the survey, correlations between questions were actively made. Correlation does not necessarily mean causation, but it can yield fascinating insight into the nature of SRE approaches and activities.
To ensure efficacy and integrity of correlated questions, the same mathematical base was applied.
For any questions about this report or the data, please contact [email protected].