Site reliability engineering explained

Site Reliability Engineering (SRE) is a discipline that integrates software engineering principles and practices with IT infrastructure and operations^[1] to improve system reliability. SRE overlaps with DevOps, which also combines software development and operational practices.

History

The field of SRE originated at Google with Ben Treynor Sloss,^[2] ^[3] who founded a site reliability team in 2003.^[4] The concept expanded within the software development industry, leading various companies to employ site reliability engineers.^[5] By March 2016, Google had over 1,000 site reliability engineers on staff.^[6] Dedicated SRE teams are common at larger web development companies. DevOps teams sometimes serve the dual purpose of SRE in midsize and smaller companies. Organizations that have adopted the concept include Airbnb, Dropbox, IBM,^[7] LinkedIn,^[8] Netflix, and Wikimedia.^[9]

Definition

Site reliability engineering as a job role may be performed by individual contributors or organized in teams. Site reliability engineers are responsible for a combination of the following within a broader engineering organization: system availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.^[10] Site reliability engineers often have backgrounds in software engineering, system engineering, or system administration.^[11] Focuses of SRE include automation, system design, and improvements to system resilience.

The principles and practices of site reliability engineering can be carried out by existing engineering staff, but a company may eventually hire specialists specifically for the job.

SRE is considered a specific implementation of DevOps;^[12] SRE focuses specifically on building reliable systems, whereas DevOps covers a broader scope.^[13] ^[14] ^[15] Although the two have different focuses, some companies have rebranded their operations teams as SRE teams.

Principles and practices

There have been multiple attempts to define a canonical list of site reliability engineering principles. The following characteristics are included in most definitions:^[16]

Automation of repetitive tasks for cost-effectiveness.
Limiting the pursuit of reliability to pre-defined reliability goals. Defining these goals is itself an SRE practice (see the list of practices below).
Design of systems with a bias toward reducing risks to availability, latency, and efficiency.
Observability: the ability to ask arbitrary questions about a system without having to know ahead of time what to ask.^[17]
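
The observability principle can be illustrated with a minimal sketch, assuming each request is recorded as a structured event (a plain dictionary) with hypothetical fields such as `route`, `region`, and `latency_ms`. Because every event keeps all of its fields, a question nobody anticipated can be answered after the fact by filtering and grouping on any field. The sample events and the `group_mean` helper are illustrative only and are not taken from any particular tool.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical structured events: one dictionary per handled request.
EVENTS = [
    {"route": "/checkout", "region": "eu", "status": 200, "latency_ms": 120},
    {"route": "/checkout", "region": "us", "status": 500, "latency_ms": 900},
    {"route": "/checkout", "region": "us", "status": 200, "latency_ms": 700},
    {"route": "/search", "region": "eu", "status": 200, "latency_ms": 40},
]


def group_mean(events, group_by, value, **filters):
    """Average `value` per distinct `group_by` field, over events matching `filters`."""
    groups = defaultdict(list)
    for event in events:
        # Keep only events whose fields equal every requested filter value.
        if all(event.get(key) == wanted for key, wanted in filters.items()):
            groups[event[group_by]].append(event[value])
    return {key: mean(values) for key, values in groups.items()}


if __name__ == "__main__":
    # A question chosen after the data was collected:
    # "For /checkout only, what is the mean latency per region?"
    print(group_mean(EVENTS, "region", "latency_ms", route="/checkout"))
```

The same helper answers a different question, such as mean latency per status code, without any change to how the events were recorded.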

Site reliability engineering practices also vary widely, but the following are commonly implemented, at least in part:

Toil management, the implementation of the first principle outlined above.
Defining and measuring reliability goals: service level indicators (SLIs), service level objectives (SLOs), and error budgets.
Non-Abstract Large Scale Systems Design (NALSD) with a focus on reliability.
Designing for and implementing observability.
Defining, testing, and running an incident management process.
Capacity planning.
Change and release management, including CI/CD.
Chaos engineering.
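
As a rough illustration of how SLIs, SLOs, and error budgets relate, the sketch below assumes an availability SLI defined as the fraction of successful requests, and an error budget defined as the number of failed requests the SLO tolerates over the same window. The function names and sample numbers are hypothetical; teams choose their own indicators, targets, and measurement windows.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that succeeded in the measurement window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates in the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, good_requests: int, total_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is missed."""
    budget = error_budget(slo_target, total_requests)
    failed = total_requests - good_requests
    return 1.0 - failed / budget


if __name__ == "__main__":
    # Assumed example: a 99.9% SLO over one million requests, 250 of which failed.
    sli = availability_sli(999_750, 1_000_000)
    left = budget_remaining(0.999, 999_750, 1_000_000)
    print(f"SLI: {sli:.5f}, error budget remaining: {left:.0%}")
```

A team following the second principle above might slow risky releases as the remaining budget approaches zero, and spend it on faster change while plenty is left.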

Implementations

SRE teams collaborate with other departments within an organization to put these principles into effect. Common ways of structuring SRE teams include the following:^[18]

Kitchen Sink, a.k.a. "Everything SRE"

Kitchen Sink refers to the expansive and often unbounded scope of services and workflows that SRE teams oversee. Unlike traditional roles with clearly defined boundaries, SREs are tasked with various responsibilities, including system performance optimization, incident management, and automation. This approach allows SREs to address multiple challenges, ensuring that systems run efficiently and evolve in response to changing demands and complexities.

Infrastructure

Infrastructure SRE teams focus on maintaining and improving the reliability of systems that support other teams' workflows. While they sometimes collaborate with platform engineering teams, their primary responsibility is ensuring uptime, performance, and efficiency. Platform teams, on the other hand, primarily develop the software and systems used across the organization. While reliability is a goal for both, platform teams prioritize creating and maintaining the tools and services used by internal stakeholders, whereas Infrastructure SRE teams are tasked with ensuring those systems run smoothly and meet reliability standards.

Tools

SRE teams use a variety of tools to measure, maintain, and enhance system reliability. These tools play a role in monitoring performance, identifying issues, and facilitating proactive maintenance. For instance, Nagios Core is widely used for system monitoring and alerting, while Prometheus is popular for collecting and querying metrics in cloud-native environments.
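
A minimal sketch of the kind of threshold check that monitoring and alerting tools run, assuming the conventional Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The `check_latency` function and its default thresholds are hypothetical; they are not part of Nagios or Prometheus.

```python
import sys
from typing import Optional, Tuple

# Conventional Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_latency(latency_ms: Optional[float],
                  warn_ms: float = 300.0,
                  crit_ms: float = 1000.0) -> Tuple[int, str]:
    """Map a latency measurement to a (return code, status line) pair."""
    if latency_ms is None:
        return UNKNOWN, "UNKNOWN - no latency measurement available"
    # Test the critical threshold first so it takes precedence over warning.
    if latency_ms >= crit_ms:
        return CRITICAL, f"CRITICAL - latency {latency_ms:.0f} ms >= {crit_ms:.0f} ms"
    if latency_ms >= warn_ms:
        return WARNING, f"WARNING - latency {latency_ms:.0f} ms >= {warn_ms:.0f} ms"
    return OK, f"OK - latency {latency_ms:.0f} ms"


if __name__ == "__main__":
    # Assumed measurement; a real check would probe the monitored service.
    code, message = check_latency(450.0)
    print(message)
    sys.exit(code)
```

The exit code is what a scheduler would act on, while the printed line is the human-readable status shown alongside the alert.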

Product or application

SRE teams dedicated to specific products or applications are common in large organizations.^[19] These teams are responsible for ensuring the reliability, scalability, and performance of key services. In larger companies, it's typical to have multiple SRE teams, each focusing on different products or applications, ensuring that each area receives specialized attention to meet performance and availability targets.

Embedded

In an embedded model, individual SREs or pairs of SREs are integrated within software engineering teams. These SREs work closely with developers, applying core SRE principles such as automation, monitoring, and incident response directly to the software development lifecycle. This approach helps improve reliability, performance, and collaboration between SREs and developers.

Consulting

Consulting SRE teams specialize in advising organizations on the implementation of SRE principles and practices. Typically composed of seasoned SREs with a history across various implementations, these teams provide insights and guidance for specific organizational needs. When working directly with clients, these SREs are often referred to as 'Customer Reliability Engineers.'

In large organizations that have adopted SRE, a hybrid model is common. This model includes various implementations, such as multiple Product/Application SRE teams dedicated to addressing the unique reliability needs of different products. An Infrastructure SRE team may collaborate with a Platform engineering group to achieve shared reliability goals for a unified platform that supports all products and applications.

Industry

Since 2014, the USENIX organization has hosted the annual SREcon conference, bringing together site reliability engineers from various industries. This conference is a platform for professionals to share knowledge, explore best practices, and discuss trends in site reliability engineering.^[20]

External links

Awesome Site Reliability Engineering, a resources list
How they SRE, a resources list
SRE Weekly, a weekly newsletter devoted to SRE
SRE at Google, a landing page for learning more about SRE at Google
Komodor K8s Reliability, a learning centre with resources for SREs working with Kubernetes
SRE: What Do You Need To Know To Master This Role?, a resource list

Notes and References

Web site: Evaluating where your team lies on the SRE spectrum. 2021-06-26. Google Cloud Blog. en.
Web site: Hill, Patrick. Love DevOps? Wait until you meet SRE. June 17, 2021. Atlassian. en.
Web site: What is SRE? June 17, 2021. Red Hat. en.
Web site: Treynor, Ben. 2014. Keys to SRE. June 17, 2021. USENIX SREcon14.
Web site: Gossett, Stephen. June 1, 2020. What Is a Site Reliability Engineer? What Does an SRE Do? June 17, 2021. Built In. en.
Web site: Fischer, Donald. March 2, 2016. Are site reliability engineers the next data scientists? June 17, 2021. TechCrunch. en-US.
Web site: November 12, 2020. Site Reliability Engineering. June 21, 2021. IBM Cloud Education. IBM. en.
Web site: Site Reliability Engineering (SRE). March 12, 2024. engineering.linkedin.com.
Web site: SRE - Wikitech. 2021-10-17. wikitech.wikimedia.org. en.
Treynor, Ben. Niall Murphy. In Conversation. Google Site Reliability Engineering.
Jones, Chris; Underwood, Todd; Nukala, Shylaja. June 2015. Hiring Site Reliability Engineers. 40 (3): 35–39. June 17, 2021.
Web site: Interview with Betsy Beyer, Stephen Thorne of Google. 9 Oct 2018. Dave Harrison. 24 July 2024.
Book: Beyer, Betsy; Jones, Chris; Petoff, Jennifer; Murphy, Niall. Site Reliability Engineering: How Google Runs Production Systems. 2016. Sebastopol, CA. ISBN 978-1-4919-5118-7. OCLC 945577030.
Vargo, Seth; Fong-Jones, Liz. What's the Difference Between DevOps and SRE? (class SRE implements DevOps). March 1, 2018. Video. Google.
Web site: What is SRE? - SRE Explained - AWS. 2022-11-05. Amazon Web Services, Inc. en-US.
Web site: The 7 SRE Principles [And How to Put Them Into Practice]. 2021-06-26. www.blameless.com. en.
Web site: Learn about observability Honeycomb. 2021-06-26. docs.honeycomb.io. en.
Web site: SRE at Google: How to structure your SRE team. 2021-06-26. Google Cloud Blog. en.
Web site: SRE at Google: How to structure your SRE team. 2024-11-11. Google Cloud Blog. en-US.
Web site: 2021. Usenix SREcon. June 17, 2021. USENIX.
Web site: Beres, Cristi. SRE & DevOps: Striking the Perfect IT Match. Synergo Group. 27 November 2024.