SRE Learning Path

Build and operate reliable, scalable, and efficient systems.

Core SRE Principles

  • SLOs, SLIs, SLAs — Define and measure reliability
  • Error Budgets — Balance reliability with velocity
  • Toil Reduction — Automate operational work
  • Incident Management — Respond effectively to outages
  • Blameless Postmortems — Learn from failures
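
The link between an SLO and its error budget comes down to simple arithmetic. The sketch below is illustrative only: the function name, the request counts, and the 99.9% target are assumptions chosen for the example, and it assumes a request-based availability SLI (good requests divided by total requests).

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget consumption for a request-based SLO.

    slo_target is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    All names here are hypothetical, chosen only for this example.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("request counts are inconsistent")

    # The budget is the number of requests that may fail without breaching the SLO.
    allowed_failures = (1.0 - slo_target) * total_requests
    # The SLI is the measured fraction of good requests.
    sli = (total_requests - failed_requests) / total_requests
    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": failed_requests / allowed_failures,  # 1.0 means fully spent
        "slo_met": sli >= slo_target,
    }


if __name__ == "__main__":
    # Example numbers are made up: one million requests, 400 of them failed.
    print(error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400))
```

With these made-up numbers the budget is about 1,000 failed requests, so 400 failures consume roughly 40% of it. A team could use the remaining budget to decide how much release risk it is willing to take.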

Stage 1: Foundations

  • Linux system administration
  • Networking and distributed systems
  • Programming (Python/Go)
  • Observability stack

Stage 2: Monitoring & Alerting

  • Prometheus metrics and queries
  • Grafana dashboards
  • Alert design and routing
  • On-call best practices

Stage 3: Reliability Practices

  • Capacity planning
  • Load testing and performance
  • Chaos Engineering
  • Disaster recovery planning and drills
  • Change management

Google SRE Book — Free Chapter Guide

Google's Site Reliability Engineering book is free to read online at sre.google/sre-book. The complete table of contents is below, with key topics for each chapter; every chapter links to the online edition.

Part I — Introduction

Ch | Title | Key Topics
1 | Introduction | What SRE is, how Google does it, tenets of SRE
2 | The Production Environment at Google | Hardware, software infrastructure, networking, Borg

Part II — Principles

Ch | Title | Key Topics
3 | Embracing Risk | Risk tolerance, error budgets, balancing reliability vs. velocity
4 | Service Level Objectives | SLIs, SLOs, SLAs, choosing targets, control loops
5 | Eliminating Toil | Defining toil, measuring toil, automation strategies
6 | Monitoring Distributed Systems | Symptoms vs. causes, black-box/white-box monitoring, alerting
7 | The Evolution of Automation at Google | Automation hierarchy, platform-based automation
8 | Release Engineering | Build/release pipeline, hermetic builds, release philosophy
9 | Simplicity | Boring is good, minimal APIs, modularity, release simplicity

Part III — Practices

Ch | Title | Key Topics
10 | Practical Alerting | Borgmon, time-series monitoring, alerting rules, dashboards
11 | Being On-Call | On-call balance, compensation, operational load, feeling safe
12 | Effective Troubleshooting | Problem reports, triage, examine, diagnose, test/treat
13 | Emergency Response | Real-world emergencies, learning from incidents, drills
14 | Managing Incidents | Incident command, roles, communication, declared incidents
15 | Postmortem Culture: Learning from Failure | Blameless postmortems, collaboration, continuous improvement
16 | Tracking Outages | Escalator, Outalator, aggregation, tagging, reporting
17 | Testing for Reliability | Unit/integration/system testing, production tests, canary
18 | Software Engineering in SRE | SRE as software engineers, Auxon case study
19 | Load Balancing at the Frontend | DNS load balancing, virtual IPs, anycast
20 | Load Balancing in the Datacenter | Subset selection, backend health, weighted round robin
21 | Handling Overload | Client-side throttling, criticality, graceful degradation
22 | Addressing Cascading Failures | Causes, prevention, mitigation, resource exhaustion
23 | Managing Critical State | Distributed consensus, Paxos, Raft, leader election
24 | Distributed Periodic Scheduling | Cron at scale, idempotency, large-scale scheduling
25 | Data Processing Pipelines | Pipeline design patterns, Workflow, periodic pipelines
26 | Data Integrity | Backups, recovery, replication, ACID, soft deletes
27 | Reliable Product Launches at Scale | Launch coordination, checklists, progressive rollouts
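
The client-side throttling topic listed for Chapter 21 can be illustrated with the adaptive throttling rule described in the book, where a client rejects a request locally with probability max(0, (requests - K * accepts) / (requests + 1)). The sketch below is a simplified illustration under stated assumptions: the function names are hypothetical, the default K of 2 is one common choice, and the sliding time window over which a real client would count requests and accepts is omitted.

```python
import random


def rejection_probability(requests: int, accepts: int, k: float = 2.0) -> float:
    """Probability that the client should drop a request before sending it.

    requests: requests the client attempted in the recent window.
    accepts:  requests the backend accepted in the same window.
    k:        multiplier; higher values make the client throttle less aggressively.
    """
    if requests < 0 or accepts < 0:
        raise ValueError("counts must be non-negative")
    return max(0.0, (requests - k * accepts) / (requests + 1))


def should_reject_locally(requests: int, accepts: int, k: float = 2.0,
                          rng=random.random) -> bool:
    """Randomly decide whether to reject; rng is injectable so tests are deterministic."""
    return rng() < rejection_probability(requests, accepts, k)


if __name__ == "__main__":
    # Made-up counts: a healthy backend, then one accepting only 10% of traffic.
    print(rejection_probability(requests=100, accepts=100))
    print(rejection_probability(requests=100, accepts=10))
```

While the backend accepts at least 1/K of the traffic, the probability stays at zero and the client sends everything. As the accept count falls, the client sheds a growing share of requests itself, which spares an overloaded backend the cost of rejecting them.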

Part IV — Management

Ch | Title | Key Topics
28 | Accelerating SREs to On-Call | Training, reverse engineering, shadowing, onboarding
29 | Dealing with Interrupts | Interrupt management, flow state, operational vs. project work
30 | Embedding an SRE to Recover from Operational Overload | Team health, SRE embedding, operational recovery
31 | Communication and Collaboration in SRE | Production meetings, collaboration models, knowledge sharing
32 | The Evolving SRE Engagement Model | PRR model, early engagement, frameworks, SRE teams

Part V — Conclusions

Ch | Title | Key Topics
33 | Lessons Learned from Other Industries | Aviation, healthcare, nuclear: parallels to SRE
34 | Conclusion | SRE future, key takeaways

Appendices

ID | Title | Description
A | Availability Table | Nines of availability with downtime calculations
B | Best Practices for Production Services | Collected best practices checklist
C | Example Incident State Document | Template for tracking incidents
D | Example Postmortem | Full postmortem example from Shakespeare service
E | Launch Coordination Checklist | Pre-launch and launch-day checklist
F | Example Production Meeting Minutes | How Google runs production review meetings
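
The "nines of availability" downtime calculation summarized in Appendix A is straightforward to reproduce. The following is a minimal sketch, not the book's own table: the function name and the 365-day default window are assumptions made for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(availability_percent: float, window_days: float = 365.0) -> float:
    """Return the downtime (in seconds) permitted by an availability target.

    availability_percent: target such as 99.9 (meaning 99.9%).
    window_days: length of the measurement window; 365 is an assumed default.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * window_days * SECONDS_PER_DAY


if __name__ == "__main__":
    # Print yearly and 30-day downtime allowances for a few common targets.
    for target in (99.0, 99.9, 99.99, 99.999):
        per_year_hours = allowed_downtime_seconds(target) / 3600
        per_30d_minutes = allowed_downtime_seconds(target, window_days=30) / 60
        print(f"{target}%: {per_year_hours:.2f} h/year, {per_30d_minutes:.2f} min/30 days")
```

For example, a 99.9% target over a 365-day year allows about 8.76 hours of downtime, and 99.99% allows about 52.6 minutes.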

Book | Author
Site Reliability Engineering | Google (Betsy Beyer et al.); free online
The Site Reliability Workbook | Google (Betsy Beyer et al.); free online
Building Secure and Reliable Systems | Google (Heather Adkins et al.); free online
Implementing Service Level Objectives | Alex Hidalgo
Observability Engineering | Charity Majors et al.

All SRE content from these books has been integrated into the SRE Practices guide.