SRE Learning Path

Build and operate reliable, scalable, and efficient systems.

Core SRE Principles

SLOs, SLIs, SLAs — Define and measure reliability
Error Budgets — Balance reliability with velocity
Toil Reduction — Automate operational work
Incident Management — Respond effectively to outages
Blameless Postmortems — Learn from failures
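
A short worked example may make the error-budget idea concrete. The sketch below is illustrative only: the names `BudgetReport` and `error_budget`, the parameters, and the request counts are assumptions rather than anything this path prescribes. It treats the budget as the share of requests an SLO allows to fail within one measurement window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetReport:
    sli: float               # observed success ratio in the window
    allowed_failures: float  # failures the SLO tolerates in the window
    budget_consumed: float   # fraction of the budget used (may exceed 1.0)


def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> BudgetReport:
    """Summarize error-budget consumption for one SLO window (hypothetical helper)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # The error budget is whatever the SLO leaves over: (1 - target) of all requests.
    allowed = (1.0 - slo_target) * total_requests
    sli = 1.0 - failed_requests / total_requests
    return BudgetReport(
        sli=sli,
        allowed_failures=allowed,
        budget_consumed=failed_requests / allowed,
    )


if __name__ == "__main__":
    # Made-up numbers: a 99.9% SLO, one million requests, 250 failures.
    report = error_budget(0.999, 1_000_000, 250)
    print(f"SLI {report.sli:.4%}, budget used {report.budget_consumed:.0%}")
```

A team might read `budget_consumed` above 1.0 as a signal to slow releases until reliability recovers, which is the balance between reliability and velocity that the list above refers to.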

Stage 1: Foundations

Linux system administration
Networking and distributed systems
Programming (Python/Go)
Observability stack

Stage 2: Monitoring & Alerting

Prometheus metrics and queries
Grafana dashboards
Alert design and routing
On-call best practices
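
One common approach to alert design is to page on how fast the error budget is burning rather than on raw error counts. The sketch below assumes that approach; the function names are hypothetical, and the default threshold of 14.4 is borrowed from the multi-window example in The Site Reliability Workbook (roughly 2% of a 30-day budget spent in one hour). In practice the error ratios would likely come from Prometheus queries over two time windows.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than 'sustainable' the budget is being spent.

    A burn rate of 1.0 uses exactly the whole budget over the SLO window.
    """
    if not 0.0 <= error_ratio <= 1.0:
        raise ValueError("error_ratio must be between 0 and 1")
    budget = 1.0 - slo_target
    if not 0.0 < budget < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / budget


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed value, see the lead-in
) -> bool:
    """Page only if both windows burn fast.

    The long window shows the problem is significant; the short window
    shows it is still happening, so the alert resets quickly after recovery.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


if __name__ == "__main__":
    # Made-up ratios for a 99.9% SLO: 2% errors over both windows.
    print(should_page(0.02, 0.02, slo_target=0.999))
    # The long window is still high, but the short window has recovered.
    print(should_page(0.02, 0.0005, slo_target=0.999))
```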

Stage 3: Reliability Practices

Capacity planning
Load testing and performance
Chaos Engineering
Disaster recovery (DR) planning and drills
Change management

Google SRE Book — Free Chapter Guide

The Site Reliability Engineering book by Google is free to read online at sre.google/sre-book. The tables below give the complete table of contents, with key topics for every chapter.

Part I — Introduction

Ch	Title	Key Topics
1	Introduction	What SRE is, how Google does it, tenets of SRE
2	The Production Environment at Google, from the Viewpoint of an SRE	Hardware, software infrastructure, networking, Borg

Part II — Principles

Ch	Title	Key Topics
3	Embracing Risk	Risk tolerance, error budgets, balancing reliability vs. velocity
4	Service Level Objectives	SLIs, SLOs, SLAs, choosing targets, control loops
5	Eliminating Toil	Defining toil, measuring toil, automation strategies
6	Monitoring Distributed Systems	Symptoms vs. causes, black-box/white-box monitoring, alerting
7	The Evolution of Automation at Google	Automation hierarchy, platform-based automation
8	Release Engineering	Build/release pipeline, hermetic builds, release philosophy
9	Simplicity	Boring is good, minimal APIs, modularity, release simplicity

Part III — Practices

Ch	Title	Key Topics
10	Practical Alerting from Time-Series Data	Borgmon, time-series monitoring, alerting rules, dashboards
11	Being On-Call	On-call balance, compensation, operational load, feeling safe
12	Effective Troubleshooting	Problem reports, triage, examine, diagnose, test/treat
13	Emergency Response	Real-world emergencies, learning from incidents, drills
14	Managing Incidents	Incident command, roles, communication, declared incidents
15	Postmortem Culture: Learning from Failure	Blameless postmortems, collaboration, continuous improvement
16	Tracking Outages	Escalator, Outalator, aggregation, tagging, reporting
17	Testing for Reliability	Unit/integration/system testing, production tests, canary
18	Software Engineering in SRE	SRE as software engineers, Auxon case study
19	Load Balancing at the Frontend	DNS load balancing, virtual IPs, anycast
20	Load Balancing in the Datacenter	Subset selection, backend health, weighted round robin
21	Handling Overload	Client-side throttling, criticality, graceful degradation
22	Addressing Cascading Failures	Causes, prevention, mitigation, resource exhaustion
23	Managing Critical State: Distributed Consensus for Reliability	Distributed consensus, Paxos, Raft, leader election
24	Distributed Periodic Scheduling with Cron	Cron at scale, idempotency, large-scale scheduling
25	Data Processing Pipelines	Pipeline design patterns, Workflow, periodic pipelines
26	Data Integrity: What You Read Is What You Wrote	Backups, recovery, replication, ACID, soft deletes
27	Reliable Product Launches at Scale	Launch coordination, checklists, progressive rollouts
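
Chapter 21 in the table above lists client-side throttling among its topics. The book describes an adaptive scheme in which each client rejects some requests locally with probability max(0, (requests - K * accepts) / (requests + 1)), where both counts cover a recent window. The sketch below is one way that formula might be written down; the function names and the injectable `rng` parameter are assumptions made for illustration.

```python
import random
from typing import Callable


def rejection_probability(requests: int, accepts: int, k: float = 2.0) -> float:
    """Probability that the client should drop a request before sending it.

    requests: requests the application attempted in the recent window.
    accepts:  requests the backend accepted in the same window.
    k:        multiplier; lower values throttle more aggressively.
    """
    if requests < 0 or accepts < 0:
        raise ValueError("counts must be non-negative")
    return max(0.0, (requests - k * accepts) / (requests + 1))


def should_reject_locally(
    requests: int,
    accepts: int,
    k: float = 2.0,
    rng: Callable[[], float] = random.random,  # injectable for testing
) -> bool:
    """Draw once and decide whether to fail this request on the client side."""
    return rng() < rejection_probability(requests, accepts, k)


if __name__ == "__main__":
    # Healthy backend: everything is accepted, so nothing is rejected locally.
    print(rejection_probability(requests=100, accepts=100))
    # Overloaded backend: only 40 of 100 accepted, so the client sheds some load.
    print(rejection_probability(requests=100, accepts=40))
```

With `k = 2.0`, a client starts rejecting locally only once its attempted requests exceed twice the number the backend accepts, which leaves room for the backend to recover without being starved of traffic.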

Part IV — Management

Ch	Title	Key Topics
28	Accelerating SREs to On-Call and Beyond	Training, reverse engineering, shadowing, onboarding
29	Dealing with Interrupts	Interrupt management, flow state, operational vs. project work
30	Embedding an SRE to Recover from Operational Overload	Team health, SRE embedding, operational recovery
31	Communication and Collaboration in SRE	Production meetings, collaboration models, knowledge sharing
32	The Evolving SRE Engagement Model	PRR model, early engagement, frameworks, SRE teams

Part V — Conclusions

Ch	Title	Key Topics
33	Lessons Learned from Other Industries	Aviation, healthcare, nuclear — parallels to SRE
34	Conclusion	SRE future, key takeaways

Appendices

ID	Title	Description
A	Availability Table	Nines of availability with downtime calculations
B	A Collection of Best Practices for Production Services	Collected best practices checklist
C	Example Incident State Document	Template for tracking incidents
D	Example Postmortem	Full postmortem example from Shakespeare service
E	Launch Coordination Checklist	Pre-launch and launch-day checklist
F	Example Production Meeting Minutes	How Google runs production review meetings
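
The downtime figures that Appendix A tabulates follow from simple arithmetic: allowed downtime is the period length multiplied by (1 - availability). The sketch below reproduces that calculation; the function name `allowed_downtime` and the choice of a 365-day year are assumptions for illustration.

```python
from datetime import timedelta


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Return the downtime an availability target permits over a period."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    # timedelta supports multiplication by a float (rounded to microseconds).
    return period * unavailable_fraction


if __name__ == "__main__":
    year = timedelta(days=365)
    for target in (90.0, 99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime(target, year).total_seconds() / 60
        print(f"{target}% -> {minutes:,.1f} minutes per year")
```

For example, a 99.9% target over a 365-day year allows about 8.76 hours of downtime.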

Recommended Reading

Book	Author
Site Reliability Engineering	Google (Betsy Beyer et al.) — Free online
The Site Reliability Workbook	Google (Betsy Beyer et al.) — Free online
Building Secure and Reliable Systems	Google (Heather Adkins et al.) — Free online
Implementing Service Level Objectives	Alex Hidalgo
Observability Engineering	Charity Majors et al.
