SRE Learning Path
Build and operate reliable, scalable, and efficient systems.
Core SRE Principles
- SLOs, SLIs, SLAs — Define and measure reliability
- Error Budgets — Balance reliability with velocity
- Toil Reduction — Automate operational work
- Incident Management — Respond effectively to outages
- Blameless Postmortems — Learn from failures
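To make the link between SLOs and error budgets concrete, here is a minimal sketch for a request-based SLO. The function name `error_budget`, the 99.9% target, and the request counts are illustrative assumptions, not values from any real service.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget usage for a request-based SLO over one window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the share of requests the SLO allows to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    return {
        "sli": (total_requests - failed_requests) / total_requests,
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": failed_requests / allowed_failures,
    }


# Hypothetical window: 1,000,000 requests, 400 failures, 99.9% SLO.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(report)
```

With these example numbers the SLO allows roughly 1,000 failed requests, so 400 failures consume about 40% of the budget. A team might keep shipping at normal velocity while budget remains and slow releases once it is spent.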
Stage 1: Foundations
- Linux system administration
- Networking and distributed systems
- Programming (Python/Go)
- Observability basics: metrics, logs, and traces
Stage 2: Monitoring & Alerting
- Prometheus metrics and queries
- Grafana dashboards
- Alert design and routing
- On-call best practices
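One common approach to alert design is paging on SLO burn rate, as described in The Site Reliability Workbook. The sketch below assumes a 30-day SLO window and the workbook's example threshold of 14.4 (a rate that would spend 2% of the budget in one hour). The function names and the error ratios are hypothetical; in practice the ratios would come from monitoring queries over a long and a short window.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_errors: float, short_window_errors: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast.

    The long window shows the problem is significant; the short window
    shows it is still happening, so the alert resets soon after recovery.
    """
    return (burn_rate(long_window_errors, slo_target) >= threshold
            and burn_rate(short_window_errors, slo_target) >= threshold)


# Hypothetical error ratios for a 99.9% SLO.
print(should_page(0.02, 0.03, slo_target=0.999))    # sustained and ongoing
print(should_page(0.02, 0.001, slo_target=0.999))   # already recovered
```

Requiring both windows is a design choice that trades a little detection delay for fewer pages on brief spikes.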
Stage 3: Reliability Practices
- Capacity planning
- Load testing and performance
- Chaos Engineering
- Disaster recovery (DR) planning and drills
- Change management
Google SRE Book — Free Chapter Guide
The Site Reliability Engineering book by Google is free to read online at sre.google/sre-book. The tables below list every chapter and appendix with its key topics.
Part I — Introduction
| Ch | Title | Key Topics |
|---|---|---|
| 1 | Introduction | What SRE is, how Google does it, tenets of SRE |
| 2 | The Production Environment at Google, from the Viewpoint of an SRE | Hardware, software infrastructure, networking, Borg |
Part II — Principles
| Ch | Title | Key Topics |
|---|---|---|
| 3 | Embracing Risk | Risk tolerance, error budgets, balancing reliability vs. velocity |
| 4 | Service Level Objectives | SLIs, SLOs, SLAs, choosing targets, control loops |
| 5 | Eliminating Toil | Defining toil, measuring toil, automation strategies |
| 6 | Monitoring Distributed Systems | Symptoms vs. causes, black-box/white-box monitoring, alerting |
| 7 | The Evolution of Automation at Google | Automation hierarchy, platform-based automation |
| 8 | Release Engineering | Build/release pipeline, hermetic builds, release philosophy |
| 9 | Simplicity | Boring is good, minimal APIs, modularity, release simplicity |
Part III — Practices
| Ch | Title | Key Topics |
|---|---|---|
| 10 | Practical Alerting from Time-Series Data | Borgmon, time-series monitoring, alerting rules, dashboards |
| 11 | Being On-Call | On-call balance, compensation, operational load, feeling safe |
| 12 | Effective Troubleshooting | Problem reports, triage, examine, diagnose, test/treat |
| 13 | Emergency Response | Real-world emergencies, learning from incidents, drills |
| 14 | Managing Incidents | Incident command, roles, communication, declared incidents |
| 15 | Postmortem Culture: Learning from Failure | Blameless postmortems, collaboration, continuous improvement |
| 16 | Tracking Outages | Escalator, Outalator, aggregation, tagging, reporting |
| 17 | Testing for Reliability | Unit/integration/system testing, production tests, canary |
| 18 | Software Engineering in SRE | SRE as software engineers, Auxon case study |
| 19 | Load Balancing at the Frontend | DNS load balancing, virtual IPs, anycast |
| 20 | Load Balancing in the Datacenter | Subset selection, backend health, weighted round robin |
| 21 | Handling Overload | Client-side throttling, criticality, graceful degradation |
| 22 | Addressing Cascading Failures | Causes, prevention, mitigation, resource exhaustion |
| 23 | Managing Critical State: Distributed Consensus for Reliability | Distributed consensus, Paxos, Raft, leader election |
| 24 | Distributed Periodic Scheduling with Cron | Cron at scale, idempotency, large-scale scheduling |
| 25 | Data Processing Pipelines | Pipeline design patterns, Workflow, periodic pipelines |
| 26 | Data Integrity: What You Read Is What You Wrote | Backups, recovery, replication, ACID, soft deletes |
| 27 | Reliable Product Launches at Scale | Launch coordination, checklists, progressive rollouts |
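The client-side throttling covered in Chapter 21 can be sketched in a few lines. The book gives the local rejection probability as max(0, (requests - K * accepts) / (requests + 1)). The class name `AdaptiveThrottle`, its methods, and the simulated backend below are hypothetical. As a simplifying assumption, the counters here are cumulative, whereas the book tracks them over a trailing two-minute window.

```python
import random
from typing import Callable


class AdaptiveThrottle:
    """Client-side throttle based on the rejection formula in SRE book Ch. 21."""

    def __init__(self, k: float = 2.0, rng: Callable[[], float] = random.random):
        self.k = k          # multiplier on accepts; higher values throttle later
        self.requests = 0   # requests attempted by the application
        self.accepts = 0    # requests the backend accepted
        self._rng = rng

    def reject_probability(self) -> float:
        return max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))

    def call(self, send: Callable[[], bool]) -> bool:
        """Try one request; return True only if the backend accepted it."""
        reject_locally = self._rng() < self.reject_probability()
        self.requests += 1  # local rejections still count as attempts
        if reject_locally:
            return False    # dropped at the client; the backend never sees it
        accepted = send()
        if accepted:
            self.accepts += 1
        return accepted


throttle = AdaptiveThrottle(k=2.0, rng=random.Random(0).random)
for _ in range(50):
    throttle.call(lambda: True)    # simulated healthy backend
healthy_p = throttle.reject_probability()
for _ in range(200):
    throttle.call(lambda: False)   # simulated overloaded backend
overloaded_p = throttle.reject_probability()
print(f"healthy: {healthy_p:.2f}, overloaded: {overloaded_p:.2f}")
# prints "healthy: 0.00, overloaded: 0.60"
```

While the backend accepts everything, the client never rejects locally. Once rejections accumulate, the client starts shedding its own traffic, which spares the overloaded backend the cost of rejecting it.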
Part IV — Management
| Ch | Title | Key Topics |
|---|---|---|
| 28 | Accelerating SREs to On-Call and Beyond | Training, reverse engineering, shadowing, onboarding |
| 29 | Dealing with Interrupts | Interrupt management, flow state, operational vs. project work |
| 30 | Embedding an SRE to Recover from Operational Overload | Team health, SRE embedding, operational recovery |
| 31 | Communication and Collaboration in SRE | Production meetings, collaboration models, knowledge sharing |
| 32 | The Evolving SRE Engagement Model | PRR model, early engagement, frameworks, SRE teams |
Part V — Conclusions
| Ch | Title | Key Topics |
|---|---|---|
| 33 | Lessons Learned from Other Industries | Aviation, healthcare, nuclear — parallels to SRE |
| 34 | Conclusion | SRE future, key takeaways |
Appendices
| ID | Title | Description |
|---|---|---|
| A | Availability Table | Nines of availability with downtime calculations |
| B | A Collection of Best Practices for Production Services | Collected best practices checklist |
| C | Example Incident State Document | Template for tracking incidents |
| D | Example Postmortem | Full postmortem example for the Shakespeare service |
| E | Launch Coordination Checklist | Pre-launch and launch-day checklist |
| F | Example Production Meeting Minutes | How Google runs production review meetings |
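The downtime figures in an availability table such as Appendix A follow from simple arithmetic. The sketch below assumes a 365-day year; the function name `allowed_downtime` and the list of targets are illustrative.

```python
from datetime import timedelta


def allowed_downtime(availability: float, period: timedelta) -> timedelta:
    """Downtime permitted over `period` at the given availability (0 to 1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return timedelta(seconds=(1.0 - availability) * period.total_seconds())


YEAR = timedelta(days=365)  # assumption: non-leap year

for target in (0.99, 0.999, 0.9999, 0.99999):
    hours = allowed_downtime(target, YEAR).total_seconds() / 3600
    print(f"{target:.3%} availability allows {hours:.2f} hours of downtime per year")
```

For example, 99.9% availability allows about 8.76 hours of downtime per year, and each additional nine cuts that allowance by a factor of ten.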
Recommended Reading
| Book | Author |
|---|---|
| Site Reliability Engineering | Google (Betsy Beyer et al.) — Free online |
| The Site Reliability Workbook | Google (Betsy Beyer et al.) — Free online |
| Building Secure and Reliable Systems | Google (Heather Adkins et al.) — Free online |
| Implementing Service Level Objectives | Alex Hidalgo |
| Observability Engineering | Charity Majors et al. |
All SRE content from these books has been integrated into the SRE Practices guide.