SRE Learning Path

Build and operate reliable, scalable, and efficient systems.

Core SRE Principles

  • SLOs, SLIs, SLAs — Define and measure reliability
  • Error Budgets — Balance reliability with velocity
  • Toil Reduction — Automate operational work
  • Incident Management — Respond effectively to outages
  • Blameless Postmortems — Learn from failures
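To make the link between an SLO and its error budget concrete, here is a minimal sketch. It assumes a request-based availability SLI (good requests divided by total requests); the function names and sample numbers are hypothetical and are not taken from the learning path.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Requests allowed to fail in the window while still meeting the SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = error_budget(slo_target, total_requests)
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests against a 99.9% SLO.
    print(error_budget(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, failed_requests=250))
```

With these sample numbers the budget is roughly 1,000 failed requests, and 250 failures leave about three quarters of it unspent. A team might use the remaining fraction to decide whether to keep shipping changes or to prioritize reliability work.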

Stage 1: Foundations

  • Linux system administration
  • Networking and distributed systems
  • Programming (Python/Go)
  • Observability stack

Stage 2: Monitoring & Alerting

  • Prometheus metrics and queries
  • Grafana dashboards
  • Alert design and routing
  • On-call best practices
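One common approach to alert design is to page on error-budget burn rate rather than on raw error counts. The sketch below assumes a two-window check (a long window to confirm the problem is significant, a short window to confirm it is still happening). The default threshold of 14.4 follows the example in The Site Reliability Workbook for a 30-day SLO with 1-hour and 5-minute windows; the function names and inputs are hypothetical, and in practice the error ratios would come from a metrics backend such as Prometheus.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn budget faster than the threshold."""
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


if __name__ == "__main__":
    # Hypothetical readings: 2% errors over the last hour and the last 5 minutes.
    print(should_page(0.02, 0.02, slo_target=0.999))
    # Errors have already recovered in the short window, so no page is sent.
    print(should_page(0.02, 0.0005, slo_target=0.999))
```

Requiring both windows to exceed the threshold is a design choice that trades a little detection latency for fewer pages about incidents that have already resolved.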

Stage 3: Reliability Practices

  • Capacity planning
  • Load testing and performance
  • Chaos Engineering
  • Disaster recovery and DR drills
  • Change management
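As a rough illustration of capacity planning, the sketch below estimates how many instances a service needs for a forecast peak. It assumes load spreads evenly across instances and that per-instance throughput was measured in a load test; the function name, the headroom fraction, and the spare-instance count are all hypothetical parameters, not a prescribed method.

```python
import math


def instances_needed(peak_rps: float, per_instance_rps: float,
                     headroom: float = 0.25, spare: int = 1) -> int:
    """Instances required to serve peak load plus headroom, plus spares.

    headroom: extra fraction of capacity reserved for forecast error or spikes.
    spare:    additional instances kept so that losing some does not cause overload.
    """
    if peak_rps < 0 or per_instance_rps <= 0 or headroom < 0 or spare < 0:
        raise ValueError("inputs must be non-negative and per_instance_rps positive")
    base = math.ceil(peak_rps * (1.0 + headroom) / per_instance_rps)
    return base + spare


if __name__ == "__main__":
    # Hypothetical service: 1,000 rps peak, 100 rps per instance from a load test.
    print(instances_needed(peak_rps=1000, per_instance_rps=100))
```

Real capacity models usually also account for growth forecasts, uneven load distribution, and dependencies, so this is best read as a starting point for the arithmetic only.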

Essential Reading

Book                                    Author
Site Reliability Engineering            Google (Betsy Beyer et al.)
The Site Reliability Workbook           Google (Betsy Beyer et al.)
Building Secure and Reliable Systems    Google (Heather Adkins et al.)
Training Site Reliability Engineers     Google
Implementing Service Level Objectives   Alex Hidalgo
Observability Engineering               Charity Majors et al.

All SRE content from these books has been integrated into the SRE Practices guide above.