SRE Learning Path
Build and operate reliable, scalable, and efficient systems.
Core SRE Principles
- SLOs, SLIs, SLAs — Define and measure reliability
- Error Budgets — Balance reliability with velocity
- Toil Reduction — Automate operational work
- Incident Management — Respond effectively to outages
- Blameless Postmortems — Learn from failures
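
The relationship between an SLO and its error budget is simple arithmetic, and a short sketch may make it concrete. The function names and the example figures below (a 99.9% target over one million requests) are illustrative assumptions, not taken from any particular tool.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Return how many bad events the SLO tolerates over the window.

    slo_target is a fraction such as 0.999 (i.e. 99.9%).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Return the fraction of the error budget still unspent.

    A negative result means the budget is exhausted and the SLO is violated.
    """
    budget = error_budget(slo_target, total_events)
    return 1.0 - bad_events / budget


if __name__ == "__main__":
    # Hypothetical example: 99.9% availability SLO, 1,000,000 requests, 250 failures.
    allowed = error_budget(0.999, 1_000_000)
    left = budget_remaining(0.999, 1_000_000, 250)
    print(f"allowed failures: {allowed:.0f}, budget remaining: {left:.0%}")
```

When the remaining budget approaches zero, a team would typically slow feature releases and prioritize reliability work; while budget remains, it can be spent on velocity.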
Stage 1: Foundations
- Linux system administration
- Networking and distributed systems
- Programming (Python/Go)
- Observability fundamentals (metrics, logs, traces)
Stage 2: Monitoring & Alerting
- Prometheus metrics and queries
- Grafana dashboards
- Alert design and routing
- On-call best practices
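
One common approach to alert design is to page on the error budget burn rate rather than on raw error counts. The sketch below assumes a multi-window check in the style described in The Site Reliability Workbook. The function names are hypothetical, and the default threshold of 14.4 corresponds to spending 2% of a 30-day budget in one hour; tune it to your own SLO window.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    A burn rate of 1.0 spends the budget exactly over the SLO window;
    higher values exhaust it proportionally sooner.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def should_page(
    long_window_ratio: float,
    short_window_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed default: 2% of a 30-day budget in 1 hour
) -> bool:
    """Page only if both the long and the short window exceed the threshold.

    The short window keeps the alert from firing long after the problem
    has already stopped.
    """
    return (
        burn_rate(long_window_ratio, slo_target) >= threshold
        and burn_rate(short_window_ratio, slo_target) >= threshold
    )


if __name__ == "__main__":
    # Hypothetical example: 99.9% SLO, 2% errors over 1h and over the last 5m.
    print(should_page(0.02, 0.02, 0.999))    # sustained burn in both windows
    print(should_page(0.02, 0.0005, 0.999))  # errors have already subsided
```

In practice the error ratios would come from the monitoring system (for example, Prometheus queries over the two windows), and the paging decision would live in the alerting rules rather than in application code.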
Stage 3: Reliability Practices
- Capacity planning
- Load testing and performance
- Chaos Engineering
- Disaster recovery (DR) planning and drills
- Change management
Essential Reading
| Book | Author |
|---|---|
| Site Reliability Engineering | Google (Betsy Beyer et al.) |
| The Site Reliability Workbook | Google (Betsy Beyer et al.) |
| Implementing Service Level Objectives | Alex Hidalgo |
| Observability Engineering | Charity Majors et al. |
Recommended Reading
| Book | Author |
|---|---|
| Site Reliability Engineering | Google (Betsy Beyer et al.) |
| The Site Reliability Workbook | Google (Betsy Beyer et al.) |
| Building Secure and Reliable Systems | Google (Heather Adkins et al.) |
| Training Site Reliability Engineers | |
| Implementing Service Level Objectives | Alex Hidalgo |
All SRE content from these books has been integrated into the SRE Practices guide above.