Observability & SRE
Prometheus, Grafana, and SRE practices for reliable systems.
Core Concepts
| Concept | Description |
|---|---|
| Monitoring | Collecting, processing, and aggregating metrics to understand system health |
| Observability | Ability to understand internal state from external outputs (metrics, logs, traces) |
| Alerting | Notifying humans when systems need attention |
| SLOs/SLIs | Service Level Indicators (SLIs) measure reliability; Service Level Objectives (SLOs) set the targets those measurements must meet |
| Incident Management | Detecting, responding to, and learning from outages |
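
To make the SLOs/SLIs row concrete, here is a minimal sketch of how an SLI and its error budget might be computed from raw event counts. It assumes a request-based SLI (good events divided by total events); the names `ErrorBudget` and `error_budget` and the sample numbers are hypothetical and do not come from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    sli: float               # observed fraction of good events
    allowed_failures: float  # failures the SLO tolerates at this event volume
    consumed: float          # fraction of the budget used (can exceed 1.0)


def error_budget(good_events: int, total_events: int, slo_target: float) -> ErrorBudget:
    """Compute a request-based SLI and how much of the error budget is spent.

    Hypothetical helper: slo_target is a fraction such as 0.999 for "99.9%".
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")

    sli = good_events / total_events
    # The error budget is the share of events the SLO allows to fail.
    allowed_failures = (1.0 - slo_target) * total_events
    failed = total_events - good_events
    return ErrorBudget(sli=sli, allowed_failures=allowed_failures,
                       consumed=failed / allowed_failures)


if __name__ == "__main__":
    # Made-up numbers: 1,000,000 requests, 500 failures, 99.9% target.
    report = error_budget(good_events=999_500, total_events=1_000_000, slo_target=0.999)
    print(f"SLI: {report.sli:.4%}")
    print(f"Error budget consumed: {report.consumed:.0%}")
```

A `consumed` value above 1.0 means the service has failed more events than the SLO allows for the window, which is a common trigger for slowing releases or prioritizing reliability work.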
Key Tools
| Tool | Purpose | Link |
|---|---|---|
| Prometheus | Metrics collection and alerting | prometheus.io |
| Grafana | Dashboards and visualization | grafana.com |
| Alertmanager | Alert routing, grouping, and silencing | prometheus.io/docs/alerting |
| Jaeger | Distributed tracing | jaegertracing.io |
| OpenTelemetry | Unified observability framework | opentelemetry.io |
| PagerDuty | Incident response and on-call | pagerduty.com |
Google SRE Book — Observability Chapters
These chapters from the free Google SRE Book are directly relevant to observability and monitoring:
| Ch | Title | Why It Matters |
|---|---|---|
| 4 | Service Level Objectives | Foundation for what to monitor and alert on |
| 6 | Monitoring Distributed Systems | Core monitoring philosophy and approach |
| 10 | Practical Alerting from Time-Series Data | How to build effective alerting systems |
| 12 | Effective Troubleshooting | Using observability data to diagnose issues |
| 15 | Postmortem Culture: Learning from Failure | Learning from failures to improve monitoring |
| 16 | Tracking Outages | Aggregating and analyzing incident data |
| 17 | Testing for Reliability | Validating monitoring and alerting in production |
See the full SRE Learning Path for all 34 chapters.
Further Reading
- SRE Learning Path — Complete roadmap with all Google SRE Book chapters
- SRE Practices Guide — Deep dive into SLOs, incident management, postmortems
Contributing
Know great Observability & SRE resources? Submit a PR to help the community learn!