Observability & SRE

Prometheus, Grafana, and SRE practices for reliable systems.

Core Concepts

Concept	Description
Monitoring	Collecting, processing, and aggregating metrics to understand system health
Observability	Ability to understand internal state from external outputs (metrics, logs, traces)
Alerting	Notifying humans when systems need attention
SLOs/SLIs	Measuring reliability with Service Level Indicators (SLIs) and setting targets for it with Service Level Objectives (SLOs)
Incident Management	Detecting, responding to, and learning from outages
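
To make the SLO/SLI row concrete, here is a minimal Python sketch of an availability SLI and the error budget derived from an SLO target. The function names, the 99.9% target, and the 25% alert threshold are illustrative assumptions, not values taken from Prometheus, Grafana, or the SRE Book.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Fraction of events that were good; treated as 1.0 when there is no traffic."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means no budget used; 0.0 or below means the budget is exhausted.
    """
    allowed_failure = 1.0 - slo_target
    if allowed_failure <= 0:
        raise ValueError("slo_target must be below 1.0")
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


def should_alert(sli: float, slo_target: float, min_budget: float = 0.25) -> bool:
    """Hypothetical policy: alert once less than min_budget of the budget is left."""
    return error_budget_remaining(sli, slo_target) < min_budget


if __name__ == "__main__":
    # Example: 99,950 successful requests out of 100,000 against a 99.9% SLO.
    sli = availability_sli(99_950, 100_000)
    print(f"SLI: {sli:.4%}")
    print(f"Budget remaining: {error_budget_remaining(sli, 0.999):.0%}")
    print(f"Alert: {should_alert(sli, 0.999)}")
```

In practice the good and total counts would come from a metrics backend over a rolling window; the sketch only shows the arithmetic that links an SLI to an SLO.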

Tools

Tool	Purpose	Link
Prometheus	Metrics collection and alerting	prometheus.io
Grafana	Dashboards and visualization	grafana.com
Alertmanager	Alert routing, grouping, and silencing	prometheus.io/docs/alerting
Jaeger	Distributed tracing	jaegertracing.io
OpenTelemetry	Unified observability framework	opentelemetry.io
PagerDuty	Incident response and on-call	pagerduty.com

The following chapters of Google's free Site Reliability Engineering book are directly relevant to observability and monitoring:

Ch	Title	Why It Matters
4	Service Level Objectives	Foundation for what to monitor and alert on
6	Monitoring Distributed Systems	Core monitoring philosophy and approach
10	Practical Alerting from Time-Series Data	How to build effective alerting systems
12	Effective Troubleshooting	Using observability data to diagnose issues
15	Postmortem Culture: Learning from Failure	Learning from failures to improve monitoring
16	Tracking Outages	Aggregating and analyzing incident data
17	Testing for Reliability	Validating monitoring and alerting in production

See the full SRE Learning Path for all 34 chapters.

Know of great Observability & SRE resources? Submit a PR to help the community learn!