Observability & SRE

Prometheus, Grafana, and SRE practices for reliable systems.

Core Concepts

| Concept | Description |
| --- | --- |
| Monitoring | Collecting, processing, and aggregating metrics to understand system health |
| Observability | Ability to understand internal state from external outputs (metrics, logs, traces) |
| Alerting | Notifying humans when systems need attention |
| SLOs/SLIs | Measuring and targeting reliability with Service Level Objectives and Indicators |
| Incident Management | Detecting, responding to, and learning from outages |
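
To make the SLO/SLI entry concrete, here is a minimal sketch of how an availability SLI and the remaining error budget could be computed from request counts. The function names, the 99.9% target, and the request numbers are illustrative assumptions, not taken from any particular tool.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Availability SLI: fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - failed_requests) / total_requests


def error_budget_remaining(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent.

    1.0 means no budget used, 0.0 means fully spent,
    and a negative value means the SLO has been missed.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The error budget is the number of failures the SLO tolerates.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical 30-day window: 1,000,000 requests, 250 failures, 99.9% SLO.
    print(availability_sli(1_000_000, 250))
    print(error_budget_remaining(0.999, 1_000_000, 250))
```

With these hypothetical numbers the SLO tolerates about 1,000 failed requests, so 250 failures consume roughly a quarter of the budget. Alerting on how fast the budget is being spent, rather than on individual errors, is one common way to tie alerts back to the SLO.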

Key Tools

| Tool | Purpose | Link |
| --- | --- | --- |
| Prometheus | Metrics collection and alerting | prometheus.io |
| Grafana | Dashboards and visualization | grafana.com |
| Alertmanager | Alert routing, grouping, and silencing | prometheus.io/docs/alerting |
| Jaeger | Distributed tracing | jaegertracing.io |
| OpenTelemetry | Unified observability framework | opentelemetry.io |
| PagerDuty | Incident response and on-call | pagerduty.com |

Google SRE Book — Observability Chapters

These chapters from the free Google SRE Book are directly relevant to observability and monitoring:

| Ch | Title | Why It Matters |
| --- | --- | --- |
| 4 | Service Level Objectives | Foundation for what to monitor and alert on |
| 6 | Monitoring Distributed Systems | Core monitoring philosophy and approach |
| 10 | Practical Alerting | How to build effective alerting systems |
| 12 | Effective Troubleshooting | Using observability data to diagnose issues |
| 15 | Postmortem Culture | Learning from failures to improve monitoring |
| 16 | Tracking Outages | Aggregating and analyzing incident data |
| 17 | Testing for Reliability | Validating monitoring and alerting in production |

See the full SRE Learning Path for all 34 chapters.

Further Reading

Contributing

Know of a great Observability & SRE resource? Submit a PR to help the community learn!