SRE Practices

Master Site Reliability Engineering to build and operate reliable, scalable, and efficient systems.

What is Site Reliability Engineering?

Site Reliability Engineering (SRE) is an engineering discipline focused on building, maintaining, and evolving systems that are reliable, scalable, and efficient. SRE applies software engineering principles to operations.

SRE Principles

Reliability is a Feature — Reliability is not an afterthought but a primary requirement
Operations as Software Engineering — Use code automation for operational tasks
Measure Everything — Quantify reliability and use data to make decisions
Embrace Risk — Not all reliability is good (cost trade-off)
Sustainable Operations — Operations should be sustainable, not burnout-inducing

SRE Culture

Blameless postmortems — Learn from failures without blaming individuals
Collaboration — Development and operations work together
Automation — Reduce toil through automation
On-call sustainability — On-call work is sustainable, not a punishment
Learning and growth — Invest in team development and skills

SLOs, SLIs, and SLAs

The three pillars of SRE reliability measurement.

SLI (Service Level Indicator)

An SLI is a specific, measurable characteristic of service performance. It's what you actually measure.

Examples:

Percentage of requests successfully completed (success rate)
Percentage of requests completing within latency < 200ms
Percentage of uptime (availability)
Error rate
Data freshness (for data pipelines)
Throughput (requests per second)

Good SLI Properties:

Directly related to user experience
Measurable and quantifiable
Easy to collect data
Reflects what users care about

SLO (Service Level Objective)

An SLO is a target value for an SLI. It's the goal you're trying to meet.

Examples:

99% of requests complete successfully
99.9% of requests complete within 200ms
99.99% availability (52 minutes downtime per year)
95% of users experience error rate < 0.5%
99% data freshness (99% of data < 5 minutes old)

Setting SLOs:

Based on user expectations
Based on business requirements
Based on operational constraints
Based on cost-benefit analysis

SLA (Service Level Agreement)

An SLA is a contractual commitment to users about service levels. Violations can result in penalties.

Example SLA:

"We guarantee 99.95% uptime. If we fall below this, you get 10% credit on monthly bill."

Relationship Between SLI, SLO, SLA

SLI: What you measure (99.8% uptime)
SLO: What you target (99.95% uptime)
SLA: What you guarantee (99.95% uptime with penalties)

Typical: SLI < SLO <= SLA
         (measured result) < (target) <= (guaranteed)

Error Budgets

An error budget is the maximum amount of errors allowed while still meeting your SLO. It's the inverse of your SLO.

Calculating Error Budget

Error Budget = (1 - SLO) * Total Time

Example:
SLO: 99.9% uptime
Error Budget: (1 - 0.999) = 0.001 = 0.1%
Per month (30 days): 0.001 * 30 * 24 * 60 = 43.2 minutes
Per year: 0.001 * 365 * 24 * 60 = 525.6 minutes (8.7 hours)

Using Error Budget

Principle: Once your error budget is consumed, you stop deploying and focus on stability.

Month: January
SLO: 99.9% (43.2 min error budget)
↓
Jan 5: Major outage consumes 30 min
→ Remaining budget: 13.2 min
↓
Jan 20: Want to deploy risky feature
→ Risk assessment: 70% chance of issues
→ Not safe with only 13.2 min budget
→ Postpone deployment until next month
↓
Month: February
→ Error budget resets
→ Deploy risky feature

Error Budget Policies

Conservative: Only deploy when error budget > 50% Moderate: Deploy when budget > 20% Aggressive: Deploy whenever budget > 0%

Benefits of Error Budgets

Aligns incentives — Dev/Ops are motivated by same metric
Enables risk-taking — Safe to experiment when budget allows
Drives reliability investment — Unreliable features consume budget
Fair to all services — Same rules for all teams

Toil and Toil Reduction

Toil is the operational work that doesn't add lasting value. It grows with traffic but doesn't improve your system.

Toil Characteristics

Manual — Repetitive manual work
Automatable — Could be done by a machine
Tactical — Reactive rather than strategic
Non-value-adding — Doesn't improve reliability
Grows linearly — Increases with service size

Examples of Toil

Manually restarting failed services
Responding to the same alert repeatedly
Manually scaling systems
Manually provisioning servers
Copying data between systems by hand
Manual backups

Examples of Non-Toil Work

Building monitoring systems
Writing automation scripts
Designing disaster recovery procedures
Improving system architecture
Reducing alert noise
Capacity planning

Toil Reduction Strategy

Goal: Reduce toil to < 50% of SRE work

SRE Time Allocation:
Toil:           30-50%
On-call:        5-10%
Project work:   30-40%
Training:       5-10%

Approach:

Measure toil — How much time is spent on toil?
Identify toil — What specific tasks are toil?
Automate — Write code to replace manual work
Monitor results — Verify automation worked
Repeat — Continuously reduce toil

Automation Benefits

Faster response (machines are faster than humans)
No human error
Scalable (handle increased load without more people)
Free up humans for higher-value work

Incident Management

Effective incident response minimizes customer impact and recovery time.

Incident Severity

P1 (Critical): Customer-facing service down or severely degraded

Response time: Immediate
All hands on deck
Example: Authentication service down (99% of users affected)

P2 (High): Significant service degradation but workarounds exist

Response time: < 30 minutes
Major incident commander
Example: Payment processing slow but accepting transactions

P3 (Medium): Minor service issues, low user impact

Response time: < 4 hours
Assigned to on-call engineer
Example: Report generation taking longer than normal

P4 (Low): Cosmetic issues, no user impact

Response time: Handle during business hours
Can wait for next sprint
Example: Typo in error message

Incident Response Process

1. DETECTION
   - Monitoring alerts
   - User reports
   - Proactive checks

2. ACKNOWLEDGMENT (P1: < 5 min)
   - On-call engineer acknowledges
   - Incident commander assigned
   - Team assembled

3. INVESTIGATION (P1: 15-60 min)
   - Gather information
   - Check logs, metrics, traces
   - Understand impact scope
   - Identify root cause

4. MITIGATION (P1: 15-60 min total)
   - Implement temporary fix if needed
   - Reduce customer impact
   - Prevent further degradation

5. RESOLUTION (varies)
   - Apply permanent fix
   - Verify systems recovering
   - Monitor for side effects

6. COMMUNICATION
   - Status updates every 30 min
   - Customer-facing communication
   - Stakeholder notifications

7. CLOSEOUT
   - Incident declared resolved
   - Monitoring verified
   - Schedule postmortem

On-Call Best Practices

On-call shifts: 1-2 weeks, not longer
Handoff process: Clear transition between engineers
Clear escalation: Who to contact if primary on-call unreachable
Tool access: On-call engineer has all necessary access
Documentation: Runbooks for common issues
Coverage: Never a single point of failure for on-call

Postmortems

Postmortems are structured retrospectives after incidents focused on learning, not blame.

Postmortem Timeline

Incident occurs → 24-72 hours → Postmortem held → Weeks → Review follow-ups

Postmortem Structure

1. Incident Summary

What happened
When it started and ended
Customer impact
How was it resolved

2. Timeline

05 - Alert fired: High error rate
07 - On-call acknowledged alert
10 - Investigated error logs
15 - Identified database connection pool exhaustion
25 - Restarted database service
35 - Verified system recovered

3. Root Cause Analysis

Why did the database connection pool exhaust?
Why wasn't this caught earlier?
Were there warning signs?

Example Root Cause Chain:

Database connection pool exhausted
  ↑
Due to: New feature using more connections
  ↑
Due to: Feature didn't implement connection pooling
  ↑
Due to: Code review didn't catch this pattern
  ↑
Due to: No automated check for this pattern

4. Contributing Factors

What conditions made this worse?
Were there multiple problems?
What could have reduced impact?

5. Action Items

Add monitoring for connection pool usage
Implement linting rule for connection pooling
Update code review checklist
Load test new features before deployment
Update runbook with mitigation steps

Postmortem Principles

Blameless culture — Focus on systems, not individuals
Psychological safety — People must feel safe speaking up
Actionable items — Create concrete improvements
Follow-up — Track items to completion
Share learnings — Document and share with organization

Capacity Planning

Ensuring systems have sufficient resources to handle current and future load.

Capacity Planning Process

1. MEASURE
   - Current resource usage (CPU, memory, disk, bandwidth)
   - Growth rate (requests/day, data growth)
   - Peak vs average load

2. FORECAST
   - Project future usage
   - Account for seasonal peaks
   - Plan for growth scenarios

3. PLAN
   - When will you hit 70% capacity?
   - When will you hit 90% capacity?
   - How long to provision new capacity?

4. PROVISION
   - Add resources before hitting limits
   - Test systems with new capacity
   - Verify performance improvement

5. MONITOR
   - Track actual vs projected usage
   - Adjust forecasts as needed
   - Plan next capacity addition

Capacity Metrics

CPU utilization — Target: < 70% peak, < 50% average
Memory utilization — Target: < 80%
Disk usage — Target: < 80% (plan for 2x growth)
Network bandwidth — Plan for peak load + headroom
Database connections — Monitor pool exhaustion

Load Testing

Test systems at expected load:

# Simulate 10,000 concurrent users for 1 hour
$ load-test --concurrency=10000 --duration=3600 https://example.com

# Results:
# Average response time: 150ms
# P99 response time: 750ms
# Error rate: 0.01%
# Requests/sec: 8,500

Best Practices

1. Measure Reliability

Define SLIs that matter to users
Set realistic SLOs
Track error budgets
Monitor actual reliability

2. Automate Toil

Document manual processes
Identify candidates for automation
Build tools to replace manual work
Continuously reduce toil

3. Prevent Incidents

Implement comprehensive monitoring
Alert on symptoms, not just metrics
Build in redundancy and graceful degradation
Test failure scenarios regularly

4. Respond Effectively

Have clear incident response process
Train team on incident response
Use runbooks for common issues
Conduct incident drills

5. Learn from Failures

Hold blameless postmortems
Track action items to completion
Share learnings with organization
Update systems based on learnings

6. Balance Reliability and Velocity

Use error budgets to guide deployment decisions
Don't over-invest in reliability (cost trade-off)
Enable risk-taking when budget allows
Make speed and reliability compatible

Exercises

Exercise 1: Define SLOs

For a web application, define:

Three meaningful SLIs
SLOs for each SLI
Rationale for each SLO choice
Error budget for each

Exercise 2: Root Cause Analysis

Given an incident description:

Construct timeline of events
Identify immediate cause
Find root cause through 5 Whys
List contributing factors
Propose preventative action items

Exercise 3: Capacity Planning

For a service with current metrics:

Project usage 3, 6, 12 months out
Determine when to provision new capacity
Plan resource additions
Calculate costs of scaling

Exercise 4: Automation Project

Identify toil in your operations:

What manual tasks repeat regularly?
Which have highest impact if automated?
Create plan to automate top 3 tasks
Estimate time saved

Exercise 5: Postmortem Facilitation

Practice writing and facilitating a postmortem:

Write timeline of incident
Identify root cause
Create action items
Present to team without blame

Key Takeaways

SRE combines software engineering with operations to build reliable systems
SLIs measure what matters, SLOs set targets, and error budgets guide decisions
Error budgets align incentives between development and operations
Toil reduction frees up humans for higher-value work
Effective incident management minimizes customer impact and recovery time
Blameless postmortems drive learning and continuous improvement
Capacity planning prevents performance degradation under growth
Automation and monitoring are critical to SRE success
Reliability is a feature requiring intentional design and investment

What is Site Reliability Engineering?​

SRE Principles​

SRE Culture​

SLOs, SLIs, and SLAs​

SLI (Service Level Indicator)​

SLO (Service Level Objective)​

SLA (Service Level Agreement)​

Relationship Between SLI, SLO, SLA​

Error Budgets​

Calculating Error Budget​

Using Error Budget​

Error Budget Policies​

Benefits of Error Budgets​

Toil and Toil Reduction​

Toil Characteristics​

Examples of Toil​

Examples of Non-Toil Work​

Toil Reduction Strategy​

Automation Benefits​

Incident Management​

Incident Severity​

Incident Response Process​

On-Call Best Practices​

Postmortems​

Postmortem Timeline​

Postmortem Structure​

Postmortem Principles​

Capacity Planning​

Capacity Planning Process​

Capacity Metrics​

Load Testing​

Best Practices​

1. Measure Reliability​

2. Automate Toil​

3. Prevent Incidents​

4. Respond Effectively​

5. Learn from Failures​

6. Balance Reliability and Velocity​

Exercises​

Exercise 1: Define SLOs​

Exercise 2: Root Cause Analysis​

Exercise 3: Capacity Planning​

Exercise 4: Automation Project​

Exercise 5: Postmortem Facilitation​

Key Takeaways​