
Site Reliability Engineer (SRE) Interview Questions & Answers 2025

Master SRE interviews with 20+ comprehensive questions covering system design, monitoring, incident response, and automation. Practice with our AI-powered simulator for positions at Google, Meta, Amazon, and other reliability-focused companies.

Site Reliability Engineer Interview Questions

1. Explain SLI, SLO, and SLA. How do you define and measure them for a web service?
Expert Answer: SLI (Service Level Indicator) is a quantitative metric of service behavior—e.g., request latency, error rate, throughput. SLO (Service Level Objective) is a target value or range for an SLI—e.g., "99.9% of requests complete in <200ms." SLA (Service Level Agreement) is a business contract with consequences for missing commitments—typically looser than the internal SLO, so the team has buffer before penalties apply. For a web service: SLI = successful HTTP requests / total requests, measured via load balancer logs. SLO = 99.95% success rate over a 30-day window. Error budget = (1 − 0.9995) × total requests. When the budget is exhausted, prioritize reliability work over features. Track using monitoring tools, visualize in dashboards, alert when approaching limits.
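For illustration, a minimal sketch of the error-budget arithmetic above (the request counts are made-up numbers):

```python
# Error-budget math for a 99.95% availability SLO over a 30-day window.
# The request counts below are illustrative placeholders.

SLO_TARGET = 0.9995
total_requests = 120_000_000   # requests served in the 30-day window
failed_requests = 45_000       # responses counted as SLI failures

error_budget = (1 - SLO_TARGET) * total_requests   # failures you are allowed
budget_consumed = failed_requests / error_budget   # fraction of budget spent

print(f"Error budget: {error_budget:,.0f} failed requests")
print(f"Budget consumed: {budget_consumed:.1%}")
if budget_consumed >= 1.0:
    print("Budget exhausted: freeze feature work, prioritize reliability")
```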
2. How do you design a system to achieve 99.99% uptime (four nines reliability)?
Expert Answer: Four nines = 52.6 minutes downtime/year. Architecture: (1) Eliminate single points of failure—redundancy across zones/regions; (2) Load balancing with health checks; (3) Database replication (read replicas, multi-region); (4) Graceful degradation—serve cached/stale data when dependencies fail; (5) Circuit breakers prevent cascading failures; (6) Automated failover with monitoring; (7) Blue-green deployments minimize deployment risk; (8) Comprehensive monitoring and alerting; (9) Disaster recovery plan with regular drills; (10) Capacity planning for traffic spikes. Test failure scenarios regularly (chaos engineering). Document recovery procedures. Calculate cost vs reliability tradeoffs—five nines may not be business-justified.
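To make point (5) concrete, a minimal circuit-breaker sketch; the failure threshold and cooldown are illustrative values, not a production-ready implementation:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures,
    then allow a single trial call after a cooldown period."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.opened_at = None
            return result
```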
3. What's the difference between monitoring and observability? How do you implement both?
Expert Answer: Monitoring answers known questions—predefined metrics, dashboards, alerts for expected failure modes (CPU, memory, error rates). Observability answers unknown questions—ability to understand system behavior from outputs, debug novel problems. Three pillars: (1) Metrics—time-series data (Prometheus, Datadog), quantitative; (2) Logs—event records (ELK, Splunk), contextual; (3) Traces—request flow (Jaeger, Zipkin), distributed context. Implementation: instrument code with OpenTelemetry, collect all three signal types, correlate with trace IDs, query-based exploration. SRE approach: golden signals (latency, traffic, errors, saturation), RED method (Rate, Errors, Duration), USE method (Utilization, Saturation, Errors). Observability enables debugging production issues without deploying new code.
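As a small example of the RED method in practice, a sketch using the prometheus_client library (the metric names and the handle_request wrapper are illustrative):

```python
# RED-style instrumentation sketch with prometheus_client:
# Rate and Errors via a labeled counter, Duration via a histogram.
import time
from prometheus_client import Counter, Histogram

REQUESTS = Counter("http_requests_total", "Total HTTP requests",
                   ["route", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency",
                    ["route"])

def handle_request(route, handler):
    start = time.monotonic()
    status = "200"
    try:
        return handler()
    except Exception:
        status = "500"
        raise
    finally:
        REQUESTS.labels(route=route, status=status).inc()
        LATENCY.labels(route=route).observe(time.monotonic() - start)
```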
4. How do you design effective alerting without alert fatigue? What makes a good alert?
Expert Answer: A good alert indicates an actionable, customer-impacting issue requiring immediate human intervention. Principles: (1) Alert on symptoms (user-facing) not causes—e.g., "high error rate" not "disk full"; (2) Use SLO-based alerting—trigger when the error budget burn rate is too high; (3) Multi-window alerting—short window (5min) for fast response + long window (1hr) to avoid flapping; (4) Severity levels—P0 (page), P1 (ticket), P2 (investigate later); (5) Runbook links—every alert has investigation steps; (6) Alert tuning—regularly review alert quality, adjust thresholds. Avoid alerting on information-only events, redundant alerts, and untuned vendor-default alerts. Measure: alert precision (actionable/total), time to resolution, alert volume per shift.
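A sketch of the multi-window, multi-burn-rate check from points (2) and (3); the 14.4x and 6x thresholds are commonly cited defaults and should be tuned to your own SLO:

```python
# Multi-window, multi-burn-rate alert check (thresholds are illustrative).
SLO = 0.999                 # 99.9% availability objective
BUDGET = 1 - SLO            # allowed error fraction over the SLO window

def burn_rate(error_ratio):
    """How many times faster than 'sustainable' the budget is burning."""
    return error_ratio / BUDGET

def should_page(err_5m, err_1h):
    """Page only if BOTH the short and long windows show a fast burn,
    which filters out brief spikes (flapping)."""
    return burn_rate(err_5m) > 14.4 and burn_rate(err_1h) > 14.4

def should_ticket(err_30m, err_6h):
    """Slower burn: open a ticket instead of paging."""
    return burn_rate(err_30m) > 6 and burn_rate(err_6h) > 6

# Example: 2% errors over the last 5 minutes, 1.6% over the last hour
print(should_page(0.02, 0.016))   # True: burn rates are 20x and 16x
```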
5. Walk me through your incident response process from detection to resolution.
Expert Answer: Process stages: (1) **Detection**—alert fires or user report; (2) **Triage**—assess severity, determine whether it is an incident, page on-call; (3) **Investigation**—incident commander coordinates, gather logs/metrics, form hypothesis; (4) **Mitigation**—temporary fix to restore service (rollback, failover, scale resources); (5) **Resolution**—permanent fix deployed; (6) **Recovery**—verify service restored, clear alert; (7) **Postmortem**—blameless analysis within 48hrs, document timeline, root cause, action items. Communication: status page updates, stakeholder notifications. Roles: incident commander (coordinates), subject matter experts (investigate), communication lead (updates). Track: MTTA (mean time to acknowledge), MTTI (mean time to identify), MTTR (mean time to repair). Tools: PagerDuty, Slack war rooms, incident.io.
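A small sketch of how the tracked metrics fall out of incident timestamps (the incident data is made up):

```python
# Computing MTTA and MTTR from incident timestamps (illustrative data).
from datetime import datetime
from statistics import mean

incidents = [
    # (detected, acknowledged, resolved)
    (datetime(2025, 1, 3, 10, 0), datetime(2025, 1, 3, 10, 4), datetime(2025, 1, 3, 11, 15)),
    (datetime(2025, 1, 9, 2, 30), datetime(2025, 1, 9, 2, 41), datetime(2025, 1, 9, 3, 5)),
]

mtta = mean((ack - det).total_seconds() / 60 for det, ack, _ in incidents)
mttr = mean((res - det).total_seconds() / 60 for det, _, res in incidents)
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```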
6. Explain blameless postmortems. What makes a good postmortem?
Expert Answer: Blameless postmortems focus on systems and processes, not individuals, to encourage learning without fear. A good postmortem includes: (1) **Timeline**—chronological events with timestamps; (2) **Impact**—duration, affected users, revenue loss; (3) **Root cause**—deep analysis (Five Whys), not surface symptoms; (4) **Contributing factors**—multiple causes, not a single point; (5) **Action items**—specific, assigned, tracked (prevent, detect, mitigate); (6) **What went well**—positive aspects to reinforce. Format: shared widely, reviewed in team meetings, tracked in a central repository. Key principle: human error is a symptom of a system problem—fix systems, don't blame people. Example action items: add monitoring, improve alerting, automate runbooks, improve documentation. Follow-up: review action items quarterly, measure incident recurrence.
7. How do you implement distributed tracing for a microservices architecture?
Expert Answer: Distributed tracing tracks requests across service boundaries. Implementation: (1) Generate unique trace ID at entry point; (2) Propagate trace ID in request headers (W3C Trace Context); (3) Create span for each operation with start/end times; (4) Include metadata (service name, operation, tags); (5) Report spans to collector (Jaeger, Zipkin, Tempo); (6) Visualize as flame graphs. Instrumentation: use OpenTelemetry SDKs (language-agnostic), auto-instrument frameworks (Spring, Flask), manually instrument custom code. Sampling strategies: head-based (sample at entry), tail-based (sample after seeing full trace), rate limiting for high-traffic services. Use cases: latency debugging, dependency mapping, error investigation, performance optimization. Critical for understanding request flow in distributed systems.
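A hedged sketch of W3C Trace Context propagation with the OpenTelemetry Python API; it assumes a tracer provider and exporter are configured elsewhere, and the http_client and service names are illustrative:

```python
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("checkout-service")

def call_downstream(http_client, url):
    # Outgoing request: create a span and inject the traceparent header.
    with tracer.start_as_current_span("call-payments") as span:
        headers = {}
        inject(headers)                      # adds the W3C "traceparent" header
        span.set_attribute("peer.url", url)
        return http_client.get(url, headers=headers)

def handle_incoming(request_headers):
    # Incoming request: continue the trace from the caller's context.
    ctx = extract(request_headers)
    with tracer.start_as_current_span("handle-payment", context=ctx):
        pass  # business logic; child spans created here share the trace ID
```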
8. What is chaos engineering? How would you implement it safely?
Expert Answer: Chaos engineering proactively injects failures to validate system resilience and uncover weaknesses before production incidents. Implementation steps: (1) **Steady state**—define normal metrics (SLIs); (2) **Hypothesis**—predict system behavior under failure; (3) **Inject failure**—controlled experiment (kill instance, add latency, fill disk); (4) **Monitor impact**—compare metrics to steady state; (5) **Learn and improve**—fix issues found, document blast radius. Start small: non-production, single service, low-severity failures. Tools: Chaos Monkey (random instance termination), Gremlin (controlled experiments), LitmusChaos (Kubernetes). Safety: automated rollback, gradually increase scope (GameDays), informed stakeholders, monitoring ready. Benefits: validates disaster recovery, builds muscle memory, surfaces hidden dependencies. Culture: embrace failure as learning opportunity.
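A skeleton of the steady-state / hypothesis / inject / verify loop described above; get_error_rate, inject_latency, and remove_latency are placeholders for your monitoring API and fault-injection tool:

```python
import time

STEADY_STATE_MAX_ERROR_RATE = 0.001   # hypothesis: errors stay under 0.1%

def run_latency_experiment(get_error_rate, inject_latency, remove_latency):
    baseline = get_error_rate()
    if baseline > STEADY_STATE_MAX_ERROR_RATE:
        return "aborted: system not in steady state"

    inject_latency(service="payments", delay_ms=300)
    try:
        time.sleep(120)                      # observation window
        observed = get_error_rate()
    finally:
        remove_latency(service="payments")   # automated rollback, always

    if observed <= STEADY_STATE_MAX_ERROR_RATE:
        return "hypothesis held: system tolerated 300ms of added latency"
    return f"weakness found: error rate rose to {observed:.2%}"
```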
9. Explain Infrastructure as Code (IaC). Compare Terraform, CloudFormation, and Ansible.
Expert Answer: IaC manages infrastructure through version-controlled configuration files, enabling reproducibility and automation. **Terraform**: declarative, cloud-agnostic (AWS, GCP, Azure), large provider ecosystem, state management, plan/apply workflow. Best for: multi-cloud, reusable modules. **CloudFormation**: AWS-native, declarative, integrated with AWS services, ChangeSet preview, automatic rollback. Best for: AWS-only environments, deep AWS integration. **Ansible**: procedural (imperative), configuration management + provisioning, agentless (SSH), YAML playbooks. Best for: server configuration, app deployment. Key IaC principles: version control (Git), code review for infrastructure changes, automated testing, immutable infrastructure, avoid manual changes. Use Terraform for infrastructure provisioning and Ansible for configuration management; combining the two is common.
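As a rough illustration of the plan/apply workflow driven from a CI job, a sketch that wraps the Terraform CLI; the directory layout and approval gate are assumptions, not a prescribed setup:

```python
# Minimal plan/apply wrapper (assumes the terraform CLI is installed).
import subprocess

def run(cmd, cwd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, cwd=cwd, check=True)

def deploy(workdir="infra/prod", auto_approve=False):
    run(["terraform", "init", "-input=false"], workdir)
    # The plan is saved to a file so exactly what was reviewed gets applied.
    run(["terraform", "plan", "-input=false", "-out=tfplan"], workdir)
    if not auto_approve:
        # In a real pipeline a human review or policy check gates the apply.
        raise SystemExit("plan complete; approve to apply")
    run(["terraform", "apply", "-input=false", "tfplan"], workdir)
```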
10. How would you design and implement a CI/CD pipeline with reliability best practices?
Expert Answer: Reliable CI/CD pipeline stages: (1) **Source**—Git trigger, branch protection; (2) **Build**—containerized build environment, dependency caching, fail fast; (3) **Test**—unit tests (fast feedback), integration tests, contract tests, security scanning; (4) **Stage deployment**—deploy to staging environment identical to production; (5) **Pre-production validation**—smoke tests, load tests, security scans; (6) **Production deployment**—rolling deployment, canary releases (5% → 50% → 100%); (7) **Post-deployment**—automated health checks, rollback on failure, monitoring alerts. Reliability practices: immutable artifacts, database migrations as separate pipeline, secrets management (Vault, AWS Secrets Manager), audit logging, deployment approvals for production. Tools: Jenkins, GitLab CI, GitHub Actions, Spinnaker for deployments. Measure: deployment frequency, lead time, change failure rate, MTTR.
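A sketch of the canary stage from steps (6) and (7): shift traffic in increments, observe health after each step, and roll back automatically on failure. set_traffic_split, error_rate, and rollback stand in for your deployment tooling:

```python
import time

CANARY_STEPS = [5, 50, 100]   # percent of traffic on the new version
MAX_ERROR_RATE = 0.01         # abort threshold
SOAK_SECONDS = 300            # observation period per step

def canary_deploy(set_traffic_split, error_rate, rollback):
    for percent in CANARY_STEPS:
        set_traffic_split(new_version_percent=percent)
        time.sleep(SOAK_SECONDS)
        if error_rate(window_seconds=SOAK_SECONDS) > MAX_ERROR_RATE:
            rollback()
            return f"rolled back at {percent}% canary traffic"
    return "deployment promoted to 100%"
```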

Related Interview Guides

DevOps Engineer

CI/CD, containerization, cloud infrastructure, and automation questions

Software Engineer

Algorithms, data structures, system design, and coding interview preparation

Backend Developer

API design, databases, scalability, and backend architecture interview prep

Machine Learning Engineer

ML algorithms, model deployment, system design, and MLOps practices
