Key Responsibilities
- Define and operationalize Service Level Objectives (SLOs) and error budgets in production environments
- Design and implement reliability-focused monitoring, alerting, and incident response systems
- Conduct chaos engineering experiments to identify and mitigate system vulnerabilities
- Collaborate with development teams to improve system resilience and observability
- Optimize infrastructure performance and cost efficiency through automation and best practices
- Participate in on-call rotations and post-mortem analysis to drive continuous improvement

Requirements
- 3+ years of experience in site reliability engineering or production operations
- Hands-on experience with SLOs, error budgets, and reliability metrics
- Familiarity with chaos engineering tools and methodologies
- Strong scripting and automation skills (Python, Bash, etc.)
- Experience with cloud platforms (AWS, GCP, Azure) and container orchestration