Key Responsibilities
- Lead incident response and post-mortem analysis for critical systems
- Design and implement scalable reliability solutions
- Automate infrastructure and deployment pipelines
- Monitor system performance and proactively address issues
- Collaborate with engineering teams to improve system resilience
- Document incident response protocols and best practices
Requirements
- 5+ years of experience in site reliability engineering or incident response
- Expertise in Kubernetes, Docker, and cloud infrastructure
- Strong scripting and automation skills
- Experience with Terraform and infrastructure-as-code practices
- Excellent troubleshooting and communication skills