Key Responsibilities
- Design and implement full-stack observability by evaluating and improving monitoring and metrics solutions
- Lead blameless incident response and post-mortems to enhance system reliability
- Mentor engineers in logging, monitoring, and reliability best practices
- Define and track KPIs for platform reliability and performance with engineering leadership
- Deploy infrastructure updates using Terraform on AWS
- Build proofs of concept for logging and metrics across frameworks and languages
Requirements
- Bachelor’s degree
- Five years of experience as a Site Reliability Engineer or DevOps Engineer, or in a similar role
- Strong experience with Infrastructure as Code, specifically Terraform
- Hands-on experience managing cloud infrastructure in AWS
- Knowledge of monitoring, logging, and observability tools