Key Responsibilities
- Monitor availability and performance of production SaaS infrastructure hosted on AWS
- Drive capacity planning and reliability improvements for cloud services
- Own end-to-end incident management including emergency response, mitigation, and root cause analysis (RCA)
- Build and contribute to platforms for full observability and automated incident response
- Collaborate with DevOps, InfoSec, and Engineering teams to improve system reliability
- Analyze performance metrics to identify bottlenecks and areas for improvement
Requirements
- Pursuing or holding a Bachelor's degree in Computer Science, Information Technology, or a related engineering discipline
- Hands-on experience with AWS services (EC2, RDS, Elasticsearch, ECS, Redis, SQS, Lambda, API Gateway, VPCs)
- Familiarity with monitoring tools (New Relic, Grafana, Loggly, PagerDuty, Site24x7, Kibana)
- Strong problem-solving skills and the ability to work in fast-paced environments
- Basic scripting knowledge in Python or equivalent