Key Responsibilities

Monitor availability and performance of production SaaS infrastructure hosted on AWS
Drive capacity planning and reliability improvements for cloud services
Own end-to-end incident management including emergency response, mitigation, and root cause analysis (RCA)
Build and contribute to platforms for full observability and automated incident response
Collaborate with DevOps, InfoSec, and Engineering teams to improve system reliability
Analyze performance metrics to identify bottlenecks and areas for improvement

Requirements

Pursuing or holding a Bachelor's degree in Computer Science, Information Technology, or a related engineering discipline
Hands-on experience with AWS services (EC2, RDS, Elasticsearch, ECS, Redis, SQS, Lambda, API Gateway, VPCs)
Familiarity with monitoring tools (New Relic, Grafana, Loggly, PagerDuty, Site24x7, Kibana)
Strong problem-solving skills and the ability to work in fast-paced environments
Basic scripting knowledge in Python or equivalent

Intern - SRE - JobLuxe
