Key Responsibilities:
System Reliability & Performance
• Monitor and maintain the availability of key services and applications.
• Participate in defining and improving SLIs, SLOs, and SLAs for system reliability.
• Identify and resolve performance bottlenecks and system inefficiencies.
Incident Management & Monitoring
• Assist in incident response, troubleshooting production issues, and conducting root cause analysis (RCA).
• Work on improving monitoring, logging, and alerting systems using tools like Prometheus, Grafana, and Elastic APM.
• Participate in on-call rotations and incident handling.
Java Coding & Debugging
• Write and debug Java-based applications to enhance system reliability.
• Analyze logs, troubleshoot performance issues, and optimize Java services.
• Gain hands-on experience with JVM monitoring, thread dumps, and heap analysis.
• Work closely with developers to improve the reliability of Java applications.
Automation & Infrastructure
• Work with infrastructure as code (IaC) using Helm or Ansible.
• Optimize system configurations for scalability and reliability.
• Automate operational tasks to improve system efficiency.
Collaboration & Learning
• Work closely with software engineers and senior SREs to enhance system reliability.
• Continuously develop knowledge of cloud computing (AWS, Azure, GCP), Kubernetes, and DevOps practices.
Skills & Qualifications
Required Skills
• Strong Java programming and debugging skills (must-have).
• Experience with Linux systems, networking, and cloud platforms (AWS, Azure, or GCP).
• Familiarity with monitoring tools like Prometheus, Grafana, or New Relic.
• Experience troubleshooting and analyzing Java application performance.
• Strong problem-solving skills and ability to analyze system issues.
Preferred Skills (Nice to Have)
• Scripting ability in Python, Bash, or Go for automation.
• Exposure to Kubernetes and containerization concepts.
• Experience with infrastructure-as-code tools like Terraform or Ansible.