Key Responsibilities:

• Monitor and maintain the availability of key services and applications.

• Participate in defining and improving SLIs, SLOs, and SLAs for system reliability.

• Identify and resolve performance bottlenecks and system inefficiencies.

• Assist in incident response, troubleshooting production issues, and conducting root cause analysis (RCA).

• Work on improving monitoring, logging, and alerting systems using tools like Prometheus, Grafana, and Elastic APM.

• Participate in on-call rotations and incident handling.

• Write and debug Java-based applications to enhance system reliability.

• Analyze logs, troubleshoot performance issues, and optimize Java services.

• Gain hands-on experience with JVM monitoring, thread dumps, and heap analysis.

• Work closely with developers to improve the reliability of Java applications.

• Work with infrastructure as code (IaC) using tools such as Helm or Ansible.

• Optimize system configurations for scalability and reliability.

• Automate operational tasks to improve system efficiency.

• Work closely with software engineers and senior SREs to enhance system reliability.

• Continuously develop knowledge in cloud computing (AWS, Azure, GCP), Kubernetes, and DevOps practices.

Skills & Qualifications:

• Strong Java programming and debugging skills (must-have).

• Experience with Linux systems, networking, and cloud platforms (AWS, Azure, or GCP).

• Familiarity with monitoring tools like Prometheus, Grafana, or New Relic.

• Experience troubleshooting and analyzing Java application performance.

• Strong problem-solving skills and ability to analyze system issues.

• Scripting ability in Python, Bash, or Go for automation.

• Exposure to Kubernetes and containerization concepts.

• Experience with infrastructure-as-code tools like Terraform or Ansible.