Job Title: Site Reliability Engineer
Department: Infrastructure & Support
Concept: Landmark Digital
Job Code/Req#:
Location: Bengaluru, India
Travel Required: As required
Reporting to: Associate General Manager - Product Analytics
Position Type: Full-time

Landmark Digital

Meet Landmark Digital. We’re part of the Landmark Group, one of the largest retail and hospitality organisations in the Middle East, North Africa and India. We’re guardians of the group’s digital arm, which encompasses Enterprise and E-commerce Tech, Product Management, User Design, Omni Operations, Customer Experience, Loyalty, Content Production, Studio, Growth and MarTech, Finance and HR functions. With a futuristic outlook, we strive to make the digital experience of our customers seamless.

Headquartered in Dubai, UAE, we’re currently driving the digital experience for eight industry-leading brands in eight geographies, and rapidly expanding our footprint across new territories and functions. Join us, and you’ll be part of the Middle East’s biggest bricks-to-clicks success story, which registers over 100% growth year-on-year.

Agile Work Culture

Within the digital function, you’ll be hands-on from day one, working in squads to make independent decisions and game-changing contributions that directly impact millions of customers. You’ll collaborate every day with teammates from 20+ nationalities, across Dubai, India, Europe and the US.

Space for Excellence

Landmark Digital has a dedicated and growing software development centre in Bangalore, India, where we incubate, design and optimise our products and experiences. We also offer the option to work remotely for certain key roles.

Focus on Learning & Growth

Excellent remuneration and perks are part of the package, but we also budget ample time and resources for training and upskilling.

Fun at Work

You’ll enjoy our well-equipped break-rooms (FIFA and pool are huge favourites), team events, celebrations and a host of other amenities.

Job Purpose

We’re looking for a Site Reliability Engineer (SRE) with strong technical and analytical skills to ensure the reliability, scalability, and performance of our core applications. This role focuses on improving the stability and efficiency of distributed systems built on Java and a microservices architecture, driving operational excellence through monitoring, automation, and incident management.

You’ll be part of the team that keeps our business-critical systems healthy by investigating production issues, implementing preventive measures, and collaborating with engineering teams to improve observability and resiliency.

Job Description

Application Reliability & Performance

• Monitor and maintain the health, performance, and reliability of production applications.
• Define, measure, and track SLIs/SLOs for key services, driving improvements proactively.
• Identify performance bottlenecks, memory leaks, and slow transactions in Java-based microservices.
• Partner with development teams to design and deploy resilient, fault-tolerant systems.
• Mentor developers and operations engineers on observability and debugging techniques.

Incident Management & Troubleshooting

• Actively participate in incident response, triaging application issues and restoring services quickly.
• Perform deep root-cause analysis for recurring incidents and ensure permanent fixes are implemented.
• Own the incident lifecycle, from detection to resolution and post-incident review.
• Ensure observability tools and alert thresholds are tuned to reduce false positives and improve signal quality.

Monitoring & Automation

• Enhance visibility across systems through better metrics, logs, and traces using Prometheus, Grafana, and Loki (or similar).


• Automate repetitive tasks: deployments, rollbacks, scaling, and diagnostics.
• Build or improve runbooks and self-healing mechanisms to reduce operational toil.
• Integrate AIOps capabilities for smarter alert correlation, anomaly detection, and incident prediction.

Operational Ownership

• Ensure production systems meet availability and performance targets.
• Track open issues, follow up on root cause actions, and drive closure with responsible teams.
• Collaborate with developers, infrastructure, and QA to maintain a consistent and stable release cycle.
• Contribute to continuous improvement of deployment, monitoring, and rollback processes.

Collaboration & Communication

• Work closely with product and platform engineering to integrate reliability into system design.
• Communicate incident status, RCA findings, and reliability metrics to stakeholders.
• Foster a reliability-first culture and advocate for operational excellence across teams.

Knowledge, Skills & Experience

Job Experience: 2–5 years of experience in Site Reliability Engineering or Application Operations.

Required Skills:
• 2–5 years of experience in Site Reliability Engineering or Application Operations.
• Solid understanding of Java, Spring Boot, and microservices architecture.
• Proficiency in monitoring and observability tools (Prometheus, Grafana, Loki, New Relic, or equivalent).
• Familiarity with Kubernetes, containers, and CI/CD pipelines.
• Familiarity with incident management, RCA, and performance debugging.
• Experience with cloud platforms (AWS, Azure, or GCP).
• Strong scripting skills (Bash, Python, or Go) for automation and diagnostics.
• Good communication and stakeholder collaboration skills.
• Experience with modern observability tools.
• Familiarity with ITSM or ticketing tools (Jira, ServiceNow) for issue tracking.
• Knowledge of security and compliance in production environments.
• Hands-on experience with JVM performance tuning and runtime diagnostics to improve Java service performance.
• Exposure to using AI or LLM-based tools for alert correlation, log analysis, root-cause detection, or automated diagnostics.

