Job Title: Site Reliability Engineer
Department: Infrastructure & Support
Concept: Landmark Digital
Job Code/Req#: NA
Location: Bengaluru, India
Travel Required: As required
Reporting to: Associate General Manager - Product Analytics
Position Type: Full-time
Landmark Digital
Meet Landmark Digital – we’re part of the Landmark Group, one of the largest retail and hospitality organisations in
the Middle East, North Africa and India. We’re guardians of the group’s digital arm which encompasses Enterprise
and E-commerce Tech, Product Management, User Design, Omni Operations, Customer Experience, Loyalty,
Content Production, Studio, Growth and MarTech, Finance and HR functions. With a futuristic outlook we strive to
make the digital experience of our customers seamless.
Headquartered in Dubai, UAE, we’re currently driving the digital experience for eight industry-leading brands in eight
geographies, and rapidly expanding our footprint across new territories and functions. Join us, and you’ll be part of the
Middle East’s biggest bricks-to-clicks success story, one that registers over 100% growth year-on-year.
Agile Work Culture
Within the digital function, you’ll be hands-on from day one, working in squads to make independent decisions and
game-changing contributions that directly impact millions of customers. You’ll collaborate every day with teammates
from 20+ nationalities, across Dubai, India, Europe and the US.
Space for Excellence
Landmark Digital has a dedicated and growing software development centre in Bangalore, India, where we incubate,
design and optimise our products and experiences. We also offer the option to work remotely for certain key roles.
Focus on Learning & Growth
Excellent remuneration and perks are part of the package, but we also budget ample time and resources for training
and upskilling.
Fun at Work
You’ll enjoy our well-equipped break-rooms (FIFA and pool are huge favourites), team events, celebrations and a
host of other amenities.
Job Purpose
We’re looking for a Site Reliability Engineer (SRE) with strong technical and analytical skills to ensure the reliability,
scalability, and performance of our core applications. This role focuses on improving the stability and efficiency of
distributed systems built on Java and microservices architecture, driving operational excellence through
monitoring, automation, and incident management.
You’ll be part of the team that keeps our business-critical systems healthy — investigating production issues,
implementing preventive measures, and collaborating with engineering teams to improve observability and resiliency.
Job Description
Application Reliability & Performance
Monitor and maintain the health, performance, and reliability of production applications.
Define, measure, and track SLIs/SLOs for key services, driving improvements proactively.
Identify performance bottlenecks, memory leaks, and slow transactions in Java-based microservices.
Partner with development teams to design and deploy resilient, fault-tolerant systems.
Mentor developers and operations engineers on observability and debugging techniques.
Incident Management & Troubleshooting
Actively participate in incident response, triaging application issues and restoring services quickly.
Perform deep root-cause analysis for recurring incidents and ensure permanent fixes are implemented.
Own the incident lifecycle — from detection to resolution and post-incident review.
Ensure observability tools and alert thresholds are tuned to reduce false positives and improve signal quality.
Monitoring & Automation
Enhance visibility across systems through better metrics, logs, and traces using Prometheus, Grafana, and
Loki (or similar).
Automate repetitive tasks — deployments, rollbacks, scaling, and diagnostics.
Build or improve runbooks and self-healing mechanisms to reduce operational toil.
Integrate AIOps capabilities for smarter alert correlation, anomaly detection, and incident prediction.
Operational Ownership
Ensure production systems meet availability and performance targets.
Track open issues, follow up on root cause actions, and drive closure with responsible teams.
Collaborate with developers, infrastructure, and QA to maintain a consistent and stable release cycle.
Contribute to continuous improvement of deployment, monitoring, and rollback processes.
Collaboration & Communication
Work closely with product and platform engineering to integrate reliability into system design.
Communicate incident status, RCA findings, and reliability metrics to stakeholders.
Foster a reliability-first culture and advocate for operational excellence across teams.
Knowledge, Skills & Experience
Job Experience: 2–5 years of experience in Site Reliability Engineering or Application Operations.
Required Skills:
Solid understanding of Java, Spring Boot, and microservices architecture.
Proficiency in monitoring and observability tools (Prometheus, Grafana, Loki, New Relic, or equivalent).
Familiarity with Kubernetes, containers, and CI/CD pipelines.
Familiarity with incident management, RCA, and performance debugging.
Experience with cloud platforms (AWS, Azure, or GCP).
Strong scripting skills (Bash, Python, or Go) for automation and diagnostics.
Good communication and stakeholder collaboration skills.
Experience with modern observability tools.
Familiarity with ITSM or ticketing tools (Jira, ServiceNow) for issue tracking.
Knowledge of security and compliance in production environments.
Hands-on experience with JVM performance tuning and runtime diagnostics to improve Java service performance.
Exposure to using AI or LLM-based tools for alert correlation, log analysis, root-cause detection, or automated diagnostics.