Key Responsibilities

Drive stability, resiliency, and performance of mission-critical platforms supporting financial services
Define and enforce reliability standards, SLIs, SLOs, and error budgets for cloud and on-premises environments
Lead incident management, automation, and observability initiatives to reduce toil and improve system reliability
Collaborate with cross-functional teams to implement scalable, secure, and high-performance solutions
Mentor peers and provide technical leadership on reliability engineering best practices
Analyze complex technical challenges and recommend innovations to enhance operations

Requirements

5+ years of software engineering experience with a focus on observability and monitoring tools (Splunk, AppDynamics, Grafana, OpenTelemetry)
5+ years of experience in infrastructure support (Windows and Linux) and cloud application management (OpenShift, Kubernetes)
Proven success in toil reduction initiatives and capacity optimization for mission-critical applications
Strong expertise in scripting (Python, Shell) and Java performance tuning (heap sizing, garbage collection)
Ability to engage with stakeholders at all levels and influence technical decisions

Lead Software Engineer - SRE
