Key Responsibilities
- Drive stability, resiliency, and performance of mission-critical platforms supporting financial services
- Define and enforce reliability standards, SLIs, SLOs, and error budgets for cloud and on-premises environments
- Lead incident management, automation, and observability initiatives to reduce toil and improve system reliability
- Collaborate with cross-functional teams to implement scalable, secure, and high-performance solutions
- Mentor peers and provide technical leadership on reliability engineering best practices
- Analyze complex technical problems and recommend improvements that enhance operations
Requirements
- 5+ years of software engineering experience with a focus on observability and monitoring tools (Splunk, AppDynamics, Grafana, OpenTelemetry)
- 5+ years in infrastructure support (Windows and Linux) and cloud application management (OpenShift, Kubernetes)
- Proven success in toil reduction initiatives and capacity optimization for mission-critical applications
- Strong expertise in scripting (Python, shell) and Java performance tuning (heap, garbage collection)
- Ability to engage with stakeholders at all levels and influence technical decisions