CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025. Learn more at www.coreweave.com.
What You’ll Do:
The Global Command Center is the nerve center of CoreWeave's global data center fleet. We don't just monitor infrastructure; we engineer the systems that make our massive scale manageable. This team is building the operational technology layer that powers real-time visibility, intelligent alerting, and automated response across one of the world's most advanced GPU compute platforms.
About the role:
As the Command Center Systems & Automation Engineer, you are the technical engine behind our operational platform. You will own the tools, integrations, and automation that give our operators the intelligence they need to act fast and accurately. From enhancing our unified monitoring platform to refining alert intelligence and automating repetitive tasks, your work directly reduces risk and improves uptime across CoreWeave’s global infrastructure.
- Own the evolution of the Command Center's "Single Pane of Glass", integrating signals from DCIM, EPMS, BMS, and other infrastructure platforms into a unified, real-time operational view.
- Transition alerting from threshold-based noise to intelligent, correlated logic using platforms such as Grafana, Prometheus, Ignition, or equivalent tools.
- Automate manual operator workflows, including reporting, vendor check-ins, initial triage, and ticket creation.
- Design and maintain data visualization dashboards and KPI reporting tools that give operators and leadership clear, actionable insight into fleet health and performance.
- Integrate and optimize Jira workflows to accelerate incident response, change management tracking, and task automation.
- Build and maintain technical runbooks and automated validation tools to ensure SOPs and MOPs are digitally enforced, not just documented.
- Engineer tracking for MTTD and MTTR, enabling data-driven performance improvement and RCA support.
- Partner with IT and Facilities Engineering teams to define system ownership boundaries and integration standards for in-scope platforms (DCIM, Ignition, BMS/EPMS).
Who You Are:
- 5+ years of experience in Data Center Operations, Site Reliability Engineering (SRE), Network Operations, or Advanced Manufacturing in a mission-critical environment.
- Proficient in Python, Go, or other scripting languages for infrastructure automation and workflow orchestration.
- Hands-on experience with