CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025. Learn more at www.coreweave.com .
What You’ll Do:
CoreWeave is seeking a Staff Technical Program Manager to lead complex, cross-functional programs across Cluster Orchestration and Applied Training within our AI/ML Platform Services organization.
Cluster Orchestration is the platform layer that makes sure large AI workloads are scheduled, launched, and managed reliably across CoreWeave’s clusters. Applied Training is the layer on top of that infrastructure that helps researchers and customers use it for pre-training, fine-tuning, reinforcement learning, evaluations, and sandboxed experimentation.
About the role:
In this role, you will partner with engineering, product, infrastructure, and research-adjacent teams to improve both how workloads run on the cluster and how users interact with the training platform built on top of it. That includes driving programs across orchestration systems such as Slurm-on-Kubernetes (SUNK), Kueue, and workflow integrations, while also helping scale the environments, tooling, and operational mechanisms that make training and evaluation workflows easier to use.
This is a highly cross-functional role for a TPM who combines strong technical depth, excellent execution instincts, and the ability to bring structure and clarity to fast-moving infrastructure and AI platform initiatives.
Responsibilities:
- Drive end-to-end program execution for cluster orchestration initiatives spanning workload scheduling, self-service provisioning, upgrade and migration flows, and platform integrations.
- Lead cross-functional programs that improve how AI training, evaluation, RL, and mixed workloads run across CoreWeave clusters.
- Partner with engineering and product leaders to define roadmap priorities and deliver measurable improvements in utilization, reliability, scalability, observability, and user experience.
- Drive delivery for applied training initiatives across pre-training, fine-tuning, reinforcement learning, sandbox environments, and evaluation systems.
- Coordinate dependencies across platform engineering, infrastructure, product, customer-facing teams, and ecosystem partners to ensure successful launches and clear operational ownership.
- Build program mechanisms for release readiness, rollout planning, risk management, stakeholder communication, and post-launch review.
- Establish success metrics, dashboards, and operating cadences to improve