CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025. Learn more at www.coreweave.com.
About the Role
As a Staff Software Engineer, you will define and drive the technical vision for GPU performance validation and infrastructure testing across CoreWeave's global fleet. You will lead large-scale initiatives spanning hardware validation, performance benchmarking, Kubernetes infrastructure, and AI/ML platform reliability.
This role requires deep technical expertise combined with the ability to influence architecture, engineering practices, and organizational priorities across multiple teams. You will partner closely with Fleet Engineering, Infrastructure, Product, Hardware, and AI Platform teams to ensure CoreWeave delivers industry-leading performance, reliability, and efficiency for GPU workloads at hyperscale.
What You'll Do
- Define the long-term technical strategy and architecture for CoreWeave's GPU performance testing and validation platform.
- Lead the design and implementation of scalable systems for validating performance, reliability, and health across CoreWeave's global infrastructure footprint.
- Drive cross-functional initiatives spanning infrastructure testing, hardware qualification, fleet provisioning, and AI infrastructure performance optimization.
- Architect and develop backend services, APIs, and automation frameworks in Go and/or Python that support large-scale testing and validation workflows.
- Design and oversee Kubernetes-native testing platforms, operators, and controllers used across thousands of GPUs and clusters.
- Establish performance benchmarks, testing methodologies, and operational standards for new hardware platforms and infrastructure deployments.
- Influence engineering standards, deployment strategies, observability practices, and reliability frameworks across multiple teams.
- Identify and solve systemic performance bottlenecks impacting customer workloads, infrastructure efficiency, and fleet utilization.
- Partner with hardware vendors and internal stakeholders to evaluate emerging technologies and shape future infrastructure investments.
- Mentor senior engineers and act as a technical leader across the organization through design reviews, architecture discussions, and strategic initiatives.
- Serve as a key technical decision-maker during critical incidents involving performance, scalability, and infrastructure reliability.