About the Role

This is a software engineering role on an ML team. You'll own the systems that make large-scale model training fast, reliable, and pleasant to work with: the distributed training framework, the data pipelines feeding it, the performance characteristics of every step on the critical path, and the day-to-day developer experience for the researchers who depend on it.

You don't need to come in as an ML expert. You do need to be a strong engineer who gets excited about hard systems problems: squeezing throughput out of accelerator clusters, hunting down stragglers across hundreds of machines, designing abstractions that hold up as the codebase grows, and making the unglamorous parts of training infrastructure work well.

If you've ever looked at a large-scale system and thought "there's no reason this should be this slow / inefficient / hard to maintain / complex," this role is built for you.

Key Responsibilities

Build and maintain the distributed training framework: orchestration, checkpointing, fault tolerance, observability, and the ergonomics researchers interact with daily.
Profile end-to-end training runs and eliminate bottlenecks wherever they live: compute, memory, interconnect, storage, or the data pipeline.
Collaborate with researchers to translate model ideas into training code that runs efficiently, and flag when an architectural choice will be expensive before it ships.
Own a shared codebase the team relies on: correctness, readability, testing, and long-term maintainability matter as much as the benchmark numbers.
Work close to the metal where it pays off: write or integrate custom GPU kernels, tune collective communication, and exploit hardware features that off-the-shelf frameworks leave on the table.

Your skills and experience

2+ years of professional software engineering experience, ideally including work on performance-sensitive or distributed systems.
Strong software engineering fundamentals. You write clean, tested, maintainable Python, and you're comfortable reading and writing modern C++.
Real experience with performance work: profiling, optimization, and reasoning about systems where latency, throughput, and resource contention actually matter.
Comfort with distributed systems: you've debugged things that only break at scale and have intuitions for where they tend to go wrong.
A bias toward understanding systems end-to-end rather than treating any layer as a black box.
Familiarity with Kubernetes or similar environments for running and scaling large workloads.

* ML training experience is a bonus. If you have it, great, but we'd rather hire a strong systems engineer who's curious about ML than an ML engineer who's lukewarm about infrastructure.

Nice to have

Working knowledge of at least one accelerator architecture (GPU, TPU, or similar), or a clear track record of going deep on hardware when the problem calls for it.
Experience with JAX/Pallas, Triton, CUDA, OpenCL, Metal, or similar accelerator programming.
Prior exposure to ML training pipelines, even informally; pet projects count.

Software Engineer (Large Scale Training)
