About the Role

We’re looking for an AI Engineer in Distributed LLM Training & Infrastructure to work on large-scale model training infrastructure across distributed GPU environments. This role focuses on improving how LLMs are trained at scale, optimising performance, cost, and efficiency across multi-node systems.

Responsibilities

Build and optimise distributed LLM training pipelines using PyTorch
Work with frameworks such as Megatron-LM and DeepSpeed for large-scale training
Improve multi-node GPU performance (throughput, memory usage, NCCL communication)
Design and run benchmarking frameworks (tokens/sec, cost, MFU, latency)
Develop standardised training recipes and playbooks for production-grade environments

What You’ll Work On

Core LLM training systems (not application-layer AI)
Distributed systems challenges across multi-GPU, multi-node setups
Performance optimisation and scaling of large models in production environments

Ideal Background

5-7 years of experience in distributed ML / ML systems
Strong hands-on experience with PyTorch and multi-node, multi-GPU training
Deep understanding of parallelism strategies (FSDP, tensor parallelism, pipeline parallelism)
Exposure to Megatron-LM, DeepSpeed, or similar training frameworks
Strong focus on benchmarking, optimisation, and improving training efficiency

Why This Role

Work on high-impact problems in large-scale AI training and infrastructure
High ownership within a lean, senior team
Opportunity to define training standards and best practices for scalable AI systems

Artificial Intelligence Engineer
