About the Role

As a Senior MLOps Engineer, you will be the backbone of our AI production lifecycle. You will bridge the gap between research and real-world application, ensuring our Data Scientists, AI Researchers, Product teams, and others in the company have the high-performance infrastructure, automated pipelines, and deployment strategies needed to ship state-of-the-art models and agents at scale.

What You’ll Be Doing

Infrastructure & Platform Engineering

Infrastructure as Code (IaC): Design and maintain scalable cloud environments (GCP/AWS) using Terraform.
Resource Provisioning: Manage GPU/TPU resource allocation for training, fine-tuning, and interactive notebooks.
Custom Tooling: Build internal services and CLI tools to streamline the developer experience for the AI team.

ML & LLM Orchestration & Observability

Automated Pipelines: Design CI/CD and training pipelines using tools such as GitHub Actions, MLflow, and Vertex AI Pipelines. Ensure high-quality training data (e.g., by introducing a feature store).
Deployment Methodology: Develop reusable patterns for model serving. Manage service deployments to Kubernetes.
Vector Infrastructure: Manage and optimize vector databases and embedding pipelines for RAG-based systems.
Observability and Reliability: Implement model drift monitoring, resource utilisation tracking, and LLM and agent tracing.

Performance & Optimization

Inference Optimization: Implement techniques to reduce latency and increase throughput (quantisation, distillation, etc.).
Cold Start Mitigation: Solve scaling bottlenecks for serverless or containerized model deployments.
Cost Management: Optimize GPU utilization and cloud spend without compromising performance.

AI Enablement

Support AI Agent Deployment: Define and create tooling and service templates around agent deployment (tool libraries, tracing, default agent frameworks, skills, etc.).
Enablement for Non-Technical Agent Users: Help create workflows and guidance on no-code/low-code agent platforms (n8n, LangSmith, or similar). Create tooling and policies to enable safe usage of local agents such as Claude Code.

Who We’re Looking For

5+ years of experience with cloud infrastructure and infrastructure as code.
Previous experience with the ML and LLM lifecycle: training, hosting, optimisation, and observability.
Accustomed to working closely with researchers and data scientists, taking experiments from worksheets into production.
Strong grasp of ML fundamentals and the modern GenAI stack.

Senior MLOps Engineer