About the Role
As a Senior MLOps Engineer, you will be the backbone of our AI production lifecycle. You will bridge the gap between research and real-world application, ensuring our Data Scientists, AI Researchers, Product teams, and others in the company have the high-performance infrastructure, automated pipelines, and deployment strategies needed to ship state-of-the-art models and agents at scale.
What You’ll Be Doing
Infrastructure & Platform Engineering
- Infrastructure as Code (IaC): Design and maintain scalable cloud environments (GCP/AWS) using Terraform.
- Resource Provisioning: Manage GPU/TPU resource allocation for training, fine-tuning, and interactive notebooks.
- Custom Tooling: Build internal services and CLI tools to streamline the developer experience for the AI team.
ML & LLM Orchestration & Observability
- Automated Pipelines: Design CI/CD and training pipelines using tools such as GitHub Actions, MLflow, and Vertex AI Pipelines. Ensure high-quality training data (e.g., by introducing a feature store).
- Deployment Methodology: Develop reusable patterns for model serving. Manage service deployments on Kubernetes.
- Vector Infrastructure: Manage and optimize vector databases and embedding pipelines for RAG-based systems.
- Observability and Reliability: Monitor model drift and resource utilization, and set up LLM and agent tracing.
Performance & Optimization
- Inference Optimization: Implement techniques to reduce latency and increase throughput (quantization, distillation, etc.).
- Cold Start Mitigation: Solve scaling bottlenecks for serverless or containerized model deployments.
- Cost Management: Optimize GPU utilization and cloud spend without compromising performance.
AI Enablement
- Support AI Agent Deployment: Define and create tooling and service templates around agent deployment (tool libraries, tracing, default agent frameworks, skills, etc.).
- Enablement for Non-Technical Agent Users: Help create workflows and guidance on no-code/low-code agent platforms (n8n, LangSmith, or similar). Create tooling and policies to enable safe usage of local agents such as Claude Code.
Who We’re Looking For
- 5+ years of experience with cloud infrastructure and infrastructure as code.
- Previous experience with the ML and LLM lifecycle: training, hosting, optimization, and observability.
- Accustomed to working closely with researchers and data scientists, taking experiments from worksheets into production.
- Strong grasp of ML fundamentals and the modern GenAI stack.