Role Summary
We are looking for a technically strong AI/ML Developer with hands-on expertise in training and fine-tuning Small Language Models (SLMs) and building retrieval-augmented generation (RAG) solutions. The ideal candidate will drive AI solutions end to end, from dataset curation and pre-processing through training, evaluation, and production deployment. You will collaborate closely with product managers, product engineers, and infrastructure teams to build AI solutions that are efficient, scalable, and business-aligned.
Key Responsibilities
Model Design & Training
- Design, train, and fine-tune Small Language Models (SLMs) using frameworks such as PyTorch, TensorFlow, or JAX.
- Conduct experiments with supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and instruction tuning.
- Implement efficient training pipelines leveraging distributed training (DDP, FSDP) across GPU/TPU clusters.
- Perform hyperparameter optimisation, ablation studies, and model selection based on benchmark results.
- Develop and maintain data pipelines for collecting, cleaning, tokenising, and pre-processing large-scale training corpora.
Model Evaluation & Quality
- Define and implement evaluation frameworks covering metrics such as perplexity, BLEU, ROUGE, and BERTScore, alongside task-specific benchmarks.
- Conduct red-teaming, bias analysis, and safety evaluations to ensure responsible AI deployment.
- Benchmark models against established baselines (e.g., GPT-2, Phi, Mistral) and track performance over iterations.
- Collaborate with QA teams to build regression suites for model versioning and continuous evaluation.
MLOps & Deployment
- Containerise and deploy models using Docker, Kubernetes, and cloud-native ML platforms (AWS SageMaker / GCP Vertex AI / Azure ML).
- Build and maintain model registries, experiment tracking (MLflow, Comet), and reproducible training pipelines.
- Optimise inference performance through quantisation (INT8, INT4), pruning, distillation, and ONNX/TensorRT conversion.
- Monitor model drift, data drift, and performance degradation in production; implement automated retraining triggers.
Research & Innovation
- Stay current with state-of-the-art NLP/LLM research; prototype and validate new techniques from published literature.
- Contribute to internal knowledge sharing, technical documentation, and model cards.
- Explore parameter-efficient fine-tuning (PEFT) methods such as LoRA, QLoRA, and adapter layers.
- Investigate and apply mixture-of-experts (MoE), retrieval-augmented generation (RAG), and agentic workflows as needed.
Collaboration & Stakeholder Management
- Work with product managers and domain experts to translate business requirements into model objectives and KPIs.
- Communicate model capabilities, limitations, and trade-offs clearly to both technical and non-technical stakeholders.
- Participate in architecture reviews, sprint planning, and cross-functional design discussions.
Required Qualifications
Education
- Bachelor's or Master's degree in Computer Science, Data Science, Machine Learning, Statistics, or a related quantitative field.
- Equivalent professional experience with a strong portfolio of delivered AI/ML projects will be considered.
Experience
- 4–6 years of professional experience in machine learning or NLP engineering.
- At least 2 years of direct, hands-on experience training or fine-tuning language models (LLMs or SLMs).
- Demonstrable experience taking a model from dataset preparation through to production deployment.
Technical Skills — Core
- Programming: Python (proficient); familiarity with C++/CUDA a plus.
- Deep Learning Frameworks: PyTorch (primary), TensorFlow or JAX.
- NLP Libraries: Hugging Face Transformers, Datasets, PEFT, TRL, Accelerate.
- Training Infrastructure: Multi-GPU training, FSDP, DeepSpeed ZeRO stages 1–3.
- Data Engineering: Pandas, NumPy, Apache Spark or Dask for large-scale data prep.
- Vector DBs & Retrieval: FAISS, Pinecone, Weaviate, or equivalent.
- Cloud Platforms: AWS, GCP, or Azure, including their ML-specific services (SageMaker, Vertex AI, Azure ML).
- Version Control, CI/CD & Experiment Tracking: Git with GitHub/GitLab; MLflow or Weights & Biases (W&B) for experiment tracking.
Technical Skills — Nice to Have
- Experience with multimodal models or vision-language models (VLMs).
- Knowledge of model safety, alignment techniques, and responsible AI frameworks.
- Familiarity with LangChain, LlamaIndex, or similar orchestration frameworks.
- Contributions to open-source ML projects, or publications at venues such as NeurIPS, ICML, EMNLP, or ACL.
Behavioural Competencies
- Intellectual Curiosity: Keeps up with rapidly evolving AI research and applies new findings pragmatically.
- Problem Solving: Approaches ambiguous problems with structured, data-driven thinking.
- Ownership: Takes full accountability for model quality, timelines, and production reliability.
- Collaboration: Works effectively in cross-functional teams with diverse skill sets.
- Communication: Distils complex technical concepts for diverse audiences.
- Adaptability: Thrives in a fast-paced environment where priorities evolve with research breakthroughs.