Key Responsibilities
- Lead a cross-functional team of ML engineers and platform specialists to build and scale ML infrastructure
- Define technical roadmaps for ML platform capabilities, ensuring alignment with business and research goals
- Optimize CI/CD pipelines, model deployment workflows, and data infrastructure for performance and reliability
- Collaborate with research scientists to translate experimental models into production-grade systems
- Establish best practices for security, compliance, and cost-efficiency in ML operations
- Mentor engineers and foster a culture of innovation, ownership, and technical excellence

Requirements
- Proven track record leading ML platform teams in production environments
- Deep expertise in distributed systems, cloud platforms (AWS/GCP), and container tooling and orchestration (Docker/Kubernetes)
- Hands-on experience with ML frameworks (PyTorch, TensorFlow) and MLOps tools (MLflow, Kubeflow)
- Strong background in software engineering with proficiency in Python, Go, or Java
- Experience scaling ML systems to handle petabyte-scale datasets and high-throughput inference