Key Responsibilities
- Build and maintain scalable infrastructure for machine learning model training and deployment
- Develop CI/CD pipelines for ML model versioning and testing
- Optimize GPU/TPU resource allocation for training workloads
- Collaborate with data scientists to streamline model deployment workflows
- Monitor infrastructure performance and reliability, and troubleshoot issues as they arise

Requirements
- 3+ years of experience in ML infrastructure or related roles
- Proficiency in containerization (Docker) and orchestration (Kubernetes)
- Experience with MLOps tools (MLflow, Kubeflow, or similar)
- Strong scripting skills (Python, Bash) and working knowledge of a major cloud platform (AWS or GCP)
- Familiarity with distributed computing and GPU acceleration