Key Responsibilities
- Design and optimize training and inference systems for machine learning workloads
- Develop and maintain scalable infrastructure for ML model training and deployment
- Collaborate with ML teams to improve model performance and efficiency
- Implement MLOps practices for model lifecycle management
- Monitor and troubleshoot system performance and reliability

Requirements
- 3+ years of experience in ML infrastructure or related fields
- Proficiency in GPU computing, Kubernetes, and Docker
- Strong programming skills in Python and familiarity with ML frameworks
- Experience with MLOps tools (e.g., MLflow, Kubeflow)
- Understanding of distributed computing and parallel processing