Key Responsibilities
- Design and maintain GPU-accelerated Kubernetes clusters
- Optimize resource allocation for ML workloads
- Develop automation for cluster provisioning and scaling
- Monitor infrastructure performance and reliability
- Collaborate with ML teams to optimize workload placement
- Implement security best practices for multi-tenant clusters

Requirements
- 4+ years of experience in cloud infrastructure or systems engineering
- Deep expertise in Kubernetes and containerization
- Experience with GPU scheduling and resource management
- Proficiency with cloud platforms (AWS, GCP, or Azure)
- Strong scripting and automation skills