Key Responsibilities
- Design and maintain GPU-accelerated infrastructure for AI workloads
- Develop orchestration systems for efficient GPU resource allocation
- Optimize cloud infrastructure for high-performance computing workloads
- Implement monitoring and alerting for GPU utilization and health
- Collaborate with ML teams to scale infrastructure for training and inference
- Automate deployment and management of GPU clusters
Requirements
- 3+ years of experience in infrastructure engineering or DevOps
- Expertise in Kubernetes, Docker, and cloud platforms
- Experience with GPU virtualization and orchestration
- Strong scripting skills (Python/Bash) for automation
- Familiarity with monitoring tools (Prometheus/Grafana)