About the Role
Cohere is seeking a Site Reliability Engineer, Inference Infrastructure, to focus on the reliability and performance of our inference infrastructure.
Responsibilities
- Ensure high availability and reliability of Cohere's inference systems.
- Optimize the performance and scalability of inference infrastructure.
- Work across teams to troubleshoot and resolve production issues.
- Implement and maintain monitoring, alerting, and logging solutions.
- Develop automation to streamline operational tasks.
Requirements
- Experience in site reliability engineering or a similar role.
- Strong background in managing large-scale distributed systems.
- Proficiency in cloud platforms (e.g., AWS, GCP, Azure).
- Experience with containerization and orchestration (e.g., Docker, Kubernetes).
- Solid understanding of Linux operating systems.
- Ability to work in a fast-paced, dynamic environment.