As a Site Reliability Engineer on the Frontier Systems Infrastructure team at OpenAI, you will design and implement scalable, reliable systems infrastructure. You will work closely with engineering teams to ensure the reliability and performance of our systems, and collaborate with other teams to identify and prioritize infrastructure needs.
Key Responsibilities:
- Design and implement scalable and reliable systems infrastructure
- Collaborate with engineering teams to ensure system reliability and performance
- Work with other teams to identify and prioritize infrastructure needs
- Develop and maintain monitoring and alerting systems
- Improve system efficiency and reduce costs
Requirements:
- 3+ years of experience in site reliability engineering or a related field
- Strong understanding of cloud computing platforms, such as AWS
- Experience with containerization and orchestration tools, such as Docker and Kubernetes
- Proficiency in at least one programming language, such as Python or JavaScript (Node.js)
- Strong problem-solving skills and ability to work in a fast-paced environment