As a Software Engineer on our RL Training Infrastructure team, you will design, develop, and maintain the infrastructure that supports our reinforcement learning (RL) training workloads. This includes building and operating large-scale distributed systems, developing tools and scripts that automate and streamline our workflows, and collaborating with other teams so that this work integrates cleanly with our broader infrastructure.
Key Responsibilities:
- Design and develop scalable and efficient infrastructure to support RL training workloads
- Develop tools and scripts to automate and streamline workflows
- Collaborate with other teams to integrate RL training infrastructure with our broader infrastructure
- Develop and maintain high-quality, well-documented code
- Troubleshoot and resolve infrastructure-related issues
Requirements:
- 3+ years of experience in software engineering, with a focus on distributed systems and infrastructure
- Strong understanding of machine learning and reinforcement learning concepts
- Experience with Python, Node.js, and AWS
- Excellent problem-solving and communication skills