Key Responsibilities
- Design and evaluate autonomous AI agents across multiple LLMs for real-world domains including health, education, and daily life
- Develop evaluation rubrics with objective pass/fail criteria for AI agent performance
- Debug agent traces to identify failure patterns, and stress-test agents against edge cases
- Assess production-grade modular software architecture for AI systems
- Analyze multi-turn system interactions and behaviors to provide concise, information-dense technical feedback
- Create and maintain scalable workflows for AI agent training and deployment
Requirements
- Experience in backend engineering, AI automation, or complex systems integration
- Proven ability to build and maintain production-grade software with clear separation between modules
- Strong command of at least two major programming languages (e.g., Python, JavaScript, Go, Java) and experience with SQL databases
- Practical experience building for live, non-mocked environments and handling multi-turn system interactions
- Familiarity with persistent state and session-tracking patterns