Key Responsibilities
- Design and evaluate autonomous AI agents across multiple LLMs for real-world domains like health, education, and daily life
- Develop evaluation rubrics with objective pass/fail criteria for agent performance
- Debug agent traces to identify failure patterns, and stress-test agents against edge cases, prompt injection, and tool misuse
- Assess production-grade modular software architecture and analyze multi-turn system interactions
- Provide concise, information-dense technical feedback used to train Large Language Models (LLMs)
Requirements
- Experience in backend engineering, AI automation, or complex systems integration
- Proven ability to build and maintain production-grade software with modular separation (e.g., distinct services for data parsing, logic processing, and reporting)
- Strong command of at least two major programming languages (e.g., Python, JavaScript, Go, Java), plus experience with SQL databases
- Practical experience building for live, non-mocked environments and handling multi-turn system interactions