Key Responsibilities
- Design and evaluate autonomous AI agents across multiple LLMs for real-world domains like health, education, and daily life
- Develop evaluation rubrics with objective pass/fail criteria for AI agent performance
- Debug agent traces to identify failure patterns, and stress-test agents against edge cases
- Assess production-grade, modular software architectures that support multi-turn system interactions
- Provide dense, specific technical feedback for training LLMs
- Handle multi-turn system interactions and ensure robust tool integration
Requirements
- Experience in backend engineering, AI automation, or complex systems integration
- Strong command of at least two major programming languages (e.g., Python, JavaScript, Go, or Java), plus experience with SQL databases
- Proven ability to build and maintain production-grade software with modular separation
- Practical experience with live, non-mocked environments and multi-turn system interactions
- Familiarity with persistent state, session-tracking patterns, and security vulnerabilities such as prompt injection