Key Responsibilities
- Design and evaluate autonomous AI agents across multiple LLMs for real-world domains including health, education, and daily life
- Develop evaluation rubrics with objective pass/fail criteria for agent performance
- Debug agent traces to identify failure patterns, and stress-test agents against edge cases, prompt injection, and tool misuse
- Assess production-grade modular software architecture for multi-turn system interactions
- Provide concise, information-dense technical feedback for LLM training
- Build and maintain production-grade software with modular separation (e.g., distinct services for data parsing, logic processing, and reporting)
Requirements
- Experience in backend engineering, AI automation, or complex systems integration
- Strong command of at least two major programming languages (e.g., Python, JavaScript, Go, or Java), plus working knowledge of SQL databases
- Practical experience building for live, non-mocked environments with multi-turn system interactions
- Familiarity with persistent state and session-tracking patterns
- Ability to identify privacy leaks, authority escalation, and indirect prompt injection vulnerabilities