Key Responsibilities

Design and evaluate autonomous AI agents across multiple LLMs for real-world domains like health, education, and daily life
Develop evaluation rubrics with objective pass/fail criteria for agent performance
Debug agent traces to identify failure patterns, and stress-test agents against edge cases, prompt injection, and tool misuse
Assess production-grade modular software architecture and analyze multi-turn system interactions
Provide high-density technical feedback for training Large Language Models (LLMs)

Requirements

Experience in backend engineering, AI automation, or complex systems integration
Proven ability to build and maintain production-grade software with modular separation (e.g., distinct services for data parsing, logic processing, and reporting)
Strong command of at least two major programming languages (e.g., Python, JavaScript, Go, or Java) and experience with SQL databases
Practical experience building for live, non-mocked environments and handling multi-turn system interactions

Software Developer - Hire Feed
