Key Responsibilities
- Design and develop automated evaluation pipelines for LLM and agentic AI systems
- Create evaluation scenarios and adversarial test datasets to identify model edge cases and bias
- Assess AI outputs using metrics such as task success rate, semantic similarity, and sentiment analysis
- Analyze and debug agent reasoning, tool usage, and action sequences to identify failure points
- Develop advanced prompt engineering strategies to test reasoning, planning, and instruction adherence
- Build and maintain custom evaluation frameworks using Python
Requirements
- 6+ years of hands-on experience in AI model evaluation and debugging
- Strong proficiency in Python, automation scripts, and evaluation frameworks
- Deep understanding of large language model behaviors, failure modes, and quality validation techniques
- Experience with AI failure mode analysis (hallucination, incoherence, jailbreaking)
- Familiarity with automated testing frameworks (Pytest) and evaluation metrics