LLM Evaluation Engineer / Generative AI Quality Engineer - W2 Only

Key Responsibilities

Design and develop automated evaluation pipelines for LLM and agentic AI systems
Create evaluation scenarios and adversarial test datasets to identify model edge cases and bias
Assess AI outputs using metrics such as task success rate, semantic similarity, and sentiment analysis
Analyze and debug agent reasoning, tool usage, and action sequences to identify failure points
Develop advanced prompt engineering strategies to test reasoning, planning, and instruction adherence
Build and maintain custom evaluation frameworks using Python

Requirements

6+ years of hands-on experience in AI model evaluation and debugging
Strong proficiency in Python, automation scripts, and evaluation frameworks
Deep understanding of large language model behaviors, failure modes, and quality validation techniques
Experience with AI failure mode analysis (hallucination, incoherence, jailbreaking)
Familiarity with automated testing frameworks (Pytest) and evaluation metrics

View Assessment Process