Role Overview
Airbnb relies heavily on AI and ML to enhance experiences for guests and hosts across various functions like Trust, Payments, Customer Service, and Marketing. The Core ML team spearheads CSxAI (Customer Support x Artificial Intelligence) initiatives by leveraging Generative AI to deliver intelligent, scalable, and exceptional service. This involves developing and enhancing AI models, ML services, and tools, including LLM fine-tuning and optimization, RAG/Search, LLM evaluation and testing automation, feedback-based learning, and guardrails for diverse Airbnb applications. Given the rich data, marketplace complexity, and product variety at Airbnb, the team operates at the forefront of AI practice, committed to long-term innovation.
The Difference You Will Make
In this Senior Staff role, you will define the technical direction and lead execution for ML evaluation and the end-to-end data flywheel that powers CSxAI products, such as assistive agents, issue resolution, and tooling. Your contributions will be critical in establishing how quality is measured, how feedback translates into learning signals, and how models and products are continuously and safely improved. You will collaborate closely with product, engineering, and design teams to build evaluation systems that are trusted, scalable, and actionable, linking offline metrics directly to online outcomes.
A Typical Day
- Define evaluation strategy and success metrics for GenAI systems, ensuring alignment between offline evaluation and online business and customer experience outcomes.
- Build and scale evaluation frameworks, including golden sets, synthetic data, automated regressions, rubric-based grading, and LLM-as-judge (where appropriate), with robust controls for bias, drift, and reliability.
- Design the data flywheel, covering instrumentation, feedback collection, data quality checks, labeling strategy, dataset versioning, and governance to support continuous improvement.
- Lead cross-functional quality initiatives across product, operations, and engineering, providing clarity on quality standards and enabling teams to act effectively on evaluation results.
- Develop and productionize pipelines for dataset creation, model monitoring, evaluation-at-scale, and continuous testing (both pre-deploy and post-deploy).
- Drive technical decisions and architecture for evaluation and data infrastructure, balancing considerations of speed, rigor, cost, and safety.
Minimum Qualifications
- Educational Background: PhD in Computer Science, Mathematics, Statistics, or a related technical field (or equivalent practical experience).
- Industry Experience: 10+ years building, testing, and shipping ML/AI systems end-to-end, including 2+ years with GenAI/LLM systems in production.
- Leadership Experience: 5+ years leading large, ambiguous technical initiatives as a senior individual contributor, influencing roadmap and engineering/science direction across teams.
- Technical Proficiency:
  - Deep expertise in evaluation methodology (offline/online alignment, metric design, human-in-the-loop evaluation, A/B testing, power analysis, regression testing).
  - Hands-on experience with GenAI systems, including orchestration, retrieval, tool calling, and memory.
  - Experience building data pipelines and quality systems (labeling workflows, dataset curation, versioning, monitoring, and governance).
  - Solid ML fundamentals and best practices (model selection, training/serving, monitoring, reliability, and model lifecycle management).
Preferred Qualifications
- Customer Support Systems: Experience applying ML/AI to customer support workflows (e.g., agent assist, classification/routing, resolution recommendation, QA).
- Infrastructure & Quality at Scale: Experience building robust evaluation platforms for agent behavior validation, safety/guardrails, and continuous improvement.
- Agile Practice for Applied AI: Proven ability to take evaluation and data flywheel work from incubation to production, iterating quickly while maintaining scientific rigor.