MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)

Research Scientist - Data

Department
Research
Job Type / Location
Sunnyvale
Experience Required
5+ years

About the Role

MBZUAI is a dedicated research lab focused on building and understanding foundation models. The Research Scientist on the Data team will curate high-quality data at web scale, collaborate with cross-functional teams, and contribute to AI research that influences industries worldwide.

Responsibilities

  • Pioneer web-scale data collection and curation methodologies for LLMs and multi-modal foundation models
  • Design and implement novel data synthesis pipelines for code, mathematics, and agentic reasoning datasets
  • Trace the impact of data from pre-training to final model capabilities and create automated quality assessment frameworks for massive datasets
  • Design data recipes that maximize model capabilities across diverse domains
  • Optimize data-model co-design for improved training dynamics
  • Contribute to research papers and represent MBZUAI at industry conferences and events, showcasing the institution’s AI research and innovation

Required Qualifications

  • Master's degree in Computer Science, Data Science, or a related technical field, or equivalent practical experience
  • Experience working with large language models, including evaluation, fine-tuning, and prompt engineering
  • Strong Python development skills with a focus on research-grade code and scalable data pipelines
  • Familiarity with collecting and processing large-scale datasets from open-source and web resources
  • Demonstrated ability to work with ML infrastructure (e.g., model evaluation, optimization, debugging)
  • Proactive mindset with the ability to identify impactful research questions and execute on them with minimal supervision
  • Effective communication and collaboration skills for working in cross-functional teams

Preferred Qualifications

  • PhD or equivalent research experience in Machine Learning, NLP, or Data Science with a focus on LLMs and data
  • Prior research experience in areas such as web data curation and mixing, synthesizing complex datasets for training, LLM evaluation, post-training data, efficient inference, LLM-as-a-judge, and tokenization
  • Strong publication record in leading AI conferences (e.g., NeurIPS, ICLR, ICML, EMNLP) and/or prior contributions to open-source AI research or data tools
  • Hands-on experience training language/multi-modal models from scratch

Benefits

  • Comprehensive medical, dental, and vision benefits
  • Bonus
  • 401(k) plan
  • Generous paid time off, sick leave, and holidays
  • Paid Parental Leave
  • Employee Assistance Program
  • Life and disability insurance
