About the Role
MBZUAI is a dedicated research lab focused on building and understanding foundation models. The Research Scientist on the Data team will curate high-quality data at web scale, collaborate with cross-functional teams, and contribute to AI research that influences industries worldwide.
Responsibilities
- Pioneer web-scale data collection and curation methodologies for LLMs and multi-modal foundation models
- Design and implement novel data synthesis pipelines for code, mathematics, and agentic reasoning datasets
- Trace the impact of data from pre-training to final model capabilities and create automated quality assessment frameworks for massive datasets
- Design data recipes that maximize model capabilities across diverse domains
- Optimize data-model co-design for improved training dynamics
- Contribute to research papers and represent MBZUAI at industry conferences and events, showcasing the institution’s AI research and innovation
Required Qualifications
- Master's degree in Computer Science, Data Science, or a related technical field, or equivalent practical experience
- Experience working with large language models, including evaluation, fine-tuning, and prompt engineering
- Strong Python development skills with a focus on research-grade code and scalable data pipelines
- Familiarity with collecting and processing large-scale datasets from open-source and web resources
- Demonstrated ability to work with ML infrastructure (e.g., model evaluation, optimization, debugging)
- Proactive mindset with the ability to identify impactful research questions and execute on them with minimal supervision
- Effective communication and collaboration skills for working in cross-functional teams
Preferred Qualifications
- PhD or equivalent research experience in Machine Learning, NLP, or Data Science with a focus on LLMs and data
- Prior research experience in areas such as web data curation and mixing, synthesizing complex datasets for training, LLM evaluation, post-training data, efficient inference, LLM-as-a-judge, and tokenization
- Strong publication record in leading AI conferences (e.g., NeurIPS, ICLR, ICML, EMNLP) and/or prior contributions to open-source AI research or data tools
- Hands-on experience training language/multi-modal models from scratch
Benefits
- Comprehensive medical, dental, and vision benefits
- Bonus
- 401K Plan
- Generous paid time off, sick leave, and holidays
- Paid Parental Leave
- Employee Assistance Program
- Life and disability insurance