About the Role

At JetBrains, we are passionate about code and about creating robust, effective developer tools. We are building an ambitious new platform that integrates AI capabilities across all JetBrains products, using both models developed in-house for coding assistance and integrations with strategic partners.

We are seeking a Research Engineer to help train foundation models for coding tasks. In this role, you will develop Large Language Models from scratch and deploy them to production for end users around the world.

We value engineers who:

Can plan projects and make decisions independently, consulting with others as needed.
Identify customer needs and prioritize tasks accordingly.
Start with the simplest solutions and gradually add complexity.
Take sole responsibility for an entire subsystem.
Have a passion for learning and a desire to stay up to date with the latest developments in the LLM field.

In this role, you will:

Work with stakeholders to convert business requirements into technical specifications.
Train LLMs from scratch on a large GPU cluster.
Collect and process pre-training and fine-tuning datasets.
Support and improve existing subsystems.

Requirements

We’ll be happy to have you on our team if you have:

Experience designing, deploying, and supporting production ML systems.
A strong theoretical background in NLP and transformer-based approaches.
Proficiency with modern deep learning frameworks such as PyTorch and common libraries for NLP.
Experience in distributed training of multi-billion-parameter models.
Attention to detail in everything you do and great communication skills.

We’d be especially thrilled if you have experience with:

LLM inference frameworks such as vLLM, DeepSpeed, and TensorRT.
LLM alignment techniques such as RLHF/RLAIF.
MLOps tools and practices, including CI/CD for ML.
Kubernetes (K8s) and Kubeflow.
Publishing scientific papers in the NLP field.

How we develop JetBrains AI:

A cluster of hundreds of NVIDIA GPUs as training infrastructure.
Git for source control management.
Python, PyTorch, and Hugging Face as our ML stack.
Kubeflow and Weights & Biases for experiment tracking.
TeamCity as our CI automation system.

Staff Research Engineer (LLM Pre-Training)