
Mecka AI

Research Scientist, Video Understanding & World Models

Department
Research
Job Type / Location
New York
Experience Required
2+ years

About the Role

We are looking for a Research Scientist, Video Understanding, to own Mecka’s video understanding agenda end-to-end: train large-scale video representation and video-language models on our egocentric and stereo corpus, and turn the resulting checkpoints into production signals that the rest of the stack ships on.

The role focuses on large-model training, video encoders, video-language models (VLMs/VLAs), and temporal representation learning on real-world robotics data.

What You’ll Work On

Large-Scale Training & Architecture

  • Own model architecture and training strategy across Mecka’s task families (manipulation, locomotion, daily activity, long-horizon behavior).
  • Run self-supervised and multimodal pretraining (VideoMAE / V-JEPA / VideoPrism / InternVideo-class) with rigorous evals and clean ablations.

Video-Language & Multimodal Modeling

  • Train and fine-tune video encoders and video-language models (temporal transformers, joint-embedding models, contrastive objectives, masked modeling, instruction/video alignment).
  • Incorporate useful priors (pose, depth, camera motion, optical flow) when they improve representation quality.

Research → Production Signals

  • Turn checkpoints into usable artifacts: embeddings and model outputs that downstream systems can reliably consume (retrieval, labeling, QA, analytics).
  • Build a disciplined training + eval workflow with regression tracking and reproducible runs.

Who You Are

Required Background

  • Deep experience training large models in PyTorch (or equivalent), including multi-GPU or distributed training.
  • Strong understanding of modern video representation learning and/or multimodal modeling.
  • Ability to run rigorous experiments and communicate results clearly.

Strong Signals

  • Experience with video VLMs / VLA-adjacent systems (VideoCLIP, InstructBLIP-Video, LLaVA-Video-class).
  • Experience with egocentric / embodied datasets (Ego4D, Ego-Exo4D, EPIC-Kitchens, Something-Something).
  • Strong software engineering discipline: you write research code that can be shipped.

Why This Role

  • Work on egocentric embodied video, a domain where data is scarce everywhere except here.
  • Own a research agenda that directly feeds production systems and product outcomes.
