
Mind Robotics

Machine Learning Infrastructure Engineer

Department: Engineering
Job Type / Location: Palo Alto
Experience Required: 3+ years

The Role

At Mind Robotics, we’re building generalized physical AI—robotic systems capable of dexterous, adaptive, and reasoning-intensive work in real-world industrial environments. Our ability to iterate quickly on large-scale models depends on world-class ML infrastructure.

We’re looking for a Machine Learning Infrastructure Engineer to build the core systems that enable fast, reliable, and scalable model training—powering everything from experimentation to production deployment.

Responsibilities

  • Design and implement scalable systems for training large ML models
  • Enable efficient workflows for data ingestion, training, and iteration
  • Develop and optimize distributed training systems across hundreds of GPUs
  • Implement strategies for parallelization, sharding, and efficient compute utilization
  • Improve training efficiency through techniques such as attention optimizations, kernel fusion, and memory management
  • Partner closely with modeling teams to accelerate iteration speed and reduce training costs
  • Build internal tools for experiment tracking, monitoring, and debugging
  • Implement systems for tracking training performance, failures, and resource utilization
  • Debug and resolve bottlenecks across the training stack
  • Provide lightweight infrastructure support for deploying and running models in production environments
  • Optimize inference performance and reliability where needed
  • Support core cloud infrastructure needs for training workloads (without heavy DevOps overhead)
  • Manage compute resources efficiently across training jobs

Qualifications

  • Strong experience building infrastructure for large-scale ML training
  • Deep understanding of how modern LLM/VLM systems are trained and scaled
  • Proven experience setting up and scaling distributed training across hundreds of GPUs
  • Strong understanding of parallelization strategies (data, model, pipeline parallelism)
  • Strong proficiency in Python
  • Expert-level proficiency in PyTorch and/or JAX
  • Strong understanding of techniques like attention optimization, kernel fusion, and efficient memory usage

Nice to Have

  • Experience supporting inference systems in production
  • Familiarity with robotics or embodied AI workloads
  • Experience building tools for experiment management and researcher productivity

