ML Systems Engineer – Robotics & AI
We are building the full-stack foundation for the next generation of humanoid robots, from high-performance, software-defined hardware to the foundation models and video world models that control them. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling scenarios unseen in training. We work at the intersection of large-scale learning, robotics, and systems, with a research team of top-tier experts from leading universities and institutions. We are creating a new computing platform for physical work, backed by significant funding and investing aggressively in R&D, hardware development, and manufacturing scale-up.
We are hiring a Staff/Principal ML Systems Engineer to own training performance end-to-end and turn our training platform into a high-efficiency, high-reliability engine for research iteration.
What You Will Do
- Own performance at scale.
- Diagnose and improve end-to-end training performance for large models trained on multimodal robotic data including vision, proprioception, actions, language, and video.
- Build a repeatable workflow for performance attribution, including step-time breakdown, scaling curves, and bottleneck identification at different GPU counts.
- Drive measurable gains in distributed efficiency, compute efficiency, and memory efficiency.
- Make performance observable and durable through source-of-truth metrics, dashboards, and automated regression detection.
- Partner closely with researchers and research engineers to translate model changes into scalable implementations.
- Provide guidance on training strategy tradeoffs relevant to robotics world models.
- Reduce the operational burden on researchers so they can focus on model quality and robotic behavior.
- Collaborate with infrastructure teams on cluster efficiency to reduce wasted GPU-hours from stragglers, degraded nodes, network health issues, checkpoint stalls, and scheduler placement problems.
What We Are Looking For
- Significant experience delivering distributed training performance improvements in production research environments, with large-scale GPU training strongly preferred.
- Strong hands-on experience with modern training stacks such as PyTorch; familiarity with JAX is a plus.
- Deep understanding of distributed training concepts and tradeoffs, including sharded training, tensor/pipeline parallelism, gradient accumulation, communication/compute overlap, and diagnosing collective communication performance.
- Strong debugging and measurement instincts with the ability to turn ambiguous performance issues into clear bottlenecks and validated fixes.
- Comfortable operating in a fast-moving startup environment with high ownership and minimal bureaucracy.
Nice to Have
- Experience with GPU kernel-level performance work including CUDA or Triton, fused ops, and compiler/graph capture.
- Experience with multimodal or video training and variable-length sequence packing or bucketing.
- Experience building observability systems for ML training including metrics, logs, traces, dashboards, and alerting.
- Familiarity with large-cluster scheduling or topology-aware placement in Slurm, Kubernetes, or HPC environments.
Why This Role
- Direct impact on model iteration speed, translating into faster research cycles and better robotic capability.
- Work at the frontier of large-scale training for real-world robotics, not toy benchmarks.
- Tight collaboration between systems, research, and infrastructure with no silos.
- High ownership in a small, ambitious team building foundational technology.
- Meaningful leverage, as improvements you make compound across every training run executed by the research team.