ML Systems Engineer – Robotics & AI
We are building the full-stack foundation for the next generation of humanoid robots, from high-performance, software-defined hardware to the foundation models and video world models that control them. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling scenarios unseen in training. We work at the intersection of large-scale learning, robotics, and systems, with a research team of top-tier experts from leading universities and institutions. We are creating a new computing platform for physical work, backed by significant funding and investing aggressively in R&D, hardware development, and manufacturing scale-up.
We are hiring a Staff/Principal ML Systems Engineer to own training performance end-to-end and turn our training platform into a high-efficiency, high-reliability engine for research iteration.
What You Will Do
- Own performance at scale.
- Diagnose and improve end-to-end training performance for large models trained on multimodal robotic data including vision, proprioception, actions, language, and video.
- Build a repeatable workflow for performance attribution, including step-time breakdown, scaling curves, and bottleneck identification at different GPU counts.
- Drive measurable gains in distributed efficiency, compute efficiency, and memory efficiency.
- Make performance observable and durable through source-of-truth metrics, dashboards, and automated regression detection.
- Partner closely with researchers and research engineers to translate model changes into scalable implementations.
- Provide guidance on training strategy tradeoffs relevant to robotics world models.
- Reduce the operational burden on researchers so they can focus on model quality and robotic behavior.
- Collaborate with infrastructure teams on cluster efficiency to reduce wasted GPU-hours from stragglers, degraded nodes, network health issues, checkpoint stalls, and scheduler placement problems.
What We Are Looking For
- Significant experience delivering distributed training performance improvements in production research environments, with large-scale GPU training strongly preferred.
- Strong hands-on experience with modern training stacks such as PyTorch; familiarity with JAX is a plus.
- Deep understanding of distributed training concepts and tradeoffs, including sharded training, tensor/pipeline parallelism, gradient accumulation, communication/compute overlap, and diagnosing collective communication performance.
- Strong debugging and measurement instincts with the ability to turn ambiguous performance issues into clear bottlenecks and validated fixes.
- Comfortable operating in a fast-moving startup environment with high ownership and minimal bureaucracy.
Nice to Have
- Experience with GPU kernel-level performance work including CUDA or Triton, fused ops, and compiler/graph capture.
- Experience with multimodal or video training and variable-length sequence packing or bucketing.
- Experience building observability systems for ML training including metrics, logs, traces, dashboards, and alerting.
- Familiarity with large-cluster scheduling or topology-aware placement in Slurm, Kubernetes, or HPC environments.
Why This Role
- Direct impact on model iteration speed, translating into faster research cycles and better robotic capability.
- Work at the frontier of large-scale training for real-world robotics, not toy benchmarks.
- Tight collaboration between systems, research, and infrastructure with no silos.
- High ownership in a small, ambitious team building foundational technology.
- Meaningful leverage, as improvements you make compound across every training run executed by the research team.