TL;DR: A two-stage diffusion framework where the Motion Director predicts dense pixel motion and the Action Expert converts it into executable robot actions.
Motion Director: Latent diffusion predicts dense pixel motion from observation plus language, giving an explicit motion-intent representation.
Action Expert: A diffusion policy conditions on pixel motion, visuals, text, and robot state to generate coherent low-level action chunks (see the interface sketch after this list).
Strong benchmark performance with limited robot data and compact model capacity, while preserving interpretability.
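To make the two-stage interface concrete, here is a minimal, hypothetical sketch in PyTorch. Only the dataflow (observation and language in, pixel motion as the explicit intermediate, an action chunk out) comes from the summary above; every class name, the MLP denoisers, the dimensions, and the 50-step DDPM schedule are illustrative assumptions, not details of DAWN. The latent-space machinery of the Motion Director is also elided: for simplicity the sketch diffuses directly over a downsampled motion map.

```python
import torch
import torch.nn as nn

def ddpm_sample(denoiser, cond, out_dim, steps=50):
    """Standard DDPM ancestral sampling with an epsilon-prediction network."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(cond.shape[0], out_dim)            # start from pure noise
    for t in reversed(range(steps)):
        t_feat = torch.full((cond.shape[0], 1), t / steps)
        eps = denoiser(torch.cat([x, cond, t_feat], dim=-1))
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean
    return x

class MotionDirector(nn.Module):
    """Stage 1: diffusion over a (downsampled) dense pixel-motion map,
    conditioned on image features and a language embedding."""
    def __init__(self, motion_dim=2 * 16 * 16, cond_dim=512):
        super().__init__()
        self.motion_dim = motion_dim
        self.denoiser = nn.Sequential(                 # stand-in for the real backbone
            nn.Linear(motion_dim + cond_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, motion_dim))

    @torch.no_grad()
    def predict_motion(self, image_feat, text_emb):
        cond = torch.cat([image_feat, text_emb], dim=-1)
        return ddpm_sample(self.denoiser, cond, self.motion_dim)

class ActionExpert(nn.Module):
    """Stage 2: diffusion policy emitting a chunk of low-level actions,
    conditioned on pixel motion, visual/text features, and robot state."""
    def __init__(self, chunk=8, action_dim=7, cond_dim=2 * 16 * 16 + 512 + 14):
        super().__init__()
        self.chunk, self.action_dim = chunk, action_dim
        self.denoiser = nn.Sequential(
            nn.Linear(chunk * action_dim + cond_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, chunk * action_dim))

    @torch.no_grad()
    def sample_actions(self, motion, obs_cond, robot_state):
        cond = torch.cat([motion, obs_cond, robot_state], dim=-1)
        flat = ddpm_sample(self.denoiser, cond, self.chunk * self.action_dim)
        return flat.view(-1, self.chunk, self.action_dim)

# One control step: motion intent first, then executable actions.
image_feat, text_emb = torch.randn(1, 256), torch.randn(1, 256)
state = torch.randn(1, 14)
md, ae = MotionDirector(), ActionExpert()
motion = md.predict_motion(image_feat, text_emb)       # inspectable intermediate
actions = ae.sample_actions(motion, torch.cat([image_feat, text_emb], -1), state)
print(motion.shape, actions.shape)                     # (1, 512) and (1, 8, 7)
```

Note that the pixel-motion map is returned explicitly rather than kept internal to the network, which is what makes the intermediate representation inspectable.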
We present DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via a structured pixel-motion representation. In DAWN, both the high-level and low-level controllers are modeled as diffusion processes, yielding a fully trainable, end-to-end system with interpretable intermediate motion abstractions. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark, demonstrating strong multi-task performance, and further validates its effectiveness on MetaWorld. Despite the substantial domain gap between simulation and reality and limited real-world data, we demonstrate reliable real-world transfer with only minimal fine-tuning, illustrating the practical viability of diffusion-based motion abstractions for robotic control. Our results show the effectiveness of combining diffusion modeling with motion-centric representations as a strong baseline for scalable and robust robot learning.
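Since both controllers are diffusion processes, one plausible reading of "fully trainable, end-to-end" is that both stages optimize the same standard denoising objective on their respective targets. The sketch below shows that standard DDPM epsilon-prediction loss; the schedule, shapes, and toy network are our assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def diffusion_loss(denoiser, x0, cond, steps=50):
    """Noise a clean target x0 (a pixel-motion map for the Motion Director,
    an action chunk for the Action Expert) at a random timestep and regress
    the injected noise: the standard DDPM epsilon-prediction objective."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, steps, (x0.shape[0],))
    a = alpha_bars[t].unsqueeze(-1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps         # forward noising
    t_feat = (t.float() / steps).unsqueeze(-1)
    eps_hat = denoiser(torch.cat([x_t, cond, t_feat], dim=-1))
    return F.mse_loss(eps_hat, eps)

# Toy usage with an action-chunk target (shapes are illustrative only).
net = nn.Sequential(nn.Linear(56 + 1038 + 1, 128), nn.SiLU(), nn.Linear(128, 56))
x0, cond = torch.randn(4, 56), torch.randn(4, 1038)
diffusion_loss(net, x0, cond).backward()               # same recipe for both stages
```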
See DAWN in action with real-world robotic applications and scenarios.