TL;DR: A two-stage diffusion framework where the Motion Director predicts dense pixel motion and the Action Expert converts it into executable robot actions.
Motion Director: Latent diffusion predicts dense pixel motion from observation plus language, giving an explicit motion-intent representation.
Action Expert: A diffusion policy conditions on pixel motion, visuals, text, and robot state to generate coherent low-level action chunks (see the interface sketch after this list).
Strong benchmark performance with limited robot data and compact model capacity, while preserving interpretability.
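To make the two-stage interface concrete, here is a minimal, hypothetical sketch in PyTorch. Only the dataflow (observation and language in, pixel motion as the explicit intermediate, an action chunk out) comes from the summary above; every class name, the MLP denoisers, the dimensions, and the 50-step DDPM schedule are illustrative assumptions, not details of DAWN. The latent-space machinery of the Motion Director is also elided: for simplicity the sketch diffuses directly over a downsampled motion map.

```python
import torch
import torch.nn as nn

def ddpm_sample(denoiser, cond, out_dim, steps=50):
    """Standard DDPM ancestral sampling with an epsilon-prediction network."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(cond.shape[0], out_dim)            # start from pure noise
    for t in reversed(range(steps)):
        t_feat = torch.full((cond.shape[0], 1), t / steps)
        eps = denoiser(torch.cat([x, cond, t_feat], dim=-1))
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean
    return x

class MotionDirector(nn.Module):
    """Stage 1: diffusion over a (downsampled) dense pixel-motion map,
    conditioned on image features and a language embedding."""
    def __init__(self, motion_dim=2 * 16 * 16, cond_dim=512):
        super().__init__()
        self.motion_dim = motion_dim
        self.denoiser = nn.Sequential(                 # stand-in for the real backbone
            nn.Linear(motion_dim + cond_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, motion_dim))

    @torch.no_grad()
    def predict_motion(self, image_feat, text_emb):
        cond = torch.cat([image_feat, text_emb], dim=-1)
        return ddpm_sample(self.denoiser, cond, self.motion_dim)

class ActionExpert(nn.Module):
    """Stage 2: diffusion policy emitting a chunk of low-level actions,
    conditioned on pixel motion, visual/text features, and robot state."""
    def __init__(self, chunk=8, action_dim=7, cond_dim=2 * 16 * 16 + 512 + 14):
        super().__init__()
        self.chunk, self.action_dim = chunk, action_dim
        self.denoiser = nn.Sequential(
            nn.Linear(chunk * action_dim + cond_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, chunk * action_dim))

    @torch.no_grad()
    def sample_actions(self, motion, obs_cond, robot_state):
        cond = torch.cat([motion, obs_cond, robot_state], dim=-1)
        flat = ddpm_sample(self.denoiser, cond, self.chunk * self.action_dim)
        return flat.view(-1, self.chunk, self.action_dim)

# One control step: motion intent first, then executable actions.
image_feat, text_emb = torch.randn(1, 256), torch.randn(1, 256)
state = torch.randn(1, 14)
md, ae = MotionDirector(), ActionExpert()
motion = md.predict_motion(image_feat, text_emb)       # inspectable intermediate
actions = ae.sample_actions(motion, torch.cat([image_feat, text_emb], -1), state)
print(motion.shape, actions.shape)                     # (1, 512) and (1, 8, 7)
```

Note that the pixel-motion map is returned explicitly rather than kept internal to the network, which is what makes the intermediate representation inspectable.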
We present DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via a structured pixel-motion representation. In DAWN, both the high-level and low-level controllers are modeled as diffusion processes, yielding a fully trainable, end-to-end system with interpretable intermediate motion abstractions. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark, demonstrating strong multi-task performance, and further validates its effectiveness on MetaWorld. Despite the substantial domain gap between simulation and reality and limited real-world data, we demonstrate reliable real-world transfer with only minimal fine-tuning, illustrating the practical viability of diffusion-based motion abstractions for robotic control. Our results show the effectiveness of combining diffusion modeling with motion-centric representations as a strong baseline for scalable and robust robot learning.
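Since both controllers are diffusion processes, one plausible reading of "fully trainable, end-to-end" is that both stages optimize the same standard denoising objective on their respective targets. The sketch below shows that standard DDPM epsilon-prediction loss; the schedule, shapes, and toy network are our assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def diffusion_loss(denoiser, x0, cond, steps=50):
    """Noise a clean target x0 (a pixel-motion map for the Motion Director,
    an action chunk for the Action Expert) at a random timestep and regress
    the injected noise: the standard DDPM epsilon-prediction objective."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, steps, (x0.shape[0],))
    a = alpha_bars[t].unsqueeze(-1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps         # forward noising
    t_feat = (t.float() / steps).unsqueeze(-1)
    eps_hat = denoiser(torch.cat([x_t, cond, t_feat], dim=-1))
    return F.mse_loss(eps_hat, eps)

# Toy usage with an action-chunk target (shapes are illustrative only).
net = nn.Sequential(nn.Linear(56 + 1038 + 1, 128), nn.SiLU(), nn.Linear(128, 56))
x0, cond = torch.randn(4, 56), torch.randn(4, 1038)
diffusion_loss(net, x0, cond).backward()               # same recipe for both stages
```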
See DAWN in action with real-world robotic applications and scenarios.