Pixel Motion Diffusion is What We Need for Robot Control

DAWN

E-Ro Nguyen*, Yichi Zhang*, Kanchana Ranasinghe, Xiang Li, Michael S Ryoo
Stony Brook University
*Equal contribution
CVPR 2026
DAWN Framework

TL;DR: A two-stage diffusion framework where the Motion Director predicts dense pixel motion and the Action Expert converts it into executable robot actions.

Motion Director

A latent diffusion model predicts dense pixel motion from the visual observation and language instruction, giving an explicit representation of motion intent.

Action Expert

A diffusion policy conditions on pixel motion, visual observations, language, and robot state to generate coherent low-level action chunks.

Data Efficiency

Achieves strong benchmark performance with limited robot data and a compact model, while preserving interpretability.

1. Inputs: Visual observations + language instruction + robot state.
2. Predict Motion: Generate dense pixel motion as structured intermediate dynamics.
3. Execute Actions: Translate motion to robot control with a diffusion-based action policy.
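The three steps above can be sketched as a minimal two-stage sampling loop. This is an illustrative toy, not the paper's implementation: the denoisers are placeholder functions standing in for the learned Motion Director and Action Expert networks, the sampling schedule is deliberately simplified, and all names (`dawn_inference`, `ddpm_sample`, the shapes) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the two learned denoisers (hypothetical, not the paper's networks).
def motion_director_eps(motion, obs, text_emb, t):
    # Predicts the noise in the pixel-motion latent given observation + language.
    return 0.1 * motion + 0.01 * t

def action_expert_eps(actions, motion, obs, text_emb, state, t):
    # Predicts the noise in the action chunk given motion, visuals, text, and state.
    return 0.1 * actions + 0.01 * t

def ddpm_sample(eps_fn, shape, cond, steps=10):
    """Minimal DDPM-style ancestral sampling loop with a crude, simplified schedule."""
    x = rng.standard_normal(shape)           # start from Gaussian noise
    for t in range(steps, 0, -1):
        eps = eps_fn(x, *cond, t)
        x = x - eps / steps                  # simplified denoising update
        if t > 1:
            x = x + 0.01 * rng.standard_normal(shape)  # stochastic term
    return x

def dawn_inference(obs, text_emb, state, horizon=8, action_dim=7, hw=(16, 16)):
    # Stage 1: Motion Director samples dense pixel motion (H x W x 2 flow field).
    motion = ddpm_sample(motion_director_eps, (*hw, 2), (obs, text_emb))
    # Stage 2: Action Expert samples an action chunk conditioned on that motion.
    actions = ddpm_sample(action_expert_eps, (horizon, action_dim),
                          (motion, obs, text_emb, state))
    return motion, actions
```

The key design point the sketch reflects is that the pixel-motion field sampled in stage 1 is the only new conditioning signal passed to stage 2, which makes the intermediate motion both inspectable and swappable.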

Abstract

We present DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via structured pixel motion representation. In DAWN, both the high-level and low-level controllers are modeled as diffusion processes, yielding a fully trainable, end-to-end system with interpretable intermediate motion abstractions. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark, demonstrating strong multi-task performance, and further validates its effectiveness on MetaWorld. Despite the substantial domain gap between simulation and reality and limited real-world data, we demonstrate reliable real-world transfer with only minimal finetuning, illustrating the practical viability of diffusion-based motion abstractions for robotic control. Our results show the effectiveness of combining diffusion modeling with motion-centric representations as a strong baseline for scalable and robust robot learning.

Demonstrations

Bimanual Manipulation

Task Goal: clean the cutting board

Task Goal: fold clothes

Real World Demonstrations

See DAWN in action in real-world robotic applications and scenarios.

Task Goal: Lift a grape from the table

Video

Predicted Pixel Motion

Task Goal: Lift a kiwi from the table

Video

Predicted Pixel Motion

Task Goal: Lift an orange from the table

Video

Predicted Pixel Motion

CALVIN Demonstrations

See DAWN in action on the CALVIN benchmark.

Task Goal: lift blue block slider -> place in slider -> turn on lightbulb -> open drawer -> push pink block left

Video

Predicted Pixel Motion

Task Goal: rotate blue block left -> open drawer -> lift pink block table -> place in drawer -> turn on led

Video

Predicted Pixel Motion