HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising

Kai Zou1,5 Dian Zheng2 Hongbo Liu3 Tiankai Hang4 Bin Liu1,5* Nenghai Yu1,5
1University of Science and Technology of China  ·  2The Chinese University of Hong Kong  ·  3Tongji University
4Tencent Hunyuan  ·  5Anhui Province Key Laboratory of Digital Security, USTC
Contact: kzou@mail.ustc.edu.cn  |  * Corresponding author

Official teaser: https://youtu.be/FMF-N-nuElc

Motivation figure for HiAR showing long-video drift reduction with hierarchical denoising

TLDR

HiAR changes autoregressive video diffusion from block-first denoising to a step-first hierarchical schedule. Each block conditions on earlier blocks at the same noise level, which reduces long-horizon drift while preserving temporal continuity. The same dependency pattern also enables pipelined parallel inference, and a forward-KL regulariser helps keep motion diverse during self-rollout distillation.

Abstract

Autoregressive (AR) diffusion offers a promising framework for generating videos of theoretically infinite length. However, a major challenge is maintaining temporal continuity while preventing the progressive quality degradation caused by error accumulation. To ensure continuity, existing methods typically condition on highly denoised contexts; yet, this practice propagates prediction errors with high certainty, thereby exacerbating degradation. In this paper, we argue that a highly clean context is unnecessary. Drawing inspiration from bidirectional diffusion models, which denoise frames at a shared noise level while maintaining coherence, we propose that conditioning on context at the same noise level as the current block provides sufficient signal for temporal consistency while effectively mitigating error propagation.

Building on this insight, we propose HiAR, a hierarchical denoising framework that reverses the conventional generation order: instead of completing each block sequentially, it performs causal generation across all blocks at every denoising step, so that each block is always conditioned on context at the same noise level. This hierarchy naturally admits pipelined parallel inference, yielding a ~1.8× wall-clock speedup in our 4-step setting. We further observe that self-rollout distillation under this paradigm amplifies a low-motion shortcut inherent to the mode-seeking reverse-KL objective. To counteract this, we introduce a forward-KL regulariser in bidirectional-attention mode, which preserves motion diversity for causal inference without interfering with the distillation loss. On VBench (20 s generation), HiAR achieves the best overall score and the lowest temporal drift among all compared methods.

Method

The key design choice in HiAR is to condition each block on earlier blocks at the same output noise level of the current denoising step, rather than on fully denoised context. This changes autoregressive generation from a block-first schedule to a hierarchical step-first schedule, reducing bias accumulation while preserving temporal causality.
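The scheduling difference can be sketched in a few lines of plain Python. This is an illustrative toy, not the paper's code: `block_first_schedule`, `step_first_schedule`, and the block/step counts are our own names and values. It checks how far (in denoising steps) each block's context can lag or lead the block itself, and why the step-first dependency pattern admits a pipelined wavefront.

```python
def block_first_schedule(num_blocks, num_steps):
    """Conventional AR order: finish all denoising steps of a block
    before starting the next block."""
    return [(b, t) for b in range(num_blocks) for t in range(num_steps)]

def step_first_schedule(num_blocks, num_steps):
    """HiAR-style hierarchical order: at every denoising step, sweep
    causally over all blocks, so each block conditions on earlier
    blocks at (essentially) the same noise level."""
    return [(b, t) for t in range(num_steps) for b in range(num_blocks)]

def context_noise_gap(schedule):
    """Worst-case gap, in completed denoising steps, between a block
    and its causal context at the moment the block is denoised."""
    completed = {}  # block index -> number of completed denoising steps
    worst = 0
    for b, t in schedule:
        for c in range(b):  # causal context: all earlier blocks
            worst = max(worst, abs(completed.get(c, 0) - t))
        completed[b] = t + 1
    return worst

def pipelined_makespan(num_blocks, num_steps):
    """In the step-first order, block b at step t needs only block b-1's
    step-t output, so with enough parallel workers the wavefront finishes
    in num_blocks + num_steps - 1 slots instead of num_blocks * num_steps."""
    return num_blocks + num_steps - 1

print(context_noise_gap(block_first_schedule(4, 4)))  # 4: context fully denoised
print(context_noise_gap(step_first_schedule(4, 4)))   # 1: matched-noise context
print(pipelined_makespan(4, 4))                       # 7, vs. 16 sequential slots
```

The toy speedup (16/7 here) is an idealised upper bound under free parallelism; the paper's reported ~1.8× is a measured wall-clock figure and should not be read off this sketch.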

Overview figure of HiAR hierarchical denoising and training pipeline
Overview of HiAR. Standard block-first autoregressive denoising amplifies drift, while hierarchical denoising uses matched-noise context and forward-KL regularisation to improve long-horizon generation.
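The low-motion shortcut mentioned above comes from the mode-seeking character of the reverse-KL objective. A minimal numerical sketch (our own toy example, not from the paper) makes this concrete: take a bimodal "teacher" density over a scalar motion statistic, and compare a student that collapses onto one mode against a student that covers both.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def teacher_pdf(x):
    """Bimodal teacher: two well-separated motion modes (e.g. slow vs. fast)."""
    return 0.5 * normal_pdf(x, -4, 1) + 0.5 * normal_pdf(x, 4, 1)

def mode_seeker(x):
    """Collapsed student: keeps only one mode (the low-motion shortcut)."""
    return normal_pdf(x, 4, 1)

def mass_coverer(x):
    """Covering student: spreads probability mass over both modes."""
    return normal_pdf(x, 0, 4)

def kl(p, q, lo=-12.0, hi=12.0, n=4000):
    """Numerical KL(p || q) via a midpoint rule on [lo, hi]."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        px, qx = p(x), q(x)
        if px > 0 and qx > 0:
            total += px * math.log(px / qx) * dx
    return total

# Reverse KL (student || teacher): the distillation objective's direction.
# Collapsing onto one mode costs only about log 2, less than covering does,
# so reverse KL rewards the collapse.
print(kl(mode_seeker, teacher_pdf), kl(mass_coverer, teacher_pdf))

# Forward KL (teacher || student): heavily penalises the dropped mode,
# because teacher mass falls where the collapsed student is near zero.
print(kl(teacher_pdf, mode_seeker), kl(teacher_pdf, mass_coverer))
```

In this toy, reverse KL prefers the collapsed student while forward KL strongly prefers the covering one, which is the intuition behind adding a forward-KL regulariser to preserve motion diversity during self-rollout distillation.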

Results

Qualitative comparison on 20-second generation.

Qualitative comparison of HiAR against other distilled autoregressive video models at 20 seconds
HiAR maintains colour fidelity, structural coherence, and visual stability over long rollouts, while baseline autoregressive methods accumulate visible drift.

BibTeX

The paper is currently under review, so the entry below is given in preprint form rather than with a conference name.

@article{zou2026hiar,
  title={HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising},
  author={Zou, Kai and Zheng, Dian and Liu, Hongbo and Hang, Tiankai and Liu, Bin and Yu, Nenghai},
  journal={Under review},
  year={2026}
}