Researchers have developed a novel approach to training video-generating artificial intelligence systems that promises significant speedups in both development and deployment. The technique, called Next Forcing, addresses fundamental bottlenecks in how machines learn to predict and simulate realistic video sequences.
According to a paper posted on arXiv, a team led by Gangwei Xu introduced a multi-chunk prediction framework designed to improve world action models, a class of AI systems trained to understand and forecast how environments change over time. These models have become central to robotics research and video synthesis applications.
The Core Problem
Existing video generation systems trained with autoregressive methods, which predict one frame or group of frames at a time, face a fundamental limitation. The training signal comes only from the immediately following frame or chunk, so the model receives no explicit guidance about dynamics further into the future. The result is slow training convergence and reduced accuracy, particularly when generating video at high frame rates.
The inference stage presents its own challenges. Generating video sequentially through iterative denoising steps creates computational bottlenecks that slow real-time applications.
Multi-Horizon Learning Strategy
Next Forcing addresses these issues by borrowing concepts from recent advances in large language models that employ multi-token prediction. The framework augments the primary model with auxiliary lightweight modules that simultaneously denoise video segments at multiple future temporal depths, designated as next-1, next-2, and next-3 chunks.
These auxiliary modules form a causal chain where intermediate features from the main model feed predictions at successive depths. Critically, near-term predictions inform farther-horizon ones, creating dense supervision signals that propagate backward through the entire model during training.
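The chained multi-horizon supervision described above can be illustrated with a toy sketch. It is not the authors' implementation: the function names, the loss weights, and the list-based "features" are assumptions made for illustration, and the denoising step is left out so that only the target layout and the causal chain remain visible.

```python
def supervision_pairs(num_chunks, depths=(1, 2, 3)):
    """Return (context_index, target_index, depth) triples.

    depths=(1,) reproduces plain next-chunk training; (1, 2, 3) adds the
    next-2 and next-3 targets described above.
    """
    return [(t, t + d, d)
            for t in range(num_chunks)
            for d in depths
            if t + d < num_chunks]


def mse(pred, target):
    """Mean squared error between two equal-length lists of floats."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def chained_loss(chunks, t, heads, weights=(1.0, 0.5, 0.25)):
    """Sum the weighted losses of a causal chain of heads at position t.

    Head k consumes the feature produced by head k-1, so the depth-1
    prediction feeds the depth-2 prediction, and so on. The weights are
    an assumption of this sketch, not values from the paper.
    """
    feature = chunks[t]  # stand-in for the main model's intermediate features
    total = 0.0
    for depth, (head, weight) in enumerate(zip(heads, weights), start=1):
        if t + depth >= len(chunks):
            break  # no ground-truth chunk exists this far ahead
        feature = head(feature)
        total += weight * mse(feature, chunks[t + depth])
    return total


# Toy data: each chunk is a short list of values that rises by 1 per chunk.
chunks = [[float(i)] * 2 for i in range(4)]
# Hypothetical heads that each advance the feature by one chunk.
heads = [lambda f: [x + 1.0 for x in f]] * 3

print(len(supervision_pairs(4, depths=(1,))))  # next-chunk targets only
print(len(supervision_pairs(4)))               # targets at depths 1, 2 and 3
print(chained_loss(chunks, 0, heads))
```

With four chunks, the number of supervised (context, target) pairs rises from three to six once depths 2 and 3 are included, which is the sense in which the supervision becomes denser. Because each head reuses the previous head's output, an error at depth 1 also raises the loss at depths 2 and 3.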
Measured Improvements
Benchmark results demonstrate substantial gains. On robotics-focused tasks at 50 frames per second, Next Forcing achieved a 93.1% relative improvement over prior methods at 5,000 training steps and converged 2.3 times faster. The approach established new performance records on the RoboTwin benchmark, reaching 94.1% accuracy on clean data and 93.5% on random variations.
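For readers unfamiliar with the two kinds of figure quoted here, the sketch below shows how a relative improvement and a convergence speedup are conventionally computed. It assumes the usual definitions, which the summary does not spell out, and the input numbers are hypothetical placeholders rather than values from the paper.

```python
def relative_improvement(baseline, new):
    """Percentage gain of `new` over `baseline`, relative to the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


def convergence_speedup(baseline_steps, new_steps):
    """How many times fewer training steps the new method needs."""
    return baseline_steps / new_steps


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
print(relative_improvement(baseline=20.0, new=30.0))
print(convergence_speedup(baseline_steps=23_000, new_steps=10_000))
```

Under these definitions, a relative improvement describes a ratio between two scores, so on its own it does not reveal either absolute score.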
Beyond robotics, the method showed pronounced effectiveness in physics-aware video generation. Testing on PhyWorld, a benchmark measuring adherence to physical laws, revealed substantial improvements. General video pretraining tasks saw frame-level distortion metrics cut by more than 50%.
Inference Acceleration
During deployment, retaining the auxiliary modules enables parallel prediction of the next video segment while processing the current one. This architectural choice delivers twofold inference acceleration, a meaningful improvement for real-time applications requiring rapid video synthesis.
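A rough way to see where a twofold figure could come from is to count sequential model passes. The sketch below is an idealized accounting, not the paper's measurement: it assumes that one round of denoising yields the current chunk and, through an auxiliary module, the next chunk as well, and that the auxiliary module adds negligible time. The function names and the chunk and step counts are hypothetical.

```python
import math


def sequential_passes(num_chunks, denoise_steps):
    """Baseline: every chunk needs its own full round of denoising steps."""
    return num_chunks * denoise_steps


def paired_passes(num_chunks, denoise_steps):
    """Idealized case: each round of denoising yields two chunks at once."""
    return math.ceil(num_chunks / 2) * denoise_steps


def ideal_speedup(num_chunks, denoise_steps):
    """Ratio of baseline passes to paired passes."""
    return (sequential_passes(num_chunks, denoise_steps)
            / paired_passes(num_chunks, denoise_steps))


print(ideal_speedup(num_chunks=8, denoise_steps=4))
print(ideal_speedup(num_chunks=7, denoise_steps=4))
```

In this simplified model the ratio reaches exactly two only when the number of chunks is even, and any real overhead from the auxiliary modules would lower it.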
The research highlights a broader trend in machine learning: designing training objectives that provide richer supervision signals often yields models that train faster and perform better. By forcing the system to make predictions at multiple temporal scales simultaneously, the researchers harness intermediate-layer information that traditional approaches leave unused.
For the robotics and video synthesis communities, Next Forcing offers a practical pathway toward faster model development cycles and more responsive inference pipelines, potentially accelerating deployment of AI systems that understand and anticipate visual dynamics.
