A team of AI researchers has developed a targeted evaluation framework that isolates a critical weakness in today's most advanced vision-language models: the ability to remember and act on visual information that is no longer directly visible.
The research, published on arXiv, introduces RNG-Bench, a benchmark designed to measure how well multimodal large language models reconstruct prior observations during multi-step interactions. Unlike existing evaluation suites that either reveal complete environmental state or bundle memory reconstruction with other capabilities, RNG-Bench focuses specifically on separating memory performance from decision-making quality.
Two Games That Stress Test Memory
The benchmark includes two complementary games that probe different aspects of visual memory. Matching Pairs requires models to recall card identities briefly shown at specific locations, then correctly identify matching pairs later in an episode. The 3D Maze task challenges models to combine a sequence of first-person views into a single coherent spatial map.
Difficulty scales across three dimensions: grid size, visual pattern complexity, and observation modality. The most demanding configurations require processing approximately 128,000 tokens and 350 image inputs per episode, a load at which current frontier models no longer saturate the benchmark.
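To make the Matching Pairs setup concrete, here is a minimal sketch of a toy version of the game played by an agent with perfect memory. It is not the benchmark's implementation: the function name `play_matching_pairs`, the two-flips-per-turn rule, and the flip-count score are all assumptions chosen for illustration.

```python
import random


def play_matching_pairs(num_pairs, seed=0):
    """Play one toy episode with a perfect-memory agent; return total card flips.

    Hypothetical rules: every turn flips exactly two cards, and a pair is
    removed from play when the two flipped cards match.
    """
    rng = random.Random(seed)
    cards = list(range(num_pairs)) * 2
    rng.shuffle(cards)

    memory = {}      # position -> identity, for observed but unmatched cards
    matched = set()  # positions already removed as part of a matched pair
    flips = 0

    def known_pair():
        """Return two remembered positions with the same identity, if any."""
        by_identity = {}
        for pos, ident in memory.items():
            if ident in by_identity:
                return by_identity[ident], pos
            by_identity[ident] = pos
        return None

    while len(matched) < len(cards):
        pair = known_pair()
        if pair is not None:
            a, b = pair  # memory alone is enough to score this turn
        else:
            unseen = [p for p in range(len(cards))
                      if p not in memory and p not in matched]
            a = unseen[0]
            memory[a] = cards[a]  # the first flip reveals card a
            partner = [p for p, ident in memory.items()
                       if ident == cards[a] and p != a]
            # Use the remembered partner if there is one, else explore.
            b = partner[0] if partner else unseen[1]
            memory[b] = cards[b]
        flips += 2
        if cards[a] == cards[b]:
            matched.update((a, b))
            del memory[a], memory[b]
    return flips


if __name__ == "__main__":
    for pairs in (2, 8, 18):
        print(pairs, "pairs:", play_matching_pairs(pairs, seed=1), "flips")
```

With perfect recall, this agent never needs more than four flips per pair. A player that loses track of earlier cards has to re-flip positions it has already seen, so the flip count climbs as memory degrades and the grid grows.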
What the Data Reveals
The paper, by Shengyuan Ding and colleagues, introduces a "Memory Gap" metric that distinguishes between two failure modes: models forgetting prior observations versus models remembering information but making poor decisions. The authors' analysis shows that most errors stem from degraded memory of earlier observations rather than suboptimal action selection given available context.
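One way to picture such a decomposition is sketched below. The article does not give the paper's exact formula, so this is an assumed formulation: the same model is scored once while relying on its own recollection (`memory_score`) and once with its earlier observations restated for it (`oracle_score`), and both are compared with the best achievable score (`optimal_score`). All three names and the function `decompose_errors` are hypothetical.

```python
def decompose_errors(optimal_score, oracle_score, memory_score):
    """Split a model's shortfall into a memory part and a decision part.

    Assumed setup, not the paper's definition:
      optimal_score: best achievable score on the task
      oracle_score:  model's score when past observations are restated for it
      memory_score:  model's score when it must rely on its own recollection
    Higher scores are better in all three cases.
    """
    memory_gap = oracle_score - memory_score      # lost by forgetting
    decision_gap = optimal_score - oracle_score   # lost despite full information
    total_gap = optimal_score - memory_score
    memory_share = memory_gap / total_gap if total_gap > 0 else 0.0
    return {
        "memory_gap": memory_gap,
        "decision_gap": decision_gap,
        "memory_share": memory_share,
    }


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    print(decompose_errors(optimal_score=1.0, oracle_score=0.9, memory_score=0.5))
```

Under this reading, a large `memory_share` means most of the shortfall disappears once the model is reminded of what it saw, which is the pattern the researchers report.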
The researchers also implemented a head-to-head duel protocol to reduce variance from individual test instances, creating more reliable comparative measurements across different model variants.
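The idea behind a paired comparison can be sketched in a few lines. The article does not describe the protocol's scoring rules, so the version below is an assumption: both models play the same list of instances, each instance awards one point to the higher scorer, and ties split the point. The function name `duel_win_rate` is hypothetical.

```python
def duel_win_rate(scores_a, scores_b):
    """Return model A's win rate against model B over shared instances.

    scores_a[i] and scores_b[i] must come from the same test instance
    (for example, the same card layout or maze seed). Higher is better.
    A win counts 1 point, a tie 0.5, a loss 0.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both models need scores on the same non-empty instance list")
    points = 0.0
    for a, b in zip(scores_a, scores_b):
        if a > b:
            points += 1.0
        elif a == b:
            points += 0.5
    return points / len(scores_a)


if __name__ == "__main__":
    # Illustrative scores only: A wins instance 0, ties instance 1, loses instance 2.
    print(duel_win_rate([3, 5, 2], [1, 5, 4]))
```

Because each comparison is made within a single instance, an unusually easy or hard layout affects both models equally and does not tilt the result toward whichever model happened to draw it.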
A Path Toward Improvement
The team demonstrated that targeted fine-tuning can improve performance on these memory-intensive tasks. Training a 9-billion-parameter version of Qwen on optimal policy demonstrations and filtered model rollouts improved RNG-Bench scores while maintaining performance on existing general-purpose benchmarks, suggesting that memory improvements need not come at the cost of broader capabilities.
Key Takeaways
- Current frontier models struggle with tasks requiring sustained visual memory across 100+ step episodes
- Memory degradation, not poor reasoning, accounts for most failures in sequential decision tasks
- Fine-tuning approaches can improve memory without sacrificing general multimodal understanding
The work highlights a gap between how vision-language models perform on static understanding tasks and how well they maintain and use visual state information during extended interactions. As these models move toward real-world deployment in robotics and autonomous systems, where closed-loop control depends on remembering observations from earlier steps, this memory limitation becomes increasingly consequential.
The benchmark itself is designed to support future research by providing a controlled environment where memory requirements scale cleanly and where researchers can pinpoint exactly where models falter in reconstructing the past.
