A new peer-reviewed study challenges a widespread assumption in artificial intelligence research: that training models on both images and text produces representations more aligned with how humans actually process language.

Researchers from multiple institutions conducted a controlled comparison of language-only and vision-language model pairs, using functional MRI brain scans and eye-tracking data from human readers to measure alignment with natural cognitive processes. According to the paper, posted on arXiv, the findings indicate that multimodal pretraining does not provide consistent, broad advantages in human alignment when the models process text alone.

The work matters because it forces the AI research community to reconsider why multimodal models perform well on downstream tasks. The prevailing theory has held that exposure to both visual and linguistic information during training shapes internal representations in ways that better mirror human language comprehension. This study suggests that explanation may be incomplete.

Controlling for Visual Context

The research design was deliberately rigorous. Instead of comparing models in real-world settings where visual inputs might influence reasoning, the team evaluated both language-only models (LLMs) and vision-language models (VLMs) on a text-only reading task. Because no images were supplied at evaluation time, this setup removed any online visual processing or cross-modal fusion, so differences between paired models could only reflect what each had learned during pretraining.

Participants read natural text while researchers simultaneously recorded whole-brain fMRI activity and tracked eye movements. These measurements provided two independent windows into how closely each model's internal states resembled actual human neural and behavioral patterns during reading.
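
The article does not say which alignment metric the study used. As a hedged illustration only, one widely used approach is representational similarity analysis (RSA): build a dissimilarity structure over sentences from a model's hidden states, build another from brain responses to the same sentences, and correlate the two. The sketch below is a simplified pure-Python version that uses Pearson correlation throughout. The function names and toy vectors are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


def dissimilarities(vectors):
    """Return 1 - r for every pair of items (upper triangle of the matrix)."""
    return [1.0 - pearson(a, b) for a, b in combinations(vectors, 2)]


def rsa_score(model_vectors, brain_vectors):
    """Correlate the two dissimilarity structures; higher means closer geometry."""
    return pearson(dissimilarities(model_vectors), dissimilarities(brain_vectors))


# Hypothetical toy data: one row per sentence, invented for illustration.
# Rows 0-1 resemble each other, as do rows 2-3, in both spaces.
model_states = [
    [1.0, 0.0, 2.0],
    [0.9, 0.1, 2.1],
    [0.0, 2.0, 1.0],
    [0.2, 1.8, 0.9],
]
brain_responses = [
    [0.5, 1.5, 0.2, 0.1],
    [0.6, 1.4, 0.3, 0.1],
    [1.5, 0.2, 0.9, 1.0],
    [1.4, 0.3, 1.0, 0.9],
]

print(rsa_score(model_states, brain_responses))
```

In a setup like the one described, the same score would be computed for each member of an LLM/VLM pair against the same human data, and the two values compared.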

Selective Rather Than Universal Benefits

The findings paint a nuanced picture. Multimodal pretraining did not universally improve alignment with human brain activity or reading patterns across all text types. Instead, the advantage emerged selectively: VLMs showed better alignment specifically when sentences carried strong visual semantic content.

  • Brain imaging and eye-tracking data converged on this pattern
  • Language-internal representations remained the primary driver of human-like text processing
  • Visual learning history contributed incrementally rather than transformatively

This distinction matters for model development. It suggests that claims about multimodal training producing fundamentally more human-aligned language systems may overstate the evidence. The neural representations that guide comprehension of abstract concepts, temporal relationships, and non-visual semantic content appear shaped primarily by linguistic structure rather than multimodal exposure.
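
The selective pattern described in this section can be made concrete with a toy calculation: split sentences by how visual their content is, then compare the average VLM-minus-LLM alignment gain in each group. This is a minimal sketch under stated assumptions. The `visual` ratings, the alignment scores, the threshold, and the function name are all invented placeholders, not values or methods from the study.

```python
from statistics import mean


def alignment_gain_by_group(records, threshold):
    """Mean VLM-minus-LLM alignment for high- and low-visual sentences.

    Each record is a dict with hypothetical keys:
      "visual": rating of how visual the sentence is (0 to 1)
      "llm", "vlm": alignment scores for the paired models
    """
    high = [r["vlm"] - r["llm"] for r in records if r["visual"] >= threshold]
    low = [r["vlm"] - r["llm"] for r in records if r["visual"] < threshold]
    if not high or not low:
        raise ValueError("threshold leaves one group empty")
    return {"high_visual": mean(high), "low_visual": mean(low)}


# Invented placeholder numbers, chosen only to show the shape of the analysis.
toy_records = [
    {"visual": 0.9, "llm": 0.30, "vlm": 0.38},
    {"visual": 0.8, "llm": 0.28, "vlm": 0.34},
    {"visual": 0.2, "llm": 0.31, "vlm": 0.31},
    {"visual": 0.1, "llm": 0.29, "vlm": 0.30},
]

gains = alignment_gain_by_group(toy_records, threshold=0.5)
print(gains)
```

A selective benefit of the kind the article reports would show up as a clearly larger gain in the high-visual group than in the low-visual group, rather than a uniform gain across both.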

Implications for Model Design

The study offers what the researchers describe as a controlled experimental framework for testing how visual learning history influences model-to-human alignment. Rather than settling a debate, the work opens new questions about which architectural choices and training approaches actually map onto human cognition.

For practitioners, the takeaway is clear: multimodal pretraining should not be treated as a universal enhancement for language understanding. Its benefits are context-dependent. Models designed primarily for text processing may not gain meaningful human-alignment improvements from vision pretraining, while models handling visually rich domains might see targeted gains.

"Language-internal representations remain the key factor for modeling human text processing," the authors conclude, emphasizing that linguistic structure and semantics drive the alignment more than cross-modal exposure.

This research adds empirical rigor to ongoing debates about model architecture and training objectives. As the field moves toward ever-larger multimodal systems, understanding precisely where visual pretraining creates value becomes essential for efficient resource allocation and realistic capability assessment.