A research team has unveiled a novel architecture for training robotic systems that directly incorporates 3D geometric understanding into policy learning, addressing a fundamental limitation in how current AI-driven robots perceive and interact with their physical environment.
The work, posted to arXiv by researchers from institutions including ETH Zurich and Carnegie Mellon University, proposes the Geometric Action Model (GAM), which restructures language-conditioned robot policies. Rather than relying on 2D image representations or latent spaces derived from 2D data, GAM uses a pretrained geometric foundation model as a single backbone for perception, prediction, and action generation.
A New Approach to Robot Vision and Control
Current state-of-the-art robot learning systems, including vision-language-action models and video world-action models, inherit useful semantic and temporal information from large foundation models. However, these systems typically process information as 2D frames or 2D-derived representations, which obscures the three-dimensional spatial relationships critical for tasks requiring object manipulation and precise contact.
GAM addresses this gap by splitting a geometric foundation model at an intermediate layer. The shallow layers act as an observation encoder for visual input. A causal predictor inserted at the split point forecasts future latent representations, conditioned on language instructions, proprioceptive feedback from the robot's joints, and past actions. The predicted future tokens then pass through the remaining layers of the model, which produce both the predicted future geometry and the robot's actions.
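The data flow described above can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the layers are simple functions on lists of floats, and the names `split_backbone`, `toy_predictor`, and `gam_step`, along with the two output heads, are hypothetical stand-ins chosen only to show where the split and the predictor sit.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def run_layers(layers: Sequence[Layer], tokens: Vector) -> Vector:
    """Apply a stack of layers in order."""
    for layer in layers:
        tokens = layer(tokens)
    return tokens


def split_backbone(
    layers: Sequence[Layer], split_at: int
) -> Tuple[List[Layer], List[Layer]]:
    """Cut the layer stack into a shallow encoder and the remaining deep layers."""
    if not 0 < split_at < len(layers):
        raise ValueError("split_at must fall strictly inside the layer stack")
    return list(layers[:split_at]), list(layers[split_at:])


def toy_predictor(
    latent_history: Sequence[Vector],
    language: Vector,
    proprio: Vector,
    past_actions: Sequence[Vector],
) -> Vector:
    """Hypothetical stand-in for the causal predictor.

    It only sees latents up to the current step and shifts the latest one by
    a scalar summary of the conditioning signals. A real predictor would be a
    learned network; this is an assumption made for illustration.
    """
    latest = latent_history[-1]
    context = sum(language) + sum(proprio) + sum(sum(a) for a in past_actions)
    return [x + context for x in latest]


def gam_step(
    layers: Sequence[Layer],
    split_at: int,
    frames: Sequence[Vector],
    language: Vector,
    proprio: Vector,
    past_actions: Sequence[Vector],
) -> Tuple[Vector, Vector]:
    """Encode observations, predict a future latent, then decode it."""
    shallow, deep = split_backbone(layers, split_at)
    # 1. Shallow layers encode each observed frame into a latent.
    latent_history = [run_layers(shallow, frame) for frame in frames]
    # 2. The predictor forecasts the future latent at the split point.
    future_latent = toy_predictor(latent_history, language, proprio, past_actions)
    # 3. The remaining layers process the predicted future tokens.
    features = run_layers(deep, future_latent)
    # 4. Two hypothetical heads: geometry (identity) and action (mean).
    future_geometry = list(features)
    action = [sum(features) / len(features)]
    return future_geometry, action
```

With a three-layer toy backbone (add 1, multiply by 2, subtract 1) split after the first layer, two frames `[0.0, 1.0]` and `[1.0, 2.0]`, and conditioning values that sum to 2.0, the sketch returns geometry `[7.0, 9.0]` and action `[8.0]`. The point is only the ordering of the stages: the pretrained layers are reused on both sides of the split, and the predictor is the only new component in between.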
According to the paper, this design enables the geometric foundation model to perform language-conditioned temporal world modeling while requiring only minimal changes to its original architecture, thereby preserving the rich geometric priors the model learned during pretraining.
Performance Across Real and Simulated Tasks
The team evaluated GAM on benchmarks spanning both simulated environments and physical robot experiments. The authors report substantial improvements over current foundation-model-scale baselines in four areas:
- Accuracy in predicting correct robot actions and outcomes
- Robustness when facing objects and configurations not seen during training
- Computational speed during inference
- Model size and memory requirements
The approach is particularly significant for contact-rich manipulation tasks, where understanding exact 3D geometry and spatial constraints becomes essential. Tasks like grasping irregularly shaped objects, inserting pegs into holes, or manipulating deformable materials rely heavily on accurate geometric reasoning that 2D representations struggle to capture.
Implications for Embodied AI
This work represents a meaningful shift in how researchers approach the challenge of building general-purpose robotic policies. By grounding the learning process in explicit geometric reasoning rather than treating it as an implicit byproduct of vision and language models, the researchers provide a cleaner path toward robots that can reliably execute complex physical instructions.
The efficiency gains are particularly noteworthy for practical deployment. Smaller, faster models that perform better could significantly reduce the computational resources required to run robot control systems, opening possibilities for edge deployment on less powerful hardware.
As robots move from controlled laboratory settings into unstructured real-world environments, the ability to reason about 3D geometry while following language instructions becomes increasingly critical. GAM's approach suggests that the next generation of capable robot systems may depend less on scaling up massive foundation models and more on building geometric structure into their core architectures.
