Schulze Buschoff, L. M., Akata, E., Bethge, M., & Schulz, E. (2025). Visual cognition in multimodal large language models. Nature Machine Intelligence, 1-11.
This paper evaluates how well multimodal large language models (LLMs) emulate human-like cognitive abilities in intuitive physics, causal reasoning, and intuitive psychology. Through a series of controlled visual question answering experiments, the researchers tested several LLMs on tasks drawn from the cognitive science literature. While some models, particularly GPT-4V and Claude-3 Opus, demonstrated proficiency in processing and interpreting visual data and performed above chance in certain areas, none fully matched human-level performance or captured the nuances of human behavior in these domains. The study reveals significant shortcomings in the models' understanding of complex physical interactions, causal relationships, and social cognition, underscoring the need to build more robust mechanisms for these abilities into models and to develop cognitively inspired benchmarks for evaluating AI systems.