Pixels, Patterns, but No Poetry: To See The World like Humans
Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhancing reasoning capabilities in MLLMs, a fundamental question persists: can MLLMs truly perceive the world as humans do? This paper shifts focus from reasoning to perception. Rather than constructing benchmarks specifically for reasoning, we introduce the Turing Eye Test (TET), a challenging perception-oriented benchmark comprising four diagnostic tasks that evaluate MLLMs’ performance on synthetic images that humans process intuitively. Our findings reveal that state-of-the-art MLLMs exhibit catastrophic failures on our perceptual tasks, which are trivial for humans. Both in-context learning and training the language backbone, which were effective for previous benchmarks, fail to improve performance on our tasks, while fine-tuning the vision tower enables rapid adaptation. This suggests that our benchmark poses challenges for vision tower generalization rather than for the knowledge and reasoning capabilities of the language backbone, a key gap between current MLLMs and human perception. This is a preliminary version that only contains a subset of TET tasks.
Introduction. Large Language Models (LLMs) have demonstrated powerful capabilities across various tasks (Radford et al., 2018; 2019; Brown et al., 2020; Achiam et al., 2023). This breakthrough has catalyzed the development of multimodal architectures that extend beyond text to encompass visual understanding. Compared to LLMs, the main characteristic of Multimodal Large Language Models (MLLMs) is their ability to directly recognize and understand images (Liu et al., 2023; 2024a;b). The most common approach is to integrate a vision encoder with a language model (Li et al., 2023; Chen et al., 2024b;a). The vision encoder first extracts features from the image, and these features are then projected into the language model’s embedding space, so that the model can process visual and text inputs together and generate a coherent response.
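The projection step described above can be illustrated with a minimal sketch. It assumes a single linear projection and simple concatenation of visual and text tokens; the function names, dimensions, and random weights are hypothetical and chosen only for illustration, and real MLLMs may use a different connector (for example, an MLP or a query-based module).

```python
import random


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random values (stand-in for real weights or features)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def project(features, weight):
    """Linearly map each feature vector of length d_vision to length d_model.

    `weight` has shape d_vision x d_model, so zip(*weight) iterates over its columns.
    """
    return [
        [sum(f * w for f, w in zip(feat, col)) for col in zip(*weight)]
        for feat in features
    ]


def build_input_sequence(patch_features, text_embeddings, weight):
    """Project visual features into the text embedding space and prepend them to the text tokens."""
    visual_tokens = project(patch_features, weight)
    return visual_tokens + text_embeddings


if __name__ == "__main__":
    rng = random.Random(0)
    d_vision, d_model = 8, 16                       # hypothetical feature sizes
    patches = make_matrix(4, d_vision, rng)         # 4 image patches from a vision encoder
    text = make_matrix(3, d_model, rng)             # 3 text token embeddings
    projector = make_matrix(d_vision, d_model, rng)  # assumed linear connector
    sequence = build_input_sequence(patches, text, projector)
    print(len(sequence), len(sequence[0]))
```

In this sketch the language model would consume `sequence` exactly as it consumes ordinary token embeddings, which is why the quality of the visual tokens produced upstream matters for everything the language backbone does afterwards.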
Discussion / Conclusion. In this study, we introduced the Turing Eye Test (TET), a perception-oriented benchmark that reveals fundamental limitations in the visual understanding capabilities of current MLLMs. Through four diagnostic tasks involving concealed text, 3D Captchas, Chinese character compositions, and color blind test charts, we demonstrated that state-of-the-art MLLMs exhibit catastrophic failures on perceptual tasks that humans solve intuitively. Our analysis reveals that these failures stem from limitations in the vision tower’s generalization abilities rather than deficiencies in language reasoning or knowledge. While in-context learning and language backbone fine-tuning proved ineffective, targeted fine-tuning of the vision tower enabled rapid adaptation, highlighting a critical gap between current MLLM architectures and human-like visual perception. These findings underscore the need for improved visual generalization methods in MLLMs and establish TET as a valuable diagnostic tool for evaluating genuine perceptual capabilities beyond traditional reasoning-focused benchmarks.