Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models
Large Language Models (LLMs), whether closed-weights ones such as GPT-3.5/4, Claude, and Gemini or open-weights ones such as LLaMa 2/3, Mistral, Mixtral, and the more recent Dbrx and Command R+, are often described as instances of foundation models: that is, models that transfer strongly across various tasks and conditions in a few-shot or zero-shot manner, while exhibiting scaling laws that predict function improvement when increasing the pre-training scale. These claims of excelling in different functions and tasks rely on measurements taken across various sets of standardized benchmarks showing high scores for such models. We demonstrate here a dramatic breakdown of function and reasoning capabilities of state-of-the-art models that are trained at the largest available scales and claimed to have strong function, using a simple, short, conventional common sense problem formulated in concise natural language, easily solvable by humans. The breakdown is dramatic, as models also express strong overconfidence in their wrong solutions, while often providing nonsensical "reasoning"-like explanations akin to confabulations to justify and back up the validity of their clearly failed responses, making them sound plausible.
Introduction. In the recent breakthroughs in transferable learning achieved in various classical domains of machine learning, such as visual recognition [1] and language understanding [2, 3, 4], large language models (LLMs) have played a very prominent role. Auto-regressive language modelling by next-token prediction using causal mask losses was among the first successful approaches to self-supervised learning that was also scalable, both in terms of available web-scale text data and model scale [4]. The generic form and scalability of this type of learning then made it possible to push towards training scales not achievable before with conventional supervised label-based learning, and provided a glimpse of what happens at those larger scales. Scaling laws derived via training experiments on much smaller scales
Discussion / Conclusion. Using a very simple AIW problem formulation that can be easily solved by adults and arguably even children, we observe here a striking breakdown in the performance of SOTA LLMs when confronted with the task. This dramatic breakdown hints at serious deficits in basic reasoning capabilities in models that are widely claimed to possess strong function and reasoning skills, claims that often cite the models' performance on a set of standardized benchmarks or the experience of various user groups or their creators. The overall breakdown and the strong fluctuation of observed performance across variations of the same problem also hint at fundamental issues with the generalization capability of the models, which echoes and confirms concerns expressed in a number of previous works [48, 13, 15]