Break the Chain: Large Language Models Can be Shortcut Reasoners
Recent advancements in Chain-of-Thought (CoT) reasoning utilize complex modules but are hampered by high token consumption, limited applicability, and challenges in reproducibility. This paper conducts a critical evaluation of CoT prompting, extending beyond arithmetic to include complex logical and commonsense reasoning tasks, areas where standard CoT methods fall short. We propose the integration of human-like heuristics and shortcuts into language models (LMs) through "break the chain" strategies. These strategies disrupt traditional CoT processes using controlled variables to assess their efficacy. Additionally, we develop innovative zero-shot prompting strategies that encourage the use of shortcuts, enabling LMs to quickly exploit reasoning clues and bypass detailed procedural steps. Our comprehensive experiments across various LMs, both commercial and open-source, reveal that LMs maintain effective performance with "break the chain" strategies. We also introduce ShortcutQA, a dataset specifically designed to evaluate reasoning through shortcuts, compiled from competitive tests optimized for heuristic reasoning tasks such as forward/backward reasoning and simplification.
Introduction. In the evolving landscape of artificial intelligence, the ability to reason and solve complex problems is a cornerstone of intelligence. Language Models (LMs), particularly those based on transformer (Vaswani et al., 2017) architectures, have revolutionized our approach to natural language processing (NLP), significantly enhancing capabilities in comprehending and generating text. Among recent advancements, Chain-of-Thought (CoT) prompting has emerged as a pivotal technique for utilizing Large Language Models (LLMs) to address complex reasoning tasks. By methodically eliciting step-by-step reasoning, CoT prompting has significantly enhanced the problem-solving capabilities of LLMs across a variety of learning scenarios, including few-shot (Wei et al., 2022) and zero-shot contexts (Kojima et al., 2022a). Figure 1 illustrates a zero-shot example in which the ChatGPT model methodically resolves a mathematical question.
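To make the zero-shot setting concrete, the following is a minimal sketch of how such prompts might be assembled. The trigger phrase "Let's think step by step." is the one introduced for zero-shot CoT by Kojima et al. (2022); the function name `build_prompt`, the `direct` baseline wording, and in particular the shortcut-style instruction are hypothetical and chosen only for illustration. They should not be read as the prompts used in this paper.

```python
# Zero-shot CoT trigger phrase from Kojima et al. (2022).
COT_TRIGGER = "Let's think step by step."

# ASSUMPTION: hypothetical wording for illustration only; this is not
# the shortcut prompt used in the paper.
SHORTCUT_TRIGGER = (
    "Look for a shortcut and give the answer quickly, "
    "skipping detailed steps."
)

# ASSUMPTION: a plain answer-only baseline, also illustrative.
DIRECT_TRIGGER = "The answer is"

_TRIGGERS = {
    "cot": COT_TRIGGER,
    "shortcut": SHORTCUT_TRIGGER,
    "direct": DIRECT_TRIGGER,
}


def build_prompt(question: str, style: str = "cot") -> str:
    """Return a zero-shot prompt: the question followed by a trigger phrase."""
    if style not in _TRIGGERS:
        raise ValueError(f"unknown style: {style!r}")
    return f"Q: {question}\nA: {_TRIGGERS[style]}"


if __name__ == "__main__":
    question = "If 3 pens cost 6 dollars, how much do 5 pens cost?"
    for style in ("cot", "shortcut", "direct"):
        print(build_prompt(question, style))
        print("---")
```

Keeping the question fixed and varying only the trigger phrase mirrors the controlled-variable comparison described in the abstract: any difference in model behavior can then be attributed to the instruction rather than to the question format.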
Discussion / Conclusion. Our experimental results corroborate the theoretical predictions, as illustrated in Figure 3. We observe that CoT accuracy generally declines as chain length increases. Notably, in tasks such as Coin Flip, where P(it) approaches 1, accuracy remains stable regardless of chain length. Conversely, in tasks such as SVAMP, where P(it) is lower, accuracy decreases as the chain lengthens. When "Quick Conclude" on SVAMP is compared against baseline accuracies, the relative CoT accuracy diminishes with increasing chain length, aligning with our model. Detailed methodologies for these experiments are available in Appendix D.
This study critically evaluates Chain-of-Thought (CoT) reasoning in language models, highlighting limitations such as high token consumption and limited applicability. Our "break the chain" strategies integrate human-like heuristics and shortcuts, enhancing efficiency without compromising performance across various models.
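The chain-length trend reported in the discussion can be illustrated with a simple compounding-error sketch. This excerpt does not state the paper's theoretical model, so the following rests on an assumption of ours: each reasoning step is independently correct with the same probability `p_step`, and a chain is correct only if every step is. Under that assumption, accuracy is `p_step ** n_steps`. The function and parameter names are hypothetical, and `p_step` is only loosely analogous to the per-step probability discussed above.

```python
def chain_accuracy(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps are correct.

    ASSUMPTION: steps are independent and share one success
    probability p_step. This is an illustrative model, not
    necessarily the one derived in the paper.
    """
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must lie in [0, 1]")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return p_step ** n_steps


if __name__ == "__main__":
    # A per-step probability near 1 stays nearly flat as the chain grows,
    # while a lower per-step probability decays noticeably.
    for p_step in (0.999, 0.9):
        row = ", ".join(
            f"n={n}: {chain_accuracy(p_step, n):.3f}" for n in (1, 2, 4, 8, 16)
        )
        print(f"p_step={p_step}: {row}")
```

Under this sketch, a per-step probability close to 1 keeps accuracy nearly constant across chain lengths, whereas a lower per-step probability makes accuracy fall off as the chain lengthens, which matches the qualitative contrast drawn between Coin Flip and SVAMP.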