Why do some questions perform better without step-by-step reasoning?
This note explores whether chain-of-thought prompting universally improves reasoning or whether simpler prompts work better for certain questions. The question matters because it challenges assumptions about how LLMs should be prompted to solve problems.
"Instance-adaptive Zero-shot Chain-of-Thought Prompting" (2024) uses neuron saliency score analysis to detect the mechanism underlying zero-shot CoT — why some prompts work for some instances and fail for others.
The finding: successful reasoning requires a specific information flow pattern across three components (question q, prompt p, rationale r). First, semantic information from the question must aggregate to the prompt. Then, reasoning steps must gather information from both the original question directly AND the synthesized question-prompt semantic information. When this flow is disrupted — when the prompt does not absorb question semantics, or when the rationale ignores the question — reasoning fails.
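The three flows described above can be made concrete with a minimal sketch. It assumes a token-level saliency matrix is already available, where saliency[i][j] scores how much token j contributes to token i; the note does not give the paper's exact saliency formula, so that matrix, the function names segment_flow and flow_profile, and the toy numbers are all illustrative assumptions rather than the authors' implementation.

```python
from statistics import mean


def segment_flow(saliency, src, dst):
    """Average saliency flowing from the `src` token span into the `dst` span.

    Assumption: saliency[i][j] scores how much token j contributes to token i,
    and only earlier tokens (j < i) can contribute in a causal model.
    """
    values = [saliency[i][j] for i in dst for j in src if j < i]
    return mean(values) if values else 0.0


def flow_profile(saliency, q, p, r):
    """Summarize the three flows the note describes for one instance."""
    return {
        "q_to_p": segment_flow(saliency, q, p),  # question semantics aggregating into the prompt
        "q_to_r": segment_flow(saliency, q, r),  # rationale reading the question directly
        "p_to_r": segment_flow(saliency, p, r),  # rationale reading the question-enriched prompt
    }


# Toy 6-token sequence with made-up scores: 2 question, 2 prompt, 2 rationale tokens.
toy_saliency = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.0, 0.0, 0.0, 0.0],
    [0.2, 0.8, 0.1, 0.0, 0.0, 0.0],
    [0.3, 0.5, 0.7, 0.9, 0.0, 0.0],
    [0.1, 0.3, 0.5, 0.7, 0.2, 0.0],
]
q, p, r = range(0, 2), range(2, 4), range(4, 6)

profile = flow_profile(toy_saliency, q, p, r)
print(profile)
```

On this reading, a low q_to_p value would correspond to a prompt that fails to absorb question semantics, and a low q_to_r value to a rationale that ignores the question. What counts as "low" is not specified in the note and would have to be calibrated empirically.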
The practical consequence is striking: "Don't think. Just feel." — generally regarded as a less favorable prompt — outperforms "Let's think step by step" on some simple questions. The step-by-step prompt can guide the LLM into bad reasoning on questions that could be straightforwardly answered. This is not random noise; the saliency analysis shows WHY: for simple questions, the step-by-step prompt introduces unnecessary intermediate structure that disrupts the direct question-to-answer information flow.
This extends the note "Why do chain-of-thought examples fail across different conditions?" from exemplar-level brittleness to instance-level brittleness. The problem is not just that different exemplars produce different results; the same prompt is fundamentally inappropriate for a subset of instances. For the question raised in "When does explicit reasoning actually help model performance?", the instance-adaptive finding provides the information-flow mechanism: logical derivation tasks route well through the prompt-mediated pathway, while simpler or judgment-based tasks are disrupted by it.
The implication for reasoning model design: a single universal reasoning prompt is a design error. The optimal prompt depends on the specific question-prompt interaction, not on the task category. Relative to the note "When should an agent actually stop and deliberate?", the instance-adaptive finding extends the principle from "when to deliberate" to "how to deliberate": the form of reasoning must adapt to the question, not just the decision of whether to reason.
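A minimal sketch of what per-instance prompt selection could look like, as opposed to one universal prompt. The note does not describe the paper's actual selection procedure, so select_prompt, score_fn, and the toy scores here are hypothetical stand-ins. In practice the score would come from some per-instance measurement of the question-prompt interaction.

```python
def select_prompt(question, candidates, score_fn):
    """Pick the candidate prompt with the highest score for this one question.

    `score_fn(question, prompt)` is a hypothetical callable that returns a
    number; higher means the prompt is expected to suit the question better.
    """
    if not candidates:
        raise ValueError("need at least one candidate prompt")
    return max(candidates, key=lambda prompt: score_fn(question, prompt))


CANDIDATES = ["Let's think step by step", "Don't think. Just feel."]

SIMPLE_QUESTION = "What is 2 + 2?"
LOGIC_QUESTION = "If all A are B and all B are C, are all A C?"

# Made-up scores for illustration only; they are not results from the paper.
TOY_SCORES = {
    (SIMPLE_QUESTION, "Let's think step by step"): 0.4,
    (SIMPLE_QUESTION, "Don't think. Just feel."): 0.7,
    (LOGIC_QUESTION, "Let's think step by step"): 0.9,
    (LOGIC_QUESTION, "Don't think. Just feel."): 0.3,
}


def toy_score(question, prompt):
    """Look up the illustrative score for a (question, prompt) pair."""
    return TOY_SCORES[(question, prompt)]


simple_choice = select_prompt(SIMPLE_QUESTION, CANDIDATES, toy_score)
logic_choice = select_prompt(LOGIC_QUESTION, CANDIDATES, toy_score)
print(simple_choice)
print(logic_choice)
```

The design point is that the selection key is the (question, prompt) pair rather than a task label, which mirrors the claim that the optimal prompt depends on the specific interaction and not on the task category.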
Inquiring lines that use this note as a source (91)
This note is a source for these synthesized inquiries.
- What makes colorless green ideas fail where Jabberwocky succeeds?
- Does chain-of-thought text causally drive reasoning or merely reflect it?
- What prompt types best extract different aspects of item content?
- Why does chain-of-thought reasoning hurt recommendation tasks specifically?
- Do recency-focused prompts and in-context examples work equally well for order recovery?
- What makes the prompt a fundamentally new kind of speech act?
- How does prompt scaffolding shift invisible labor onto the user?
- What distinguishes genuine reasoning activation from memorization-assisted answer recall?
- When should action deliberation trigger during reasoning steps?
- Can manipulative prompts reduce reasoning model accuracy without fine-tuning?
- Can prompting unlock compositional skills that pretraining already learned?
- Why do chain-of-thought prompts work if reasoning is not systematic?
- How much does annotator style actually influence chain-of-thought prompting performance?
- Why do logically invalid chain-of-thought examples work nearly as well?
- Does each reasoning step in chain-of-thought introduce cumulative error?
- How much does prompt format shape what reasoning strategy a model uses?
- Do tokens beyond a critical threshold actually improve reasoning quality?
- Can prompting for specific creative paradigms improve ideation diversity?
- How do ordering effects compound across different prompt component scales?
- Why do practitioners default to prompting without recognizing its limits?
- When should an LLM engage extended reasoning versus responding directly?
- Why does joint optimization of prompts and inference strategy outperform separate tuning?
- Why does explicit reasoning degrade passage reranking performance?
- Why does ad-hoc prompt engineering violate scientific method standards?
- Do self-correction and chain-of-thought prompting reduce hallucination rates?
- Can forcing warrant checking through structured prompts improve LLM reasoning?
- How do autoregressive models constrain where chain-of-thought prompts can be positioned?
- Why do simple math problems get worse with longer reasoning chains?
- Why does step-by-step reasoning degrade performance on judgment-based tasks?
- Can we predict when a specific prompt will fail on a given question?
- How should reasoning prompts adapt based on question complexity and type?
- Do reasoning models trade instruction following for deliberative capability?
- Can RL teach when to use reasoning versus when to respond directly?
- Why do entities trigger memorized propositions instead of enabling reasoning?
- How does prompt design alter what kind of creativity LLMs can express?
- Does chain-of-thought reasoning specifically improve performance on metalinguistic tasks?
- Can LLMs improve at simple deduction through different training approaches?
- Can chain-of-thought reasoning be genuinely causal if exemplars don't need logic?
- Which structural properties of CoT prompts matter most for performance?
- Does chain-of-thought reasoning amplify bullshit or just make it more visible?
- Do chain-of-thought explanations reveal genuine reasoning or trigger latent features?
- How can prompting help models gather information before attempting reasoning?
- Can prompt engineering improve reasoning or only move requests into denser regions?
- Why does chain of thought reasoning fail across different prompt formats?
- Why do some prompts benefit from aggregation while others do not?
- Should benchmark evaluations use multiple prompt formulations for difficult tasks?
- How do exemplar properties affect the brittleness of chain-of-thought prompting?
- What methodological standards should prompting research papers meet before publication?
- What happens when prompter skill matters more than domain expertise?
- When should a system decide to retrieve versus reason alone?
- Why do format and structure matter more than actual content in reasoning?
- How do retrieval heads enable chain-of-thought reasoning to reference earlier context?
- Why do chain-of-thought outputs look logical but perform rhetorically?
- How do reasoning training methods sacrifice some thinking skills while improving others?
- Does chain-of-thought prompting overcome implicit meaning deficits in text analysis?
- Can inference budgets be allocated differently based on prompt difficulty?
- What other triggers can activate the latent reasoning capability?
- Can runtime interventions like meta-cognitive prompting work where training interventions fail?
- Do prompting technique improvements actually replicate in controlled experiments?
- How should inference budgets adapt based on prompt difficulty?
- When should a system choose extended thinking versus quick responses?
- How should timing for reasoning intervention be determined during inference?
- Why do some reasoning steps receive negligible attention from later steps?
- When should verification steps be prioritized over progression steps?
- When is detailed step-by-step reasoning actually counterproductive for solving a problem?
- What makes a first answer so often the best answer a model produces?
- Does chain-of-thought reasoning help or hurt social reasoning tasks?
- Do shorter reasoning chains maintain instruction adherence better than longer ones?
- Do reward reasoning models with chain-of-thought reasoning evaluate prompts better?
- What is the optimal balance between search rounds and reasoning depth per round?
- Can structured prompts reduce reasoning steps while improving financial accuracy?
- How can prompt intervention reduce redundant reasoning steps dynamically?
- How do prompting and activation steering relate as compression strategies?
- What makes extended chains more vulnerable than standard prompts?
- Can operationalizing theory into prompt structure improve reasoning more than theory itself?
- Do scheme critical questions work better than direct scheme classification prompts?
- What is the distinction between teaching reasoning how versus when to activate?
- What makes answer equivalence sufficient to discard a reasoning path?
- What prompting techniques actually replicate under controlled statistical testing?
- Why does prompting discover capabilities that need reward-driven refinement?
- Does argument-scheme prompting improve reasoning in non-code domains the same way?
- Can structured questioning prompts improve reasoning beyond standard conversational training?
- Why does prompt optimization alone fail to inject genuinely new knowledge?
- Should prompt design and inference scaling be optimized together or separately?
- Why does chain-of-thought work for math but fail for grounding?
- How does latent reasoning recursion compare to chain-of-thought reasoning?
- Why do external feature triggers outperform uncertainty on complex questions?
- Do widely-repeated prompting heuristics like politeness actually improve accuracy?
- Can indirect and direct reasoning methods be combined to improve results?
- Why do prompt effects reverse between different model generations?
- What other pragmatic prompt features have unstable effects?
Related concepts in this collection (4)
- Why do chain-of-thought examples fail across different conditions?
  Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these brittleness dimensions could reveal what makes reasoning prompts robust or fragile.
  extends brittleness from exemplar-level to instance-level; same prompt fails on different instances
- When does explicit reasoning actually help model performance?
  Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem?
  information-flow mechanism explains why: logical tasks route through prompt mediation, judgment tasks are disrupted
- When should an agent actually stop and deliberate?
  How can models detect when deliberation over action choices is genuinely needed versus wasteful? This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors.
  extends "when to think" to "how to think" for each instance
- How much does demo position alone affect in-context learning accuracy?
  Moving demonstrations from prompt start to end without changing their content produces surprisingly large accuracy swings. Does spatial position in the prompt matter more than what demonstrations actually contain?
  another instance of prompt structure mattering more than content
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Skills-in-Context Prompting: Unlocking Compositionality in Large Language Models
- Invalid Logic, Equivalent Gains: The Bizarreness of Reasoning in Language Model Prompting
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- Reasoning Beyond Chain-of-Thought: A Latent Computational Mode in Large Language Models
- CDW-CoT: Clustered Distance-Weighted Chain-of-Thoughts Reasoning
- Large Language Models as an Indirect Reasoner: Contrapositive and Contradiction for Automated Reasoning
- Hierarchical Reasoning Model
- Zero-Shot Verification-guided Chain of Thoughts
Original note title
instance-adaptive prompting reveals that successful zero-shot CoT requires question-to-prompt information flow — some instances perform better without step-by-step reasoning