SYNTHESIS NOTE

Why do some questions perform better without step-by-step reasoning?

Explores whether chain-of-thought prompting universally improves reasoning or whether simpler prompts work better for certain questions. This matters because it challenges the assumption that LLMs should always be prompted to reason step by step.

Synthesis note · 2026-03-28 · sourced from Prompts Prompting

"Instance-adaptive Zero-shot Chain-of-Thought Prompting" (2024) uses neuron saliency score analysis to probe the mechanism underlying zero-shot chain-of-thought (CoT) prompting: why a given prompt works for some instances and fails for others.

The finding: successful reasoning requires a specific information-flow pattern across three components: the question q, the prompt p, and the rationale r. First, semantic information from the question must aggregate into the prompt. Then the reasoning steps must gather information both directly from the original question and from the synthesized question-prompt semantics. When this flow is disrupted, either because the prompt does not absorb the question's semantics or because the rationale ignores the question, reasoning fails.
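The three flows described above can be summarised from a token-level saliency matrix. The following is a minimal sketch, not the paper's implementation: it assumes a square matrix where entry [i][j] is the saliency of token j for token i, and that the question, prompt, and rationale occupy known, non-overlapping token spans. The names mean_flow, flow_summary, and the q_to_p / q_to_r / p_to_r keys are hypothetical, and averaging is only one possible way to aggregate.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open [start, end) range of token indices


def mean_flow(saliency: List[List[float]], src: Span, dst: Span) -> float:
    """Average saliency flowing from source tokens into destination tokens.

    Assumption: saliency[i][j] scores how much token j contributes to token i.
    """
    total, count = 0.0, 0
    for i in range(dst[0], dst[1]):        # destination (receiving) token
        for j in range(src[0], src[1]):    # source (contributing) token
            if j < i:                      # causal model: only earlier tokens contribute
                total += saliency[i][j]
                count += 1
    return total / count if count else 0.0


def flow_summary(
    saliency: List[List[float]], q: Span, p: Span, r: Span
) -> Dict[str, float]:
    """Summarise the three flows: question->prompt, question->rationale, prompt->rationale."""
    return {
        "q_to_p": mean_flow(saliency, q, p),
        "q_to_r": mean_flow(saliency, q, r),
        "p_to_r": mean_flow(saliency, p, r),
    }
```

Under this reading, a low q_to_p would correspond to a prompt that does not absorb the question's semantics, and a low q_to_r to a rationale that ignores the question. How the saliency matrix itself is obtained (for example from attention weights and their gradients) is outside this sketch.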

The practical consequence is striking: "Don't think. Just feel.", generally regarded as a less favorable prompt, outperforms "Let's think step by step" on some simple questions. The step-by-step prompt can steer the LLM into bad reasoning on questions it could have answered directly. This is not random noise. The saliency analysis shows why: for simple questions, the step-by-step prompt introduces unnecessary intermediate structure that disrupts the direct question-to-answer information flow.

This extends the note "Why do chain-of-thought examples fail across different conditions?" from exemplar-level brittleness to instance-level brittleness. The problem is not just that different exemplars produce different results; the same prompt is fundamentally inappropriate for a subset of instances. The note "When does explicit reasoning actually help model performance?" observes that explicit reasoning improves some tasks and hurts others, and the instance-adaptive finding provides the information-flow mechanism: logical derivation tasks route well through the prompt-mediated pathway, while simpler or judgment-based tasks are disrupted by it.

The implication for reasoning model design: a single universal reasoning prompt is a design error. The optimal prompt depends on the specific question-prompt interaction, not on the task category. Relative to the note "When should an agent actually stop and deliberate?", the instance-adaptive finding extends the principle from "when to deliberate" to "how to deliberate": the form of reasoning must adapt to the question, not just the decision of whether to reason.
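One way to picture per-instance prompt choice is to score each candidate prompt against the specific question and keep the best one. The sketch below is illustrative only and is not presented as the paper's algorithm. It assumes a hypothetical scorer callable that returns three flow scores (q_to_p, q_to_r, p_to_r) for a question-prompt pair, and it combines them with min on the assumption that reasoning fails if any single flow is weak. The numbers in TOY_FLOWS are made-up placeholders, not measured results.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical interface: maps (question, prompt) to flow scores.
FlowScorer = Callable[[str, str], Dict[str, float]]


def flow_quality(flows: Dict[str, float]) -> float:
    """Combine flows with min (an assumption): the weakest flow bounds the quality."""
    return min(flows["q_to_p"], flows["q_to_r"], flows["p_to_r"])


def select_prompt(
    question: str, prompts: Sequence[str], scorer: FlowScorer
) -> Tuple[str, float]:
    """Return the candidate prompt with the highest flow quality for this question."""
    if not prompts:
        raise ValueError("at least one candidate prompt is required")
    scored = [(flow_quality(scorer(question, p)), p) for p in prompts]
    best_score, best_prompt = max(scored, key=lambda pair: pair[0])
    return best_prompt, best_score


# Toy stand-in for a real saliency-based scorer; all values are placeholders.
TOY_FLOWS = {
    ("What is 2 + 3?", "Let's think step by step."):
        {"q_to_p": 0.6, "q_to_r": 0.1, "p_to_r": 0.7},
    ("What is 2 + 3?", "Don't think. Just feel."):
        {"q_to_p": 0.5, "q_to_r": 0.4, "p_to_r": 0.3},
}


def toy_scorer(question: str, prompt: str) -> Dict[str, float]:
    return TOY_FLOWS[(question, prompt)]


if __name__ == "__main__":
    candidates = ["Let's think step by step.", "Don't think. Just feel."]
    print(select_prompt("What is 2 + 3?", candidates, toy_scorer))
```

The selection is keyed on the question-prompt pair rather than on a task label, which mirrors the note's point that the optimal prompt depends on the specific interaction and not on the task category.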

Inquiring lines that read this note 92

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

Why do LLM research ideas score high on novelty yet collapse into low diversity?

What makes colorless green ideas fail where Jabberwocky succeeds?

Do reasoning traces faithfully represent or merely mimic actual model reasoning?

Can prompting inject entirely new knowledge into language models?

Can graph structure and relationships fundamentally improve recommendation systems?

Why does chain-of-thought reasoning hurt recommendation tasks specifically?

How do prompt structure and constraints affect model instruction reliability?

How do training data properties shape reasoning capability development?

How does latent reasoning compare to verbalized chain-of-thought?

How do adversarial and manipulative prompts attack reasoning models?

Can manipulative prompts reduce reasoning model accuracy without fine-tuning?

What actually drives chain-of-thought reasoning improvements in language models?

When do additional thinking tokens stop improving reasoning performance?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

How should inference compute be adaptively allocated based on prompt difficulty?

Why do benchmark improvements fail to reflect actual reasoning quality?

Can prompting strategies overcome LLM biases without model fine-tuning?

Can language model hallucination be prevented or only managed?

Do self-correction and chain-of-thought prompting reduce hallucination rates?

What structural advantages do diffusion language models offer over autoregressive methods?

How do autoregressive models constrain where chain-of-thought prompts can be positioned?

What capability tradeoffs emerge when scaling model reasoning abilities?

Does reinforcement learning teach reasoning or just when to reason?

Can RL teach when to use reasoning versus when to respond directly?

Why do reasoning models fail at systematic problem-solving and search?

When should a system decide to retrieve versus reason alone?

Do language models perform faithful symbolic reasoning independent of semantic grounding?

Why do format and structure matter more than actual content in reasoning?

How do transformer attention mechanisms implement memory and algorithmic functions?

How do retrieval heads enable chain-of-thought reasoning to reference earlier context?

Why do language models struggle with implicit discourse relations?

Does chain-of-thought prompting overcome implicit meaning deficits in text analysis?

Do base models contain latent reasoning that training can unlock?

Why does verification consistently lag behind AI generation?

When should verification steps be prioritized over progression steps?

How should models express uncertainty rather than forced confident answers?

What makes a first answer so often the best answer a model produces?

How does reasoning effort affect AI theory of mind performance?

Does chain-of-thought reasoning help or hurt social reasoning tasks?

Why do correct reasoning traces tend to be shorter than incorrect ones?

What properties determine whether reward signals teach genuine reasoning?

Do reward reasoning models with chain-of-thought reasoning evaluate prompts better?

How should iterative research systems allocate reasoning per search step?

What is the optimal balance between search rounds and reasoning depth per round?

When should retrieval-augmented systems decide to fetch new information?

Why do external feature triggers outperform uncertainty on complex questions?

Related concepts in this collection 4

Why do chain-of-thought examples fail across different conditions?
Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these brittleness dimensions could reveal what makes reasoning prompts robust or fragile.
extends brittleness from exemplar-level to instance-level; same prompt fails on different instances

When does explicit reasoning actually help model performance?
Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem?
information-flow mechanism explains why: logical tasks route through prompt mediation, judgment tasks are disrupted

When should an agent actually stop and deliberate?
How can models detect when deliberation over action choices is genuinely needed versus wasteful? This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors.
extends "when to think" to "how to think" for each instance

How much does demo position alone affect in-context learning accuracy?
Moving demonstrations from prompt start to end without changing their content produces surprisingly large accuracy swings. Does spatial position in the prompt matter more than what demonstrations actually contain?
another instance of prompt structure mattering more than content
