INQUIRING LINE

Asking AI to 'show its work' actually makes it worse at building rules from partial, incomplete examples.

How does inductive reasoning from partial evidence enable hypothesis formation?

This explores how models build general rules or hypotheses from incomplete examples — and what the corpus reveals about why that move so often breaks down.

The corpus has a pointed answer: the machinery that is supposed to enable this kind of induction often sabotages it. The cleanest finding is also the most deflating. When models are tested on inferring rules from partial evidence, especially rules with exceptions, explicit reasoning models do *worse* than plain ones, scoring under 25% on exception rules versus 55–65% for non-reasoning models (Why do reasoning models fail at exception-based rule inference?). Chain-of-thought, the very thing meant to help, introduces overgeneralization, hallucinated constraints, and math overuse that amplify mistakes precisely where negative evidence matters. So forming a hypothesis from partial data isn't bottlenecked by a lack of deliberation; verbalized reasoning can actively corrupt it.

Why? Two notes point at the same root. Models lean on *semantic* association rather than symbolic manipulation: decouple the meaning from the logical structure and performance collapses, even when the correct rule sits right there in context (Do large language models reason symbolically or semantically?). And on entailment specifically, predictions track whether a hypothesis was *attested* in training data rather than whether the premise actually supports it: feed a random premise and the model still affirms a familiar hypothesis (Do LLMs predict entailment based on what they memorized?). Read together, these say genuine induction from partial evidence is being shortcut by memory: the model recognizes a plausible conclusion instead of constructing one from the evidence in front of it.
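The random-premise check described above can be sketched in a few lines. This is only a toy illustration, not the protocol from the cited work: the sentence pairs, the `ATTESTED` set, and the two stand-in predictors (`memorizer`, `grounded`) are all hypothetical, and a real probe would call an actual model in place of them.

```python
import random
from typing import Callable, List, Set, Tuple

Pair = Tuple[str, str]  # (premise, hypothesis)


def affirm_rate(predict: Callable[[str, str], bool], pairs: List[Pair]) -> float:
    """Fraction of pairs the predictor labels as entailment."""
    return sum(predict(p, h) for p, h in pairs) / len(pairs)


def swap_premises(pairs: List[Pair], seed: int = 0) -> List[Pair]:
    """Give each hypothesis a premise drawn from a *different* item."""
    rng = random.Random(seed)
    premises = [p for p, _ in pairs]
    swapped = []
    for i, (_, hypothesis) in enumerate(pairs):
        j = rng.choice([k for k in range(len(pairs)) if k != i])
        swapped.append((premises[j], hypothesis))
    return swapped


# Hypothetical toy data: every original pair is a true entailment.
PAIRS: List[Pair] = [
    ("The cat sat on the mat", "An animal is on the mat"),
    ("Marie Curie won two Nobel Prizes", "Marie Curie won a Nobel Prize"),
    ("Water boils at 100 C at sea level", "Water can boil"),
]
GOLD: Set[Pair] = set(PAIRS)
# Assumption: pretend the first two hypotheses were seen during training.
ATTESTED: Set[str] = {h for _, h in PAIRS[:2]}


def memorizer(premise: str, hypothesis: str) -> bool:
    """Stand-in for attestation bias: ignores the premise entirely."""
    return hypothesis in ATTESTED


def grounded(premise: str, hypothesis: str) -> bool:
    """Stand-in for premise-sensitive inference (a lookup, for illustration)."""
    return (premise, hypothesis) in GOLD


SWAPPED = swap_premises(PAIRS)

for name, predictor in [("memorizer", memorizer), ("grounded", grounded)]:
    print(name, affirm_rate(predictor, PAIRS), affirm_rate(predictor, SWAPPED))
```

The signal to look for is the gap between the two rates: a predictor whose affirm rate barely moves when premises are shuffled is responding to the hypothesis alone, which is the pattern the attestation-bias note describes.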

The more hopeful thread reframes where hypothesis-forming ability actually lives. It isn't manufactured by post-training; it's *elicited*. Base models already carry latent reasoning that minimal intervention unlocks (Do base models already contain hidden reasoning ability?), and RL post-training mostly teaches *when* to deploy reasoning, not how (Does RL post-training create reasoning or just deploy it?). What generalizes that capacity is procedural knowledge absorbed broadly across pretraining documents, not narrow factual recall (Does procedural knowledge drive reasoning more than factual retrieval?). That fits the inductive picture nicely: forming hypotheses is a transferable *procedure*, and you can even train it as a side effect of predicting arbitrary text (Can models learn reasoning from predicting any text?).

The most interesting turn is what the corpus says about *holding* a hypothesis before committing to it, the real signature of inductive reasoning under uncertainty. Deterministic reasoners collapse to a single guess; making latent transitions stochastic lets a model represent a *distribution* over solutions and keep competing hypotheses alive (Can stochastic latent reasoning let models explore multiple solutions?). And exploration matters more than depth: generating diverse abstractions forces breadth-first hypothesis search and avoids the 'underthinking' trap of drilling down one chain too early (Can abstractions guide exploration better than depth alone?). The quiet lesson here is that good induction looks less like a longer explanation and more like entertaining several candidate rules at once.
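A toy contrast may make "entertaining several candidate rules at once" concrete. The sketch below is not GRAM or RLAD; it is a plain hypothesis-elimination loop over a hand-written, hypothetical rule set (`CANDIDATES`), compared against a reasoner that commits after the first example. The evidence stream includes one negative example, the kind of exception case the corpus says models handle worst.

```python
from typing import Callable, Dict, List, Tuple

Rule = Callable[[int], bool]
Example = Tuple[int, bool]  # (input, label observed so far)

# Hypothetical candidate rules, including one with an exception.
CANDIDATES: Dict[str, Rule] = {
    "even": lambda x: x % 2 == 0,
    "multiple_of_4": lambda x: x % 4 == 0,
    "even_except_6": lambda x: x % 2 == 0 and x != 6,
    "greater_than_3": lambda x: x > 3,
}


def consistent(rule: Rule, examples: List[Example]) -> bool:
    """True if the rule reproduces every label seen so far."""
    return all(rule(x) == label for x, label in examples)


def surviving(examples: List[Example]) -> List[str]:
    """Breadth-first view: every candidate not yet contradicted."""
    return [name for name, rule in CANDIDATES.items() if consistent(rule, examples)]


def commit_early(examples: List[Example]) -> str:
    """Depth-first view: lock in the first rule that fits the first example."""
    first = examples[:1]
    return next(name for name, rule in CANDIDATES.items() if consistent(rule, first))


# Partial evidence arriving one example at a time; (6, False) is the exception.
EVIDENCE: List[Example] = [(4, True), (8, True), (6, False), (2, True)]

for step in range(1, len(EVIDENCE) + 1):
    print(step, surviving(EVIDENCE[:step]))

early = commit_early(EVIDENCE)
print("early commit:", early, "still fits:", consistent(CANDIDATES[early], EVIDENCE))
```

With this evidence, the positive examples alone cannot separate the four candidates; only the negative example and the later positive one narrow the set. The early-commit reasoner never sees that narrowing because it stopped considering alternatives after the first observation.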

So the thing you didn't know you wanted to know: forming hypotheses from partial evidence is less about reasoning *harder* and more about two things the dominant chain-of-thought paradigm gets backwards — keeping uncertainty open instead of prematurely verbalizing one answer, and resisting the pull to affirm whatever conclusion already looks familiar.

Sources 9 notes

Why do reasoning models fail at exception-based rule inference?

Across four game-based tasks, reasoning models scored below 25% on exception rules versus 55–65% for non-reasoning models. Chain-of-thought introduces math overuse, overgeneralization, and hallucinated constraints that amplify errors in negative evidence recognition.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do LLMs predict entailment based on what they memorized?

McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random-premise experiments show models maintain high entailment predictions when hypotheses are attested, indicating that they respond to memorized propositions rather than premise-hypothesis relationships.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall, which depends on narrow, document-specific memorization of target facts.

Can models learn reasoning from predicting any text?

Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.

Can stochastic latent reasoning let models explore multiple solutions?

GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent probability distributions over solutions rather than single points. This lets recursive reasoners maintain uncertainty, explore alternatives, and handle ambiguous or multi-solution problems that deterministic single-path designs cannot.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about how LLMs form hypotheses from partial evidence. The question remains open: what cognitive machinery actually supports inductive reasoning under incomplete data?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. A library of ~12 papers reports:
• Explicit reasoning (chain-of-thought) *reduces* performance on exception-rule inference: <25% vs. 55–65% for non-reasoning baselines (2025).
• Models rely on semantic association and hypothesis attestation in training data rather than logical entailment from premises (~2024–2025).
• Base models possess latent reasoning; RL post-training teaches *deployment*, not acquisition of reasoning (2025).
• Stochastic latent transitions let models hold multiple competing hypotheses; deterministic reasoning collapses to single guesses (2025).
• Breadth-first abstraction exploration outperforms depth-first chain-of-thought on inductive tasks (2025).

Anchor papers (verify; mind their dates):
• arXiv:2505.24225 (2025) – Reasoning Can Hurt Inductive Abilities
• arXiv:2305.14825 (2023) – Semantic vs. Symbolic Reasoning
• arXiv:2411.12580 (2024) – Procedural Knowledge Drives Reasoning
• arXiv:2505.20296 (2025) – Reasoning as Solution Exploration

Your task:
(1) RE-TEST THE CONTRADICTION: Does newer model scale, in-context learning harnesses (e.g., structured prompts, retrieval-augmented setups), or reasoning-specific fine-tuning (e.g., RL on inductive tasks post-2025) *relax* the finding that explicit reasoning hurts inductive inference? Separately: does steering or activation-space intervention (2025 work) recover entailment without verbalization? Identify what still appears to hold vs. what newer methods have dissolved.
(2) Surface the strongest *disagreement*: find work claiming explicit reasoning (including chain-of-thought variants, tree search, or tool use) *does* improve partial-evidence hypothesis formation, published in the last 6 months. Reconcile or explain the contradiction.
(3) Propose two research questions that assume the regime may have shifted: (a) Can models learn to *suppress* premature verbalization while keeping reasoning latent? (b) Does inductive reasoning require a distinct training objective (e.g., predicting rule boundaries) rather than language modeling?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
