How does inductive reasoning from partial evidence enable hypothesis formation?
This explores how models build general rules or hypotheses from incomplete examples — and what the corpus reveals about why that move so often breaks down.
The corpus has a pointed answer: the machinery that's supposed to enable hypothesis formation often sabotages it. The cleanest finding is also the most deflating. When you actually test models on inferring rules from partial evidence, especially rules with exceptions, explicit reasoning models do *worse* than plain ones, scoring under 25% on exception rules versus 55–65% for non-reasoning models (see "Why do reasoning models fail at exception-based rule inference?"). Chain-of-thought, the very thing meant to help, introduces overgeneralization, hallucinated constraints, and math overuse that amplify mistakes precisely where negative evidence matters. So forming a hypothesis from partial data isn't bottlenecked by too little deliberation: verbalized reasoning can actively corrupt it.
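To make "negative evidence" concrete, here is a minimal sketch of rule inference by elimination. It is a toy built for illustration, not the game-based tasks from the cited note: the candidate rules, the `CANDIDATES` table, and the `consistent_rules` helper are all assumptions introduced here.

```python
# Toy rule inference over integers. Every name below is hypothetical and
# exists only to illustrate why a single negative example matters.
CANDIDATES = {
    "even": lambda n: n % 2 == 0,
    "even except multiples of 6": lambda n: n % 2 == 0 and n % 6 != 0,
    "multiple of 4": lambda n: n % 4 == 0,
}


def consistent_rules(examples):
    """Return the names of candidate rules that agree with every labeled example."""
    return [
        name
        for name, rule in CANDIDATES.items()
        if all(rule(n) == label for n, label in examples)
    ]


# Positive examples alone cannot separate the general rule from the exception rule.
positives_only = [(2, True), (4, True), (8, True)]

# One negative example is enough to eliminate the overgeneralized rule.
with_negative = positives_only + [(12, False)]

print(consistent_rules(positives_only))  # ['even', 'even except multiples of 6']
print(consistent_rules(with_negative))   # ['even except multiples of 6']
```

A learner that settles on "even" after the three positives has overgeneralized in exactly the sense described above; only the negative example reveals the exception.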
Why? Two notes point at the same root. Models lean on *semantic* association rather than symbolic manipulation: decouple the meaning from the logical structure and performance collapses even when the correct rule sits right there in context (see "Do large language models reason symbolically or semantically?"). And on entailment specifically, predictions track whether a hypothesis was *attested* in training data rather than whether the premise actually supports it: feed a random premise and the model still affirms a familiar hypothesis (see "Do LLMs predict entailment based on what they memorized?"). Read together, these say genuine induction-from-partial-evidence is being shortcut by memory: the model recognizes a plausible conclusion instead of constructing one from the evidence in front of it.
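The random-premise check described above can be sketched as a small control experiment. This is a hedged illustration, not the protocol from the cited note: `random_premise_affirmation_rate`, the two stub predictors, and the example sentences are all assumptions made up here, and the stubs stand in for a real model.

```python
import random


def random_premise_affirmation_rate(predict, pairs, premise_pool, seed=0):
    """Fraction of hypotheses still judged entailed after the original
    premise is swapped for a different one drawn from premise_pool."""
    rng = random.Random(seed)
    affirmed = 0
    for premise, hypothesis in pairs:
        # Only consider premises other than the one that supports the hypothesis.
        others = [p for p in premise_pool if p != premise]
        affirmed += bool(predict(rng.choice(others), hypothesis))
    return affirmed / len(pairs)


# Hypothetical data: each hypothesis follows from its own premise only.
PAIRS = [
    ("The cat sat on the mat.", "An animal is on the mat."),
    ("Ada bought a bicycle.", "Ada owns a bicycle."),
    ("It rained all night in Oslo.", "Oslo got wet."),
]
PREMISES = [premise for premise, _ in PAIRS]
ATTESTED = {hypothesis for _, hypothesis in PAIRS}
GOLD = set(PAIRS)


def memory_driven(premise, hypothesis):
    """Stub predictor that ignores the premise and affirms familiar hypotheses."""
    return hypothesis in ATTESTED


def premise_sensitive(premise, hypothesis):
    """Stub predictor that affirms only when this premise supports the hypothesis."""
    return (premise, hypothesis) in GOLD


print(random_premise_affirmation_rate(memory_driven, PAIRS, PREMISES))      # 1.0
print(random_premise_affirmation_rate(premise_sensitive, PAIRS, PREMISES))  # 0.0
```

A high affirmation rate under swapped premises is the signature of recognition over construction: the prediction did not depend on the evidence it was given.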
The more hopeful thread reframes where hypothesis-forming ability actually lives. It isn't manufactured by training; it's *elicited*. Base models already carry latent reasoning that minimal intervention unlocks (see "Do base models already contain hidden reasoning ability?"), and RL post-training mostly teaches *when* to deploy reasoning, not how (see "Does RL post-training create reasoning or just deploy it?"). What generalizes that capacity is procedural knowledge absorbed broadly across pretraining documents, not narrow factual recall (see "Does procedural knowledge drive reasoning more than factual retrieval?"). That fits the inductive picture nicely: forming hypotheses is a transferable *procedure*, and you can even train it as a side effect of predicting arbitrary text (see "Can models learn reasoning from predicting any text?").
The most interesting turn is what the corpus says about *holding* a hypothesis before committing to it, the real signature of inductive reasoning under uncertainty. Deterministic reasoners collapse to a single guess; making latent transitions stochastic lets a model represent a *distribution* over solutions and keep competing hypotheses alive (see "Can stochastic latent reasoning help models explore multiple solutions?"). And exploration matters more than depth: generating diverse abstractions forces breadth-first hypothesis search and avoids the 'underthinking' trap of drilling one chain too early (see "Can abstractions guide exploration better than depth alone?"). The quiet lesson here is that good induction looks less like a longer explanation and more like entertaining several candidate rules at once.
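One way to picture "entertaining several candidate rules at once" is a plain Bayesian update over a small hypothesis set. This is a textbook-style sketch and not the GRAM or RLAD method from the notes: the rule set, the 1 to 20 universe, the uniform prior, and the likelihood (each rule is assumed to generate its positive examples uniformly from its own extension) are all assumptions chosen for illustration.

```python
# Hypothetical candidate rules over the integers 1..20.
UNIVERSE = range(1, 21)
RULES = {
    "even": lambda n: n % 2 == 0,
    "multiple of 4": lambda n: n % 4 == 0,
    "power of 2": lambda n: (n & (n - 1)) == 0,
}


def update(belief, n):
    """One Bayes step: reweight each rule by how likely it was to produce n."""
    weighted = {}
    for name, weight in belief.items():
        extension = [m for m in UNIVERSE if RULES[name](m)]
        # Assumed likelihood: uniform over the rule's extension, zero outside it.
        likelihood = 1 / len(extension) if n in extension else 0.0
        weighted[name] = weight * likelihood
    total = sum(weighted.values())
    if total == 0:
        raise ValueError("no candidate rule explains the observation")
    return {name: weight / total for name, weight in weighted.items()}


belief = {name: 1 / len(RULES) for name in RULES}  # uniform prior
for observed in (4, 8):
    belief = update(belief, observed)
# After 4 and 8, "multiple of 4" and "power of 2" are tied; neither is dropped.
print({name: round(p, 3) for name, p in belief.items()})

belief = update(belief, 12)
# 12 is not a power of 2, so that rule falls to zero and "multiple of 4" leads.
print({name: round(p, 3) for name, p in belief.items()})
```

A reasoner that had committed to "power of 2" after the first two observations would have nothing to fall back on when 12 arrives; keeping the distribution makes the later evidence usable.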
So the thing you didn't know you wanted to know: forming hypotheses from partial evidence is less about reasoning *harder* and more about two things the dominant chain-of-thought paradigm gets backwards — keeping uncertainty open instead of prematurely verbalizing one answer, and resisting the pull to affirm whatever conclusion already looks familiar.
Sources (9 notes)
Across four game-based tasks, reasoning models scored below 25% on exception rules versus 55–65% for non-reasoning models. Chain-of-thought introduces math overuse, overgeneralization, and hallucinated constraints that amplify errors in negative evidence recognition.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random premise experiments show models maintain high entailment predictions when hypotheses are attested, indicating that they respond to memorized propositions rather than premise-hypothesis relationships.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.
GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.