INQUIRING LINE

Can small demonstration sets unlock general reasoning without large question data?

This explores whether a handful of worked examples (or even one) can switch on broad reasoning ability, instead of needing huge labeled training sets — and what that implies about where reasoning actually lives.


The short version from the corpus: yes, surprisingly often, but only because the reasoning was already there, latent, waiting to be elicited rather than taught. The most striking data point is that a single carefully chosen training example in an RLVR (reinforcement learning with verifiable rewards) setup can lift math performance from 36% to 73.6%, and test accuracy keeps improving for 1,400 steps after training accuracy has reached 100% (Can a single training example unlock mathematical reasoning?). That doesn't look like learning a skill; it looks like flipping a switch.

The reason small sets work is the punchline of a broader thread: base models already contain the reasoning, and post-training mostly *selects* it rather than creating it. One synthesis found five independent mechanisms (RL steering, critique fine-tuning, decoding changes, sparse-autoencoder feature steering, and RLVR) all eliciting the same pre-existing capability, and concluded that the bottleneck is elicitation, not acquisition (Do base models already contain hidden reasoning ability?). You can even skip training entirely: four modular 'cognitive tools' implemented as sandboxed LLM calls pushed GPT-4.1 on AIME 2024 from 26.7% to 43.3% with no reinforcement learning at all (Can modular cognitive tools unlock reasoning without training?). If a prompt scaffold alone unlocks that much, the capability clearly wasn't being installed by data.
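One way to picture 'modular cognitive tools' is a small dispatcher in which each tool is its own isolated model call with its own prompt template. The sketch below is only an illustration of that idea, not the cited work's implementation: the tool names, the prompt wording, the fixed call order, and the `LLM` callable type are all assumptions, and a stub stands in for a real model.

```python
from typing import Callable, Dict

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]

# Illustrative prompt templates; tool names and wording are assumptions.
TOOL_PROMPTS: Dict[str, str] = {
    "understand": "Restate the problem and list its key quantities:\n{payload}",
    "recall": "Recall a similar solved problem and its method:\n{payload}",
    "examine": "Check this candidate solution for errors:\n{payload}",
    "backtrack": "Identify the faulty step and propose another path:\n{payload}",
}


def call_tool(llm: LLM, tool: str, payload: str) -> str:
    """Run one tool as an isolated call: it sees only its template and payload."""
    if tool not in TOOL_PROMPTS:
        raise KeyError(f"unknown tool: {tool}")
    return llm(TOOL_PROMPTS[tool].format(payload=payload))


def solve(llm: LLM, question: str) -> Dict[str, str]:
    """Fixed three-step pipeline, chosen here purely for illustration."""
    notes = {"understand": call_tool(llm, "understand", question)}
    notes["recall"] = call_tool(llm, "recall", notes["understand"])
    notes["examine"] = call_tool(llm, "examine", notes["recall"])
    return notes


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model: returns the first line of the prompt."""
    return prompt.splitlines()[0]


if __name__ == "__main__":
    for name, output in solve(stub_llm, "What is 12 * 13?").items():
        print(name, "->", output)
```

The point of the design is isolation: each call receives only what is passed to it, so one operation cannot silently leak into another, which is the property the note attributes to modularity.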

Where does the latent ability come from, if not from question sets? The corpus points back to pretraining itself. An analysis of five million pretraining documents found that reasoning draws on broad, transferable *procedural* knowledge spread across many sources (the 'how to do this kind of step'). Factual recall works the opposite way: it depends on narrowly memorizing the specific documents that contain the target fact (Does procedural knowledge drive reasoning more than factual retrieval?). That's why a tiny demonstration set can generalize: it's activating a procedure the model already absorbed, not teaching it from scratch. Quiet-STaR pushes this further by learning to generate rationales at every token while reading arbitrary internet text, so general reasoning emerges as a side effect of better language modeling rather than from any task-specific dataset (Can models learn reasoning from predicting any text?).
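The idea that a rationale can be judged by predictive accuracy rather than by a labeled answer can be sketched in a few lines. This is a simplified illustration in the spirit of Quiet-STaR, not its training code: the mean-over-samples baseline, the function names, and the toy probabilities are assumptions made for the example.

```python
import math
from typing import List


def sequence_logprob(token_probs: List[float]) -> float:
    """Log-likelihood of the true future tokens from per-token probabilities."""
    return sum(math.log(p) for p in token_probs)


def rationale_rewards(probs_per_rationale: List[List[float]]) -> List[float]:
    """Reward each sampled rationale by how much better than the average
    rationale it makes the model at predicting the text that actually follows."""
    logps = [sequence_logprob(p) for p in probs_per_rationale]
    baseline = sum(logps) / len(logps)
    return [lp - baseline for lp in logps]


if __name__ == "__main__":
    # Hypothetical probabilities assigned to the next three true tokens
    # after each of two sampled rationales.
    helpful = [0.6, 0.5, 0.7]
    unhelpful = [0.2, 0.1, 0.3]
    print(rationale_rewards([helpful, unhelpful]))
```

No answer key appears anywhere: the only supervision is the continuation of ordinary text, which is why this kind of signal needs no task-specific dataset.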

The sharp caveat is that 'unlock' and 'fake it' look identical on a benchmark, so small-data methods deserve suspicion about *what* they activate. Supervised fine-tuning can raise final-answer accuracy while degrading the quality of the reasoning steps (a 38.9% drop in Information Gain), producing correct answers via post-hoc rationalization that final-answer metrics never catch (Does supervised fine-tuning improve reasoning or just answers?). And chain-of-thought trained on narrow data degrades predictably the moment the task, length, or format shifts, leaving fluent reasoning form without valid underlying logic (Does chain-of-thought reasoning actually generalize beyond training data?). So small demonstration sets can genuinely elicit general reasoning, but they can just as easily elicit a convincing *imitation* of it; the difference only shows up off-distribution.
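A toy comparison makes the last point concrete. In the sketch below, a solver that applies the actual procedure and a solver that only replays memorized answers are indistinguishable on familiar inputs and separate only on a shifted split. Everything here is a deliberately artificial assumption (the string-reversal task, the word lists, and reusing the training items as the in-distribution split); it is not drawn from the cited experiments.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]
Solver = Callable[[str], str]


def reverse_string(text: str) -> str:
    """The underlying procedure: reverse the input string."""
    return text[::-1]


def make_memorizer(train: List[Example]) -> Solver:
    """An 'imitator' that can only replay answers it has already seen."""
    table: Dict[str, str] = dict(train)
    return lambda text: table.get(text, "")


def accuracy(solver: Solver, examples: List[Example]) -> float:
    """Fraction of examples the solver answers exactly right."""
    return sum(solver(x) == y for x, y in examples) / len(examples)


def labeled(words: List[str]) -> List[Example]:
    """Attach the correct answer to each input."""
    return [(w, reverse_string(w)) for w in words]


train = labeled(["ab", "cat", "dog", "xy"])
in_distribution = list(train)  # same short items the imitator saw
shifted = labeled(["abcdef", "reasoning", "elicit"])  # longer, unseen inputs

if __name__ == "__main__":
    memorizer = make_memorizer(train)
    for name, solver in [("procedure", reverse_string), ("memorizer", memorizer)]:
        print(name, accuracy(solver, in_distribution), accuracy(solver, shifted))
```

On the in-distribution split both solvers score 1.0; on the shifted split the procedure stays at 1.0 while the memorizer falls to 0.0, which is the gap a single-split benchmark cannot reveal.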

If you want to follow the thread further, two adjacent ideas loosen the data requirement from other angles. Verifier-free RL replaces answer-checking with the likelihood the model assigns to a reference answer, extending reasoning RL into general domains without rule-based or model-based verifiers (Can reasoning improvement work without answer verification?). Energy-based transformers reach System-2-style deliberation from unsupervised learning alone, generalizing out-of-distribution without any domain-specific scaffolding (Can energy minimization unlock reasoning without domain-specific training?). The common lesson across all of them: the expensive thing was never the question data. It was the pretraining that planted the capability the small set wakes up.
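The verifier-free idea reduces to one quantity: how probable the reference answer is once the model has produced a given reasoning trace. Here is a minimal sketch of that scoring step, assuming per-token log-probabilities of the reference answer are already available from the model; the function name, trace names, and numbers are hypothetical, and the full training loop is omitted.

```python
import math
from typing import Dict, List


def reference_answer_reward(answer_token_logprobs: List[float]) -> float:
    """Probability the model assigns to the entire reference answer,
    conditioned on the question and one sampled reasoning trace."""
    return math.exp(sum(answer_token_logprobs))


def score_traces(traces: Dict[str, List[float]]) -> Dict[str, float]:
    """Score every sampled trace; no rule-based answer checker is involved."""
    return {name: reference_answer_reward(lp) for name, lp in traces.items()}


if __name__ == "__main__":
    # Hypothetical log-probs of the reference answer's tokens under two traces.
    sampled = {
        "careful_trace": [-0.1, -0.2],
        "sloppy_trace": [-2.0, -1.5],
    }
    for name, reward in score_traces(sampled).items():
        print(f"{name}: {reward:.4f}")
```

A trace that makes the reference answer likely earns a reward near 1, and one that makes it unlikely earns a reward near 0, so the same number can serve as both the reward signal and a training weight, as the source note describes.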


Sources (9 notes)

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Can models learn reasoning from predicting any text?

Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Can reasoning improvement work without answer verification?

VeriFree bypasses answer verification entirely by using the conditional probability of reference answers given generated reasoning traces as both reward signal and training weight. This approach matches or surpasses verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA without rule-based or model-based verifiers.

Can energy minimization unlock reasoning without domain-specific training?

Energy-Based Transformers assign energy values to input-prediction pairs and use gradient descent minimization for inference, yielding 35% higher training scaling rates and 29% more inference-compute gains than Transformer++, while generalizing better on out-of-distribution data without domain-specific scaffolding.
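Inference by energy minimization can be shown on a toy problem. The sketch below refines a candidate prediction by gradient descent on an energy function. The quadratic energy, its hand-written gradient, and the step settings are assumptions for illustration; an Energy-Based Transformer learns its energy function rather than being handed one.

```python
def energy(x: float, y: float) -> float:
    """Toy energy (an assumption): low when the prediction y is close to 2 * x."""
    return (y - 2.0 * x) ** 2


def energy_grad_y(x: float, y: float) -> float:
    """Analytic gradient of the toy energy with respect to the prediction y."""
    return 2.0 * (y - 2.0 * x)


def infer(x: float, y_init: float = 0.0, step_size: float = 0.1, steps: int = 50) -> float:
    """Start from a guess and repeatedly step downhill on the energy."""
    y = y_init
    for _ in range(steps):
        y -= step_size * energy_grad_y(x, y)
    return y


if __name__ == "__main__":
    for steps in (1, 10, 50):
        y_hat = infer(3.0, steps=steps)
        print(steps, round(y_hat, 4), round(energy(3.0, y_hat), 6))
```

In this toy setting `steps` acts as a deliberation budget: more refinement steps move the prediction closer to the energy minimum, which loosely mirrors the note's claim about gains from extra inference compute.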

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI research analyst. The question: **Can small demonstration sets unlock general reasoning without large question data?** — treat this as still-open, and re-evaluate the evidence against the latest models, methods, and eval practices.

**What a curated library found — and when (dated claims, not current truth):**
Findings span Feb 2024–Oct 2025. Key constraints and findings:
- One carefully chosen training example in RLVR lifted math performance from 36% → 73.6%, improving test accuracy 1,400 steps past perfect train-set fit (~2025).
- Five distinct mechanisms (RL steering, critique fine-tuning, decoding, sparse-feature steering, RLVR) all elicit the same latent pre-existing capability; bottleneck is elicitation, not acquisition (~2025).
- Four modular cognitive tools (sandboxed LLM calls, no RL) pushed GPT-4.1 on AIME 2024 from 26.7% → 43.3% (~2025).
- Supervised fine-tuning can raise final-answer accuracy while degrading reasoning-step quality by ~39%; correct answers via post-hoc rationalization, not valid logic (~2025).
- Chain-of-thought trained on narrow data degrades predictably off-distribution (~2025).
- Procedural knowledge spread across pretraining drives generalization; reasoning is not task-specific memorization (~2024-11).
- Quiet-STaR learns rationale generation at token level during language modeling, not task-specific training (~2024-03).

**Anchor papers (verify; mind their dates):**
- arXiv:2504.20571 (2025-04): One Training Example sufficient in RLVR.
- arXiv:2510.07364 (2025-10): Base Models Know How to Reason, Thinking Models Learn When.
- arXiv:2411.12580 (2024-11): Procedural Knowledge in Pretraining Drives Reasoning.
- arXiv:2506.12115 (2025-06): Cognitive Tools for Reasoning.

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, assess whether recent models (GPT-4o, o1-series, Claude 3.5+), RL-at-scale practices, or refined evaluations (on-distribution vs. distribution-shift benchmarks) have relaxed or overturned the constraints. Separate the durable question ("Is reasoning latent or learned?") from perishable limitations ("Small sets fail on AIME"). Cite what resolved or confirmed each claim.
(2) Surface the **strongest contradicting or superseding work** from the last 6 months — especially any finding that small sets *do* require large pre-training, or that reasoning is not pre-latent but constructed during inference.
(3) Propose **two research questions** assuming the regime has shifted: e.g., "Do recent verifier-free RL approaches scale to domains where latent reasoning is weaker?" and "Can distribution-shift evaluations now distinguish genuine reasoning from convincing imitation in small-data regimes?"

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
