INQUIRING LINE

Can models learn to select exemplars based on reasoning skills rather than complexity?

This explores whether a model can pick which training or in-context examples to learn from based on the *reasoning skill* they exercise — the procedures, decision points, and forms of inference — rather than by surface difficulty or problem length.


The corpus suggests the question rests on a deeper premise: that "complexity" is the wrong axis entirely. One finding shows that reasoning models don't break down at complexity thresholds; they break down at *instance unfamiliarity*. Models fit instance-level patterns rather than general algorithms, so a long, hairy reasoning chain succeeds if the model has seen similar instances, while a short, simple one fails if it hasn't (Do language models fail at reasoning due to complexity or novelty?). If failure tracks novelty and not difficulty, then selecting exemplars by complexity optimizes the wrong signal in the first place.

So what *is* the load-bearing property of a good exemplar? Several notes converge on a surprising answer: it's the *form* and *procedure* of reasoning, not its correctness or rigor. Illogical chain-of-thought exemplars matched valid ones on BIG-Bench Hard, because the model learns the structural shape of reasoning rather than genuine inference (Does logical validity actually drive chain-of-thought gains?). Models trained on systematically irrelevant traces keep their solution accuracy and sometimes generalize better out of distribution, so the traces act as computational scaffolding rather than meaningful steps (Do reasoning traces need to be semantically correct?). And what generalizes across reasoning isn't memorized facts but transferable *procedural knowledge* drawn from diverse pretraining sources (Does procedural knowledge drive reasoning more than factual retrieval?). Together these suggest that if you wanted to select exemplars by "reasoning skill," the skill that matters is procedural structure, and it can be present even when the example is wrong.

The corpus also shows models *can* learn to route and select based on reasoning demands, not difficulty labels. Thinkless trains a single model to decide when to engage extended thinking versus answer directly, using a decoupled RL objective (DeGRPO) that learns this self-calibrated routing *without* explicit difficulty labels (Can models learn when to think versus respond quickly?). That's a model selecting based on the reasoning a problem requires rather than a complexity score. In a related vein, the critical learning signal lives in a small minority of high-entropy "forking" tokens, the pivotal decision points, and training on just that ~20% matches or exceeds full-gradient updates (Do high-entropy tokens drive reasoning model improvements?). Reasoning skill, it turns out, is concentrated in specific decision moments, not spread evenly across a problem's surface complexity.
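To make the "forking token" idea concrete, here is a minimal sketch of how one might flag the highest-entropy positions in a sequence. It is an illustration only: the function names, the toy distributions, and the top-fraction cutoff rule are assumptions, not the procedure used in the cited work, and a real setup would read the distributions from a model's logits.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def forking_token_mask(distributions, keep_fraction=0.2):
    """Return a 0/1 mask that keeps the highest-entropy positions.

    keep_fraction=0.2 mirrors the ~20% figure quoted above; the
    cutoff rule itself is an assumption made for illustration.
    """
    entropies = [token_entropy(d) for d in distributions]
    n_keep = max(1, round(keep_fraction * len(entropies)))
    # Indices of the n_keep positions with the largest entropy.
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(entropies))]

# Toy data (hypothetical): five positions over a 4-token vocabulary.
toy_distributions = [
    [0.97, 0.01, 0.01, 0.01],      # confident continuation
    [0.25, 0.25, 0.25, 0.25],      # maximally uncertain: a "fork"
    [0.90, 0.05, 0.03, 0.02],
    [0.85, 0.05, 0.05, 0.05],
    [0.99, 0.005, 0.003, 0.002],
]
mask = forking_token_mask(toy_distributions)
print(mask)
```

In a training loop, a mask like this would multiply the per-token loss so that only the flagged positions contribute gradient; that wiring is left out here because it depends on the framework in use.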

There's a deeper reason skill-based selection should work: the reasoning is often already latent. Five independent mechanisms all elicit reasoning that base models already contain; post-training *selects* it rather than creating it (Do base models already contain hidden reasoning ability?). A single well-chosen RLVR example can lift math accuracy from 36% to 73.6%, and test accuracy keeps improving for about 1,400 steps after training accuracy reaches 100% (Can a single training example unlock mathematical reasoning?). If one example can activate a whole capability, then *which* example you select matters enormously, and the activating property is the reasoning behavior it triggers, not how hard it looks.

The sharpest practical lesson comes from work on argument quality and question-asking: models trained on labeled examples alone learn surface patterns, not principled criteria, and only generalize when quality is *decomposed into explicit, theory-grounded attributes* (Can models learn argument quality from labeled examples alone?; Can models learn to ask genuinely useful clarifying questions?). The implication for your question is direct: "reasoning skill" isn't a single scalar a model can sort by. It has to be broken into named sub-skills (the procedure used, the decision points exercised, the instance patterns covered) before selection becomes tractable. Complexity is one number and easy to sort by; reasoning skill is a structured object, which is exactly why decomposition keeps showing up as the thing that makes it learnable.
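The contrast between sorting by one number and selecting over a structured object can be sketched in a few lines. Everything below is hypothetical: the `Exemplar` fields, the skill tags, and the greedy coverage rule are assumptions chosen for illustration, not a method taken from any of the cited notes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Exemplar:
    name: str
    length: int          # stand-in for surface complexity
    skills: frozenset    # hypothetical sub-skill tags

def select_by_complexity(pool, k):
    """Baseline: sort by a single scalar and take the k longest."""
    return sorted(pool, key=lambda e: e.length, reverse=True)[:k]

def select_by_skill_coverage(pool, required, k):
    """Greedy coverage: repeatedly pick the exemplar that adds the most
    still-uncovered required skills; ties go to the shorter exemplar."""
    uncovered = set(required)
    remaining = list(pool)
    chosen = []
    while uncovered and remaining and len(chosen) < k:
        best = max(remaining,
                   key=lambda e: (len(e.skills & uncovered), -e.length))
        if not best.skills & uncovered:
            break  # nothing left in the pool covers a missing skill
        chosen.append(best)
        uncovered -= best.skills
        remaining.remove(best)
    return chosen

# Toy pool (hypothetical names and tags).
pool = [
    Exemplar("long_arithmetic", 900, frozenset({"carry"})),
    Exemplar("long_arithmetic_2", 850, frozenset({"carry"})),
    Exemplar("short_case_split", 120, frozenset({"case_analysis", "backtrack"})),
    Exemplar("short_induction", 150, frozenset({"induction"})),
]
required = {"carry", "case_analysis", "induction"}

by_complexity = [e.name for e in select_by_complexity(pool, 2)]
by_skill = [e.name for e in select_by_skill_coverage(pool, required, 3)]
print(by_complexity)  # two long exemplars that exercise the same skill
print(by_skill)       # three exemplars that together cover all required skills
```

The complexity baseline returns two long exemplars that exercise the same single skill, while the coverage rule spends its budget on distinct sub-skills. The hard part in practice is the tagging step that this sketch takes as given.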


Sources (10 notes)

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.

Can models learn to ask genuinely useful clarifying questions?

The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-capability researcher re-testing whether models can select exemplars by *reasoning skill* rather than complexity—a question that may have shifted since early 2025. A curated library (spanning 2023–2026) found:

**What a curated library found — and when (dated claims, not current truth):**
- Reasoning failures track *instance-level unfamiliarity*, not task complexity; complexity-based exemplar selection optimizes the wrong signal (~2026).
- Logically invalid and deliberately corrupted reasoning traces teach as well as correct ones; models learn procedural *form*, not semantic validity (~2023–2024).
- Procedural knowledge in pretraining—not memorized facts—drives reasoning generalization across domains (~2024–2025).
- Models can learn to route between extended thinking and direct response using RL without explicit difficulty labels; skill-based routing is learnable (~2025).
- High-entropy "forking" tokens (pivotal decision points) concentrate learning signal; ~20% of tokens drive reasoning capability (~2025).
- One well-chosen RLVR example can lift math accuracy from 36% to 73.6% and sustain generalization >1000 steps past saturation (~2025).
- "Reasoning skill" only becomes selectable once decomposed into explicit, theory-grounded sub-attributes; monolithic quality scores don't generalize (~2024–2025).

**Anchor papers (verify; mind their dates):**
- arXiv:2307.10573 (2023): Invalid Logic, Equivalent Gains
- arXiv:2411.12580 (2024): Procedural Knowledge in Pretraining
- arXiv:2505.13379 (2025): Thinkless—skill-based routing
- arXiv:2504.20571 (2025): One Training Example suffices

**Your task:**
(1) RE-TEST each constraint. For instance-level unfamiliarity: do scaling, synthetic data generation, or few-shot adaptation now override novelty effects? For form-vs-validity: have recent models begun preferring *correct* traces in selection, or do invalid examples still teach comparably? For the ~20% critical tokens: does attention analysis or recent RL post-training refine *which* tokens matter? Separate the durable question (what *property* makes an exemplar activating?) from the perishable claim (that validity is irrelevant). Cite what relaxed it.
(2) Surface the strongest work from the last 6 months contradicting skill-based selection—e.g., evidence that complexity-aware, difficulty-scaled curricula outperform procedurally-grounded selection.
(3) Propose 2 research questions assuming the regime may have moved: (a) Can models learn to *decompose and weight* reasoning sub-skills dynamically per problem, or is the decomposition fixed? (b) Does skill-based exemplar selection transfer across model scale / architecture, or is it regime-specific?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
