INQUIRING LINE


Getting better at one type of math doesn't always help with another — and knowing why changes everything.

Can mathematical reasoning improvements transfer across problem subdomains?

This explores whether a model that gets better at one kind of math problem actually carries that improvement into other problem areas — and the corpus suggests the answer depends entirely on whether the new area is bottlenecked by reasoning or by knowledge.

The collection splits the question into two very different cases. When the target area shares the same underlying structure, transfer is real and even cheap to trigger: a single training example can activate latent mathematical reasoning and keep improving test accuracy long after training accuracy has saturated (see "Can a single training example unlock mathematical reasoning?"). That only makes sense if the capability was already present and waiting to be switched on rather than taught from scratch. Training on formal-language scaffolds like Prolog and PDDL pushes this further, lifting logical reasoning, planning, and general reasoning together, but the gains concentrate on problems that are *structurally similar* to the prototypes (see "Do formal language prototypes improve reasoning across different domains?"). So 'subdomain' transfer works best when the subdomains are really the same shape underneath.

The sharp boundary appears when you cross into knowledge-heavy territory. Reasoning-distilled models fail to beat their base versions on medical tasks, because in medicine what limits performance is whether the model *knows the fact*, not whether it can reason cleanly about it, which is the opposite of math (see "Why doesn't mathematical reasoning transfer to medicine?"). There's a mechanistic story underneath: knowledge retrieval seems to live in lower network layers while reasoning adjustment happens in higher ones, so reasoning-focused training improves math but can degrade domains that depend on stored knowledge (see "Why does reasoning training help math but hurt medical tasks?"). The thing you improved and the thing the new domain needs aren't the same thing.

There's also a quieter reason 'transfer' often disappoints: a lot of what looks like reasoning is really proximity to the training distribution. Chain-of-thought degrades predictably once you shift the task, length, or format, producing fluent prose that's logically inconsistent (see "Does chain-of-thought reasoning actually generalize beyond training data?"). Even the length of a reasoning trace turns out to track how close a problem sits to training data rather than how hard it actually is (see "Does longer reasoning actually mean harder problems?"). And some headline benchmark gains are partly memorization rather than reasoning at all: a model can reconstruct more than half of a contaminated math benchmark from partial prompts yet score zero on a clean post-release one (see "Does RLVR success on math benchmarks reflect genuine reasoning improvement?"). If the 'improvement' was recall, there was never anything portable to transfer.

One hopeful counter-thread: you may not need the new domain to be verifiable for reasoning training to reach it. VeriFree replaces answer-checking with the likelihood of a reference answer given the reasoning trace, and matches verifier-based methods on broad knowledge benchmarks like MMLU-Pro and GPQA (see "Can reasoning improvement work without answer verification?"). That offers a route to extending reasoning RL into subdomains where you can't write a clean checker.

The thing you didn't know you wanted to know: transfer isn't blocked by distance between *topics* but by a mismatch in *what bottlenecks each topic*. Math-to-math transfers because both are reasoning-limited; math-to-medicine fails because medicine is knowledge-limited. So the real question isn't 'how far can reasoning travel,' it's 'is the destination even asking a reasoning question.'

Sources (8 notes)

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

Do formal language prototypes improve reasoning across different domains?

Training on Prolog and PDDL representations improved logical reasoning by 4.7%, planning by 6.3%, and general reasoning by 4.0%. Models exposed to prototype languages generalized better to structurally similar problems than natural language-only training.

Why doesn't mathematical reasoning transfer to medicine?

R1-distilled reasoning models fail to outperform base models on medical tasks because knowledge accuracy matters more than reasoning quality in medicine—the opposite of math. Fine-tuning cannot close this gap without domain-specific training data.

Why does reasoning training help math but hurt medical tasks?

Two-phase inference model shows knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.


Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.
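The partial-prompt check described in this note can be illustrated with a rough sketch: show a model the start of each benchmark problem and count how often it reproduces the rest verbatim. Everything named here is an assumption for illustration, not the cited study's actual protocol: the `complete` callable stands in for whatever model is being probed, the 60% prefix split is arbitrary, and verbatim prefix matching is only one possible matching criterion.

```python
from typing import Callable, Iterable


def partial_prompt_reconstruction_rate(
    problems: Iterable[str],
    complete: Callable[[str], str],  # hypothetical: maps a prompt prefix to model text
    prefix_fraction: float = 0.6,    # assumed split point, not taken from the source
) -> float:
    """Fraction of problems whose held-out tail the model reproduces verbatim."""
    hits = 0
    total = 0
    for text in problems:
        words = text.split()
        cut = int(len(words) * prefix_fraction)
        if cut == 0 or cut == len(words):
            continue  # too short to split into a prefix and a held-out tail
        prefix = " ".join(words[:cut])
        held_out = " ".join(words[cut:])
        # Normalize whitespace in the completion before comparing.
        completion = " ".join(complete(prefix).split())
        total += 1
        if completion.startswith(held_out):
            hits += 1
    return hits / total if total else 0.0
```

A high rate on an older benchmark alongside a near-zero rate on a post-release one would point toward recall rather than reasoning, which is the pattern the note reports.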

Can reasoning improvement work without answer verification?

VeriFree bypasses answer verification entirely by using the conditional probability of reference answers given generated reasoning traces as both reward signal and training weight. This approach matches or surpasses verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA without rule-based or model-based verifiers.
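As a minimal sketch of the idea in this note, the reward for a sampled reasoning trace can be taken as the probability the model assigns to the reference answer once that trace is in context. The function names, the input format (per-token log-probabilities that would come from a forward pass of the policy model), and the mean-baseline advantage are all assumptions made for illustration, not the VeriFree implementation.

```python
import math
from typing import List, Sequence, Tuple


def reference_answer_likelihood(token_logprobs: Sequence[float]) -> float:
    """p(reference answer | question, trace) from per-token log-probabilities.

    token_logprobs[t] is assumed to be log p(a_t | question, trace, a_<t)
    for each token a_t of the reference answer (hypothetical input).
    """
    return math.exp(sum(token_logprobs))


def score_traces(
    per_trace_logprobs: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Return (rewards, advantages) for several traces sampled for one question.

    The reward is the reference-answer likelihood, so no verifier is called.
    Subtracting the group mean is an assumed variance-reduction choice.
    """
    rewards = [reference_answer_likelihood(lp) for lp in per_trace_logprobs]
    baseline = sum(rewards) / len(rewards)
    advantages = [r - baseline for r in rewards]
    return rewards, advantages
```

Traces that make the reference answer more likely receive a positive advantage, so the signal needs only a reference answer and not a rule-based or model-based checker.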

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

LLMs can implicitly learn from mistakes in-context (2.51 match, arXiv)
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens (1.75 match, arXiv)
Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains (1.74 match, arXiv)
When More is Less: Understanding Chain-of-Thought Length in LLMs (1.71 match, arXiv)
Reinforcement Learning for Reasoning in Large Language Models with One Training Example (1.71 match, arXiv)
Spurious Rewards: Rethinking Training Signals in RLVR (1.69 match, arXiv)
Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains (1.69 match, arXiv)
Escaping the Verifier: Learning to Reason via Demonstrations (1.67 match, arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: **Can mathematical reasoning improvements transfer across problem subdomains?** A curated library (2025–02 through 2025–10) found:

**What a curated library found — and when (dated claims, not current truth):**
- Transfer works *within* structurally similar subdomains: a single training example activates latent reasoning and sustains gains post-saturation (2025–04), and formal-language scaffolds (Prolog, PDDL) lift reasoning broadly but only on structurally similar problems (2025–06).
- Reasoning-distilled models fail on knowledge-intensive domains (medicine): reasoning lives in higher network layers, knowledge in lower ones, so reasoning training can degrade knowledge-dependent tasks (2025–06, 2025–07).
- Chain-of-thought is distribution-bounded: reasoning trace length tracks proximity to training data, not problem difficulty (2025–09), and CoT degrades predictably when task, length, or format shift (2025–08).
- Benchmark gains on contaminated math sets are partly memorization, not portable reasoning (2025–07).
- VeriFree reasoning RL matches verifier-based methods on broad benchmarks (MMLU-Pro, GPQA) without domain-specific checkers, extending reasoning training beyond verifiable subdomains (2025–05).

**Anchor papers (verify; mind their dates):**
- 2025–06: arXiv:2506.02126 (Knowledge or Reasoning? A Close Look...)
- 2025–04: arXiv:2504.20571 (Reinforcement Learning for Reasoning... One Training Example)
- 2025–07: arXiv:2507.18178 (Decoupling Knowledge and Reasoning...)
- 2025–05: arXiv:2505.21493 (Reinforcing General Reasoning without Verifiers)

**Your task:**
(1) **RE-TEST THE BOTTLENECK THESIS.** The library's core claim: transfer fails not by topic distance but by *what bottlenecks each domain* (reasoning vs. knowledge). For reasoning-limited subdomains, check whether recent model scale, instruction tuning, or mixture-of-experts architectures have since blurred the layer-wise knowledge/reasoning boundary or enabled cross-domain reasoning activation without retraining. For knowledge-intensive domains, test whether retrieval-augmented generation or dynamic knowledge injection now permits reasoning RL to land effectively. Flag whether VeriFree (2025–05) or successor methods now close the verifiability gap.
(2) **Surface the strongest contradicting work in the last 6 months** that claims transfer *does* succeed across knowledge-heavy subdomains, or shows reasoning traces *are* robust to distribution shift, or reveals CoT length *is* a valid proxy for problem difficulty.
(3) **Propose 2 new research questions** that assume the regime may have shifted: (a) Can dynamic layer-wise adaptation (e.g., freezing knowledge layers while training reasoning) enable cross-domain transfer? (b) Does in-context exemplar selection from the target subdomain, rather than general reasoning training, better explain transfer success than reasoning capability itself?

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
