INQUIRING LINE


When you give an AI a calculator, its supposed reasoning failures often disappear — so are we measuring thinking or just text generation?

What kinds of reasoning tasks reveal the ceiling of text-only training?

This explores which reasoning tasks expose the limits of training on text alone — and, crucially, whether those limits are a real ceiling on reasoning or an artifact of how we test and how language itself works.

The first thing the corpus does is split the question in two: is the ceiling in the model's reasoning, or in the text-only setup we measure it through? Several notes argue that the dramatic 'reasoning cliffs' are largely illusions of evaluation. When models are confined to generating answers as text, they fail at long multi-step procedures even when they demonstrably know the algorithm; give them tool access and the cliff disappears (notes: Does the reasoning cliff depend on how we test models?; Are reasoning model collapses really failures of reasoning?). The tasks that expose this are the ones demanding sustained procedural execution at scale: arithmetic over many steps, simulating a process, anything where the model must hold state and grind through it. The bottleneck there is execution bandwidth, not intelligence.
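The gap between knowing an algorithm and writing out its execution can be made concrete with a small sketch. Tower of Hanoi is used here only as an illustrative procedure (the notes above do not name a specific task), and `TOKENS_PER_MOVE` and `OUTPUT_BUDGET` are hypothetical numbers, not measurements of any model.

```python
# Illustrative sketch: the algorithm is a few lines long, but its written-out
# execution grows as 2**n - 1 moves. All constants below are assumptions.

TOKENS_PER_MOVE = 10     # hypothetical cost of writing one move as text
OUTPUT_BUDGET = 64_000   # hypothetical cap on generated tokens


def hanoi_moves(n, src="A", dst="C", aux="B"):
    """Yield every (from_peg, to_peg) move needed to shift n disks from src to dst."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, src, aux, dst)
    yield (src, dst)
    yield from hanoi_moves(n - 1, aux, dst, src)


def fits_in_text(n, tokens_per_move=TOKENS_PER_MOVE, budget=OUTPUT_BUDGET):
    """Return True if the full move list for n disks fits in the output budget."""
    return (2**n - 1) * tokens_per_move <= budget


def largest_text_only_n(max_n=30):
    """Largest disk count whose full trace fits the budget (searching 1..max_n)."""
    return max(n for n in range(1, max_n + 1) if fits_in_text(n))
```

Under these assumed numbers, `largest_text_only_n()` returns 12: a text-only answer runs out of room at 13 disks even though the same short recursion solves any size when it is executed by a tool. The apparent cliff sits wherever the budget happens to be, which is the sense in which the notes call it an execution limit rather than a reasoning limit.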

But a second cluster of notes points at a ceiling that tools can't fix. Physical, geometric, and causal reasoning tasks break because text itself is a lossy abstraction: it strips out the physics and spatial dynamics of the world, so a text-only model is manipulating symbols cut off from what they refer to (note: Are text-only language models fundamentally limited by abstraction?). This is the genuinely architectural ceiling: not 'the model ran out of room to compute' but 'the training signal never contained the information.'

The sharpest diagnostic tasks are the ones that decouple *form* from *logic*. When you keep the reasoning structure but swap out familiar semantic content (nonsense tokens, out-of-distribution formats, unusual lengths), accuracy collapses, revealing that chain-of-thought is often reproducing learned reasoning *patterns* rather than performing fresh symbolic inference (notes: Does chain-of-thought reasoning reveal genuine inference or pattern matching?; Does chain-of-thought reasoning actually generalize beyond training data?). The same point shows up as the symbolic/semantic split: give a model correct logical rules but strip the commonsense semantics, and it can't apply them; it reasons by association, not by formal manipulation (note: Do large language models reason symbolically or semantically?). Even something as mundane as input length is a stress test: reasoning accuracy drops from 92% to 68% with a few thousand tokens of irrelevant padding, far below the context limit, in a way unrelated to language-modeling skill (note: Does reasoning ability actually degrade with longer inputs?).
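Both stress tests described above are simple to construct, which is part of why they are useful diagnostics. The sketch below is a minimal illustration, not the procedure used in the cited work: `decouple_semantics` and `pad_prompt` are hypothetical helper names, the nonsense-token prefix is arbitrary, and whitespace splitting stands in for a real tokenizer.

```python
import re
from itertools import cycle, islice


def decouple_semantics(text, content_words, prefix="zorp"):
    """Swap each content word for a fixed nonsense token.

    The logical structure of the problem is unchanged; only the familiar
    semantics are removed. Returns the rewritten text and the mapping used.
    """
    mapping = {word: f"{prefix}{i}" for i, word in enumerate(content_words)}
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, content_words)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping


def pad_prompt(question, filler, target_tokens):
    """Prepend irrelevant filler until the prompt has target_tokens words.

    Word count via str.split() is a crude stand-in for a tokenizer (assumption).
    """
    q_tokens = question.split()
    f_tokens = filler.split()
    needed = max(0, target_tokens - len(q_tokens))
    if needed and not f_tokens:
        raise ValueError("filler must contain at least one token")
    padding = list(islice(cycle(f_tokens), needed))
    return " ".join(padding + q_tokens)


problem = "All sparrows are birds. All birds are animals. Are all sparrows animals?"
abstract, mapping = decouple_semantics(problem, ["sparrows", "birds", "animals"])
padded = pad_prompt(problem, "The weather was mild that day.", 3000)
```

Here `abstract` becomes "All zorp0 are zorp1. All zorp1 are zorp2. Are all zorp0 zorp2?", the same syllogism with nothing familiar to associate on, and `padded` is the original question preceded by irrelevant text up to 3000 words. Comparing a model's accuracy on `problem`, `abstract`, and `padded` separates what it infers from what it recognizes.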

What makes this interesting is that the corpus also argues the raw capability is usually *already there* and the ceiling is about access, not absence. Reasoning generalizes from broad procedural knowledge spread across pretraining documents rather than from memorized facts (note: Does procedural knowledge drive reasoning more than factual retrieval?), and base models turn out to hold latent reasoning that minimal training merely selects and unlocks (note: Do base models already contain hidden reasoning ability?). That reframes the whole question: a lot of what looks like a training ceiling is an *elicitation* ceiling.

If you want to push the boundary rather than just diagnose it, the corpus offers three different escape routes worth knowing about. Quiet-STaR grows reasoning as a side effect of ordinary next-token prediction on arbitrary text, with no task-specific data required (note: Can models learn reasoning from predicting any text?). Latent-reasoning architectures scale test-time compute inside hidden states without ever verbalizing steps, suggesting the written chain-of-thought was a training artifact all along (note: Can models reason without generating visible thinking tokens?). And Chain of Draft shows that about 92% of a verbose reasoning trace is documentation, not computation: it reaches equal accuracy at 7.6% of the tokens (note: Can minimal reasoning chains match full explanations?). Taken together, the 'ceiling of text-only training' isn't one wall. It is an execution limit you can tool around, a grounding limit you can't, and an elicitation limit that was never really a limit at all.

Sources (12 notes)

Does the reasoning cliff depend on how we test models?

Language models show catastrophic failure in text-only reasoning benchmarks but maintain scaling when given tool access. The cliff reflects execution constraints, not reasoning capability, making text-only evaluations systematically underestimate real-world performance.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Are text-only language models fundamentally limited by abstraction?

Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can models learn reasoning from predicting any text?

Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens (3.52 match; arXiv)
When More is Less: Understanding Chain-of-Thought Length in LLMs (3.49 match; arXiv)
Hierarchical Reasoning Model (3.49 match; arXiv)
Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey (3.43 match; arXiv)
A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap (2.64 match; arXiv)
Break the Chain: Large Language Models Can be Shortcut Reasoners (2.62 match; arXiv)
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models (2.54 match; arXiv)
CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective (1.81 match; arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-capability analyst. The question remains open: *What kinds of tasks reveal whether text-only training has hit a hard ceiling, versus an elicitation or execution bottleneck?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat each as perishable until re-tested.

• **Execution vs. reasoning split**: Long multi-step arithmetic and procedural tasks show 'reasoning cliffs' when models must output steps as text, but the cliff vanishes with tool access or latent (non-verbalized) reasoning — suggesting execution bandwidth, not reasoning capacity, is the bottleneck (2024–2025).
• **Grounding limit (harder)**: Physical, geometric, and causal reasoning collapse because text strips spatial/physical dynamics from the training signal — not a compute limit but an *information* limit in the pretraining corpus (2024–2025).
• **Form vs. logic decoupling**: Chain-of-thought accuracy drops sharply when semantic content is stripped or format is out-of-distribution, and padding input length to a few thousand tokens (well below context windows) degrades reasoning from 92% to 68%, suggesting reasoning is learned *pattern imitation* not symbolic inference (2024–2025).
• **Latent reasoning scales without verbalization**: Test-time compute can scale in hidden states without verbalized steps, and, separately, Chain of Draft matches verbose CoT accuracy at ~7.6% of tokens; together these imply the written chain-of-thought was largely a training artifact (2025).
• **Procedural knowledge is already there**: Reasoning generalizes from pretraining documents; base models hold latent reasoning that fine-tuning merely unlocks, reframing ceilings as *elicitation* limits (2025).

Anchor papers (verify; mind their dates):
- arXiv:2305.14825 (2023): Semantic, not symbolic, reasoning.
- arXiv:2402.14848 (2024): Input length stress-tests reasoning below context limit.
- arXiv:2403.09629 (2024): Quiet-STaR — reasoning as side effect of language modeling.
- arXiv:2502.05171 (2025): Latent reasoning scales test-time compute in hidden states.

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For the three bottlenecks (execution, grounding, elicitation), determine whether post-2025 models, multimodal training, longer-context architectures, or novel fine-tuning have relaxed or overturned them. Separate the durable question (likely: does text-only pretraining reach a grounding ceiling?) from the perishable claim (e.g., 'CoT is pattern imitation'—has that been refuted?). Cite what resolved it.
(2) **Surface the strongest SUPERSEDING work** from the last ~6 months. The June 2025 comments and March 2026 multimodal exploration may change the entire framing; what do they say?
(3) **Propose two research questions** that assume the regime has moved: e.g., *If latent reasoning scales test-time compute, does multimodal grounding matter?* or *Can procedural knowledge in text plus visual input overcome the grounding ceiling?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
