INQUIRING LINE

Do LLMs lack architectural scaffolding for compositional reasoning?

This reads the question as: do LLMs lack the built-in machinery to reliably combine reasoning steps into larger structures — and is that why bolting on external structure (algorithms, symbols, graphs) tends to help?


This explores whether compositional reasoning is something LLMs do natively or something they need to be scaffolded into. The corpus leans toward the latter, with an interesting twist on *why*. Several notes converge on the idea that the limitation isn't a knowledge gap but a structural one. The sharpest version of this is the "comprehension without competence" finding: models can state a correct principle (87% accuracy) yet fail to apply it (64%), a dissociation between knowing and executing that looks less like ignorance and more like a missing pathway connecting the two (see "Can language models understand without actually executing correctly?"). A related clue is that LLMs reason by semantic association rather than symbolic manipulation: when you strip the familiar meaning out of a task and leave only the logical structure, performance collapses even though the rules are right there in the prompt (see "Do large language models reason symbolically or semantically?"). Compositional reasoning needs structure that survives when semantics are removed, and that's exactly where these models buckle.

The failure shows up as a predictable *shape*, which is the tell that something architectural is going on. Reasoning models behave like "wandering explorers, not systematic searchers," so success probability drops exponentially as problems get deeper: they handle shallow composition but fall off a cliff when steps must stack (see "Why do reasoning LLMs fail at deeper problem solving?"). A similar depth-sensitive signature appears in language itself: top models such as Llama3-70b consistently misparse embedded clauses and nested phrases, and performance degrades predictably as syntactic depth increases (see "Why do large language models fail at complex linguistic tasks?"). Composition is exactly the operation that requires holding nested structure together, and that's the operation that degrades.
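To make the "cliff" concrete, here is a minimal sketch under a simplifying assumption that does not come from the cited notes: every reasoning step succeeds independently with the same probability. The function name `chain_success` and the 0.95 per-step figure are illustrative only, not measured values.

```python
def chain_success(p_step: float, depth: int) -> float:
    """Probability that all `depth` steps succeed.

    Assumes (hypothetically) that steps are independent and equally
    reliable, so the chance of a fully correct chain is p_step ** depth.
    """
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be in [0, 1]")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return p_step ** depth


if __name__ == "__main__":
    # Illustrative per-step reliability; not taken from any cited paper.
    for depth in (1, 5, 10, 20, 40):
        print(f"depth={depth:>2}  success={chain_success(0.95, depth):.3f}")
```

Under this toy model even a step that is right 95% of the time yields a chain that is right well under half the time by depth 20, which is one way a model can look competent on shallow problems and still fail on deep ones.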

The strongest evidence that the scaffolding is *missing* rather than merely weak comes from how much external structure helps. LLM Programs wrap models inside explicit algorithms that manage control flow and state and feed each call only the context it needs, treating reasoning as modular, debuggable sub-tasks the model can't be trusted to sequence on its own (see "Can algorithms control LLM reasoning better than LLMs alone?"). Externalizing reasoning into knowledge-graph triples lets a small model (GPT-4o mini) improve by 29% on GAIA Level 3 tasks, because the graph holds the compositional state the model won't (see "Can structuring reasoning as knowledge graphs help smaller models solve complex tasks?"). And partial symbolic augmentation beats both pure language *and* full formalization, a hint that models need *some* imported structure, just not so much that it strips out the semantics they actually run on (see "Why does partial formalization outperform full symbolic logic?").
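The idea of moving control flow and state outside the model can be sketched in a few lines. This is a toy illustration, not the implementation of LLM Programs or KGoT: `run_program`, the `"$prev"` placeholder, and the lookup-table `toy_model` are all hypothetical names invented here, and the model is any callable from prompt string to answer string.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)
Model = Callable[[str], str]   # hypothetical stand-in for an LLM call


def run_program(steps: List[Tuple[str, str]], model: Model) -> List[Triple]:
    """Resolve a chain of (subject, relation) hops, one model call per hop.

    The loop, not the model, owns control flow and state. Each prompt
    contains only the single hop being resolved, never the full history.
    """
    graph: List[Triple] = []
    for subject, relation in steps:
        # "$prev" refers to the object produced by the previous hop.
        if subject == "$prev":
            if not graph:
                raise ValueError("first step cannot reference $prev")
            subject = graph[-1][2]
        prompt = f"{relation} of {subject}?"  # step-specific context only
        answer = model(prompt).strip()
        graph.append((subject, relation, answer))  # state lives outside the model
    return graph


def toy_model(prompt: str) -> str:
    """Lookup-table stub so the sketch runs without any API; not a real LLM."""
    facts = {
        "author of Dune?": "Frank Herbert",
        "birthplace of Frank Herbert?": "Tacoma",
    }
    return facts.get(prompt, "unknown")


if __name__ == "__main__":
    hops = [("Dune", "author"), ("$prev", "birthplace")]
    for triple in run_program(hops, toy_model):
        print(triple)
```

The design choice worth noticing is that the two-hop composition never has to be performed by the model: each call answers one single-hop question, and the list of triples carries the intermediate result forward.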

Here's the thing you might not expect: "lack of scaffolding" may be the wrong frame for *where* to look. One line of work argues the real reasoning happens in hidden-state trajectories, with the visible chain-of-thought being only a partial interface to it (see "Where does LLM reasoning actually happen during generation?"). If that's right, the compositional machinery isn't absent; it's latent and unreliable, which is why externalizing it onto an explicit algorithm or graph makes it suddenly work. So the honest answer is layered: LLMs lack *dependable* compositional scaffolding, the deficit has a clear architectural fingerprint (it worsens with depth and with semantic abstraction), and the most effective fixes don't teach the model to compose. They move the composing outside the model entirely.


Sources (8 notes)

Can language models understand without actually executing correctly?

Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Why do reasoning LLMs fail at deeper problem solving?

Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Can algorithms control LLM reasoning better than LLMs alone?

LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.

Can structuring reasoning as knowledge graphs help smaller models solve complex tasks?

Knowledge Graph of Thoughts (KGoT) achieves 29% improvement on GAIA Level 3 tasks using GPT-4o mini by externalizing reasoning into iteratively constructed KG triples. The approach improves transparency, reduces bias, and enables quality control over reasoning steps.

Why does partial formalization outperform full symbolic logic?

QuaSAR and Logic-of-Thought both achieve 4-8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.

Where does LLM reasoning actually happen during generation?

Evidence from CoT faithfulness tests, feature steering, and layer analysis suggests latent-state dynamics drive reasoning, while surface chain-of-thought serves as a partial interface. Hidden reasoning processes should be the default focus of study.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher re-testing whether LLMs lack architectural scaffolding for compositional reasoning—a question a curated library addressed from 2023–2026, but the claimed constraints may have shifted. Treat the findings below as dated; your job is to surface what has changed.

What a curated library found — and when (dated claims, not current truth):
• Comprehension–competence dissociation: models state principles correctly (87%) but fail application (64%), suggesting a structural gap rather than knowledge deficit (2025-07).
• Performance collapses when semantics are stripped, revealing dependence on semantic association over symbolic manipulation; compositional reasoning needs structure surviving semantic removal (2023-05).
• Exponential depth-penalty signature: reasoning quality drops steeply with nesting depth; error rate on embedded clauses climbs lockstep with syntactic depth (2025-03, 2025-05).
• External scaffolding yields large gains: knowledge-graph externalization (+29% on GAIA, GPT-4o mini); LLM Programs decompose reasoning into modular sub-tasks (2025-04, undated reference).
• Reasoning may be latent in hidden-state trajectories, not visible in chain-of-thought; compositional machinery could be unreliable rather than absent (2026-04).

Anchor papers (verify; mind their dates):
• arXiv:2305.14825 (2023-05): In-context semantic vs. symbolic reasoning
• arXiv:2503.19260 (2025-03): Linguistic blind spots worsening with structure
• arXiv:2507.10624 (2025-07): Comprehension without competence
• arXiv:2604.15726 (2026-04): Reasoning as latent trajectory

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For the depth penalty, semantic-dependence, and comprehension–competence split: have newer reasoning models (o1 successors, math-specialized variants, or models with explicit planning layers) relaxed or overturned these limits? Cite what changed. Separately, has external scaffolding matured (better orchestration SDKs, agent frameworks) such that the *practical* scaffolding burden has shifted?
(2) **Surface strongest contradicting work from last 6 months.** Has any recent paper claimed that compositionality *emerges* natively at scale, or that depth penalties are a training artifact rather than architectural? Flag disagreements head-on.
(3) **Propose 2 research questions assuming regime shift:** (a) If reasoning is latent, how should we design evals and scaffolding to *elicit* rather than *impose* structure? (b) If semantic dependence is the bottleneck, do domain-specialized token vocabularies or abstraction-aware architectures dissolve it?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
