Do LLMs lack architectural scaffolding for compositional reasoning?
This reads the question as: do LLMs lack the built-in machinery to reliably combine reasoning steps into larger structures — and is that why bolting on external structure (algorithms, symbols, graphs) tends to help?
This explores whether compositional reasoning is something LLMs do natively or something they need to be scaffolded into, and the corpus leans toward the latter, with an interesting twist on *why*. Several notes converge on the idea that the limitation isn't a knowledge gap but a structural one. The sharpest version of this is the "comprehension without competence" finding: models can state a correct principle (87% accuracy) yet fail to apply it (64%), a dissociation between knowing and executing that looks less like ignorance and more like a missing pathway connecting the two (Can language models understand without actually executing correctly?). A related clue is that LLMs reason by semantic association rather than symbolic manipulation: when you strip the familiar meaning out of a task and leave only the logical structure, performance collapses even though the rules are right there in the prompt (Do large language models reason symbolically or semantically?). Compositional reasoning needs structure that survives when semantics are removed, and that's exactly where these models buckle.
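To make "strip the familiar meaning out of a task" concrete, here is a minimal sketch of how a semantics-decoupled version of a prompt could be produced. It is an illustration only, not the procedure used in the cited note: the function name `strip_semantics`, the `sym0`/`sym1` placeholder scheme, and the example sentence are all assumptions.

```python
import re
from typing import Dict, List


def strip_semantics(text: str, content_words: List[str]) -> str:
    """Replace each listed content word with an opaque symbol.

    The logical structure ("If X then Y. Z is X.") is preserved, but the
    familiar meaning of the words is gone. Matching is case-sensitive and
    whole-word only. (Hypothetical helper for illustration.)
    """
    # Map every content word to a stable placeholder: sym0, sym1, ...
    mapping: Dict[str, str] = {
        word: f"sym{index}" for index, word in enumerate(content_words)
    }
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in content_words) + r")\b"
    )
    return pattern.sub(lambda match: mapping[match.group(1)], text)


original = "If bird then flies. Tweety is bird."
abstracted = strip_semantics(original, ["bird", "flies"])
print(abstracted)
```

The rule and the fact are both still present in the abstracted prompt, so a system doing symbolic manipulation should answer it exactly as it answers the original. The finding summarized above is that model performance does not hold up under this kind of substitution.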
The failure shows up as a predictable *shape*, which is the tell that something architectural is going on. Reasoning models behave like "wandering explorers, not systematic searchers," so success probability drops exponentially as problems get deeper: they handle shallow and medium-depth composition but fall off a cliff when steps must stack (Why do reasoning LLMs fail at deeper problem solving?). A similar depth-dependent signature appears in language itself: top-tier models such as Llama3-70b consistently misidentify embedded clauses and nested phrases, and performance degrades predictably as syntactic depth increases (Why do large language models fail at complex linguistic tasks?). Composition is exactly the operation that requires holding nested structure together, and that's the operation that degrades.
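One simple way to see why an exponential drop with depth is the natural shape here: if every step in a chain must succeed and each succeeds independently with the same probability, the chance of finishing the whole chain is that probability raised to the number of steps. The sketch below assumes exactly that toy model (independent, equally reliable steps); the function name and the 0.9 per-step figure are illustrative assumptions, not numbers from the cited notes.

```python
def success_probability(per_step: float, depth: int) -> float:
    """Probability that all `depth` steps succeed.

    Toy assumption: steps are independent and each succeeds with
    probability `per_step`, so the result is per_step ** depth.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability in [0, 1]")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return per_step ** depth


# Print how a fixed per-step reliability compounds as depth grows.
for depth in (1, 5, 10, 20):
    print(depth, round(success_probability(0.9, depth), 3))
```

Under this model, a step that works 90% of the time yields a 10-step chain that works only about 35% of the time, which is the "fine when shallow, cliff when deep" pattern described above.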
The strongest evidence that the scaffolding is *missing* rather than merely weak comes from how much external structure helps. LLM Programs wrap models inside explicit algorithms that manage control flow and state and feed each call only the context it needs, treating reasoning as modular, debuggable sub-tasks the model can't be trusted to sequence on its own (Can algorithms control LLM reasoning better than LLMs alone?). Externalizing reasoning into knowledge-graph triples gives a small model (GPT-4o mini) a 29% improvement on GAIA Level 3 tasks, because the graph holds the compositional state the model won't (Can structuring reasoning as knowledge graphs help smaller models solve complex tasks?). And partial symbolic augmentation beats both pure language *and* full formalization, a hint that models need *some* imported structure, just not so much that it strips out the semantics they actually run on (Why does partial formalization outperform full symbolic logic?).
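The pattern shared by the first two approaches, an outer algorithm that owns control flow and stores intermediate results as triples, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the LLM Programs or KGoT implementation: `stub_model` is a hypothetical stand-in for a real model call, and the "parent" lookup task and all names are invented for the example.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call that answers one small question."""
    facts = {"alice": "bob", "bob": "carol"}
    # Take the last word of the prompt and drop the trailing question mark.
    name = prompt.rsplit(" ", 1)[-1].rstrip("?").lower()
    return facts.get(name, "unknown")


def run_program(start: str, hops: int, model: Callable[[str], str]) -> List[Triple]:
    """Multi-hop lookup where the algorithm, not the model, does the composing."""
    graph: List[Triple] = []  # external state: the model never has to hold it
    current = start
    for _ in range(hops):
        # Each call sees only the step-specific question, not the full history.
        answer = model(f"Who is the parent of {current}?")
        if answer == "unknown":
            break  # the loop decides when to stop, not the model
        graph.append((current, "parent", answer))
        current = answer
    return graph


print(run_program("alice", 3, stub_model))
```

The model is only ever asked a depth-one question. The chaining, the stopping rule, and the accumulated state all live in ordinary code, which is what makes each step inspectable and debuggable.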
Here's the thing you might not expect: "lack of scaffolding" may be the wrong frame for *where* to look. One line of work argues the real reasoning happens in hidden-state trajectories, with the visible chain-of-thought being only a partial interface to it (Where does LLM reasoning actually happen during generation?). If that's right, the compositional machinery isn't absent; it's latent and unreliable, which would explain why moving it onto an explicit algorithm or graph helps so much. So the honest answer is layered: LLMs lack *dependable* compositional scaffolding, the deficit has a clear architectural fingerprint (it worsens with depth and with semantic abstraction), and the most effective fixes here don't teach the model to compose; they move the composing outside the model.
Sources (8 notes)
Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals that this is not a knowledge deficit but a structural disconnect.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.
Knowledge Graph of Thoughts (KGoT) achieves 29% improvement on GAIA Level 3 tasks using GPT-4o mini by externalizing reasoning into iteratively constructed KG triples. The approach improves transparency, reduces bias, and enables quality control over reasoning steps.
QuaSAR and Logic-of-Thought both achieve 4-8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.
Evidence from CoT faithfulness tests, feature steering, and layer analysis suggests latent-state dynamics drive reasoning, while surface chain-of-thought serves as a partial interface. Hidden reasoning processes should be the default focus of study.