What architectural changes would let language models develop genuine functional competence?
This reads the question as asking what would have to change in how models are built — beyond just scaling next-token prediction — for them to actually do what they understand, not merely sound like they understand.
This explores what architectural changes might let language models develop genuine *functional* competence (the ability to use language to act correctly in the world) rather than the *formal* fluency they already have. The corpus is unusually blunt about the diagnosis before it gets to remedies. The core claim is that fluency and competence are not the same machinery: neuroscience evidence suggests that next-token prediction builds formal linguistic competence but never engages anything like the integrated, beyond-language brain networks that functional understanding requires (Are language models developing real functional competence or just formal competence?). A companion finding sharpens this into a 'split-brain' picture: models articulate the right principle 87% of the time but apply it correctly only 64% of the time, a 23-point gap that reflects a structural disconnect between knowing and doing, not a gap in knowledge (Can language models understand without actually executing correctly?). So the architectural question is really: how do you wire knowing to doing?
The corpus points at several distinct answers, and they disagree in interesting ways. One camp says the bottleneck isn't the model at all but the *system around it*. Turning a fluent model into a competent actor takes a pipeline transformation: action-grounded datasets, tools, memory, and a harness that determines whether an action is grounded or hallucinated (Can you turn an LLM into an agent by just fine-tuning?). This reframes 'architecture' from network internals to the whole agentic scaffold. A related limit: models can't simply think their way to competence, because reliable self-improvement is formally bounded by the generation-verification gap, so every dependable fix needs something external to validate it (What stops large language models from improving themselves?). That's a strong argument that functional competence must come from grounding signals the model can't generate for itself.
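The harness idea above can be sketched as a small generate-then-verify loop. This is only an illustration under stated assumptions: the tool registry `TOOLS`, the action dictionary shape, and the helper names `is_grounded` and `first_grounded` are all hypothetical, and a real harness would likely check far more than tool names and argument keys.

```python
from typing import Callable, Dict, Iterable, Optional, Set

# Hypothetical tool registry: tool name -> required argument names.
TOOLS: Dict[str, Set[str]] = {
    "get_weather": {"city"},
    "send_email": {"to", "body"},
}

def is_grounded(action: dict) -> bool:
    """External check: the tool must exist and the argument names must match."""
    required = TOOLS.get(action.get("tool"))
    return required is not None and set(action.get("args", {})) == required

def first_grounded(candidates: Iterable[dict],
                   verify: Callable[[dict], bool]) -> Optional[dict]:
    """Return the first candidate the verifier accepts, or None if all fail."""
    for action in candidates:
        if verify(action):
            return action
    return None

# Stand-ins for model proposals (assumed, not real model output).
candidates = [
    {"tool": "lookup_forecast", "args": {"city": "Oslo"}},  # tool does not exist
    {"tool": "get_weather", "args": {"town": "Oslo"}},      # wrong argument name
    {"tool": "get_weather", "args": {"city": "Oslo"}},      # passes the check
]

chosen = first_grounded(candidates, is_grounded)
print(chosen)
```

The point of the design is that `is_grounded` consults something the generator does not control (here, the registry), which is one simple way to read the claim that a dependable fix needs external validation.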
A second camp works inside the weights. Self-adaptive models that compose task-specific expert vectors at inference, tuning only the singular values of weight matrices so that specialized skills mix without interfering, suggest competence might come from dynamic, modular activation rather than one monolithic forward pass (Can models dynamically activate expert skills at inference time?). A different structural lever is the *training curriculum*: fine-tuning a model on reasoning tasks derived from knowledge-graph paths produced state-of-the-art domain performance, implying that how knowledge is composed matters more than raw scale (Can knowledge graphs teach models deep domain expertise?). And work on reasoning chains shows models already internally rank tokens by functional importance, preferentially preserving symbolic-computation steps over filler such as grammar and meta-discourse, a hint that the substrate of functional reasoning is partly there and could be trained for directly (Which tokens in reasoning chains actually matter most?).
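To make the singular-value idea concrete, here is a toy sketch in plain Python. It assumes a weight matrix is already factored as W = U diag(sigma) V^T, that an "expert vector" z holds one scale per singular value, and that experts are combined by a weighted sum before rebuilding the matrix. The matrices, the expert vectors `Z_MATH` and `Z_CODE`, and the 50/50 mixing weights are invented for illustration; the source notes do not say how the mixing weights are chosen.

```python
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def diag(values: List[float]) -> Matrix:
    """Square matrix with `values` on the diagonal."""
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

def mix_experts(experts: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted sum of expert vectors (one scale per singular value)."""
    return [sum(w * z[i] for w, z in zip(weights, experts))
            for i in range(len(experts[0]))]

def adapt_weight(u: Matrix, sigma: List[float], vt: Matrix,
                 z: List[float]) -> Matrix:
    """Rebuild W' = U diag(sigma * z) V^T; U and V^T stay frozen."""
    scaled = [s * zi for s, zi in zip(sigma, z)]
    return matmul(matmul(u, diag(scaled)), vt)

# Hypothetical frozen factors of a 2x2 weight matrix.
U = [[1.0, 0.0], [0.0, 1.0]]
SIGMA = [2.0, 0.5]
VT = [[0.0, 1.0], [1.0, 0.0]]

# Hypothetical expert vectors, each rescaling the same singular values.
Z_MATH = [1.5, 1.0]
Z_CODE = [1.0, 0.2]

z_mixed = mix_experts([Z_MATH, Z_CODE], [0.5, 0.5])
W_adapted = adapt_weight(U, SIGMA, VT, z_mixed)
print(W_adapted)
```

Because each expert only touches a vector as long as the list of singular values, adding a new skill in this sketch means storing one more short vector rather than another full set of weights.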
Here's what the reader might not expect: part of the deficit is *taught*, not innate. The 'grounding gap' work found LLMs perform 77.5% fewer grounding acts than humans (no clarifying questions, no acknowledgments, no understanding checks) because preference optimization actively strips those behaviors out, since raters prefer confident, complete answers (Why do language models sound fluent without grounding?). The very training step meant to align models removes the interactive moves competence depends on. That suggests one architectural change is cheap in principle: stop optimizing the competence away. Relatedly, DPO training on explicit correct/incorrect examples beats plain fine-tuning precisely because negative examples target the format-and-execution failures where knowing-but-not-doing lives (Can small models match large models on function calling?).
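For readers who want to see why negative examples matter mechanically, the sketch below computes the standard DPO loss for a single preference pair. The loss form is the usual published DPO objective; the log-probability values and the `beta` setting are made-up numbers, and the pairing of a correct function call against a malformed one is an assumed reading of the function-calling setup, not a detail taken from the source notes.

```python
import math

def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    return softplus(-margin)  # same value as -log(sigmoid(margin))

# Hypothetical sequence log-probs: chosen = well-formed call,
# rejected = malformed call, both scored under a frozen reference model.
REF_CHOSEN, REF_REJECTED = -12.0, -11.0

# Before training the policy equals the reference, so the margin is zero.
loss_before = dpo_loss(REF_CHOSEN, REF_REJECTED, REF_CHOSEN, REF_REJECTED)

# After training (assumed values) the policy raises the chosen call
# and lowers the rejected one relative to the reference.
loss_after = dpo_loss(-9.0, -15.0, REF_CHOSEN, REF_REJECTED)

print(loss_before, loss_after)
```

The loss falls only when the policy moves the two responses apart relative to the reference, so the rejected example contributes its own gradient signal. Plain supervised fine-tuning on the chosen response alone has no term that pushes the malformed output down.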
The deepest tension is philosophical. One line of work argues LLMs operationalize Saussure's *langue*: they learn meaning as pure relational structure compressed from text, with no external referents, and do so successfully (Can language models learn meaning without engaging the world?). If that's right, relational competence is real and grounding may be less necessary than the embodiment camp claims. Yet top-tier models still misidentify embedded clauses and degrade predictably as syntactic depth increases, which points to surface statistics rather than deep rules (Why do large language models fail at complex linguistic tasks?), even while o1-style step-by-step reasoning lets models build valid syntactic trees and phonological generalizations they can't produce in one shot (Can language models actually analyze language structure?). The convergent signal across the corpus: functional competence seems to need explicit reasoning steps, external verification, modular skill composition, and grounding behaviors, and the single biggest architectural change may be to stop training those last ones out.
Sources (12 notes)
Neuroscience evidence shows next-token prediction produces formal linguistic competence but not functional competence, because functional understanding requires integration of diverse brain networks beyond language circuits that the prediction objective never activates.
Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.
Converting LLMs to action-capable systems requires four distinct stages: curating action-environment-user datasets, training for action grounding, integrating agent infrastructure with memory and tools, and rigorous safety evaluation. The surrounding system and harness determine whether actions are grounded or hallucinated.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.
Fine-tuning a 32B model on 24,000 reasoning tasks derived from medical knowledge graph paths produces state-of-the-art performance across 15 medical domains, demonstrating that structured knowledge composition matters more than scale.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
LLMs generate 77.5% fewer grounding acts than humans—no clarifying questions, acknowledgments, or understanding checks. Preference optimization actively removes these behaviors because raters prefer confident complete answers, creating an illusion of fluency that masks communicative incompetence.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
OpenAI's o1 model successfully constructs syntactic trees and phonological generalizations through explicit step-by-step reasoning, revealing that LLM linguistic capability extends far beyond behavioral language tasks to genuine language analysis.