INQUIRING LINE

Models know the rule but can't reliably follow it — what architectural changes would bridge knowing and doing?

What architectural changes would let language models develop genuine functional competence?

This reads the question as asking what would have to change in how models are built — beyond just scaling next-token prediction — for them to actually do what they understand, not merely sound like they understand.

This explores what architectural changes might let language models develop genuine *functional* competence, meaning the ability to use language to act correctly in the world, rather than the *formal* fluency they already have. The corpus is unusually blunt about the diagnosis before it gets to remedies. The core claim is that fluency and competence are not the same machinery: neuroscience evidence suggests next-token prediction builds formal linguistic competence but never activates the integrated brain networks that functional understanding requires (see "Are language models developing real functional competence or just formal competence?"). A companion finding sharpens this into a 'split-brain' picture: models articulate the right principle 87% of the time but apply it correctly only 64% of the time, which is a structural disconnect between knowing and doing, not a gap in knowledge (see "Can language models understand without actually executing correctly?"). So the architectural question is really: how do you wire knowing to doing?

The corpus points at several distinct answers, and they disagree in interesting ways. One camp says the bottleneck isn't the model at all but the *system around it*. Turning a fluent model into a competent actor takes pipeline transformation: action-grounded datasets, tools, memory, and a harness that determines whether an action is grounded or hallucinated (see "Can you turn an LLM into an agent by just fine-tuning?"). This reframes 'architecture' from network internals to the whole agentic scaffold. A related limit: models can't simply think their way to competence, because reliable self-improvement is formally bounded by the generation-verification gap, so every dependable fix needs something external to validate it (see "What stops large language models from improving themselves?"). That's a strong argument that functional competence must come from grounding signals the model can't generate for itself.
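To make the "harness decides what counts as grounded" idea concrete, here is a minimal sketch of an external check that sits between a model's proposed action and its execution. Everything in it is an assumption made for illustration: `ToolSpec`, `check_action`, `run_action`, the `TOOLS` registry, and the action format (`{"tool": ..., "args": ...}`) are hypothetical names, not an API described in the notes.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical description of one tool the harness exposes."""
    func: Callable[..., Any]
    required_args: frozenset


def check_action(action: dict, tools: dict) -> tuple:
    """Validate a proposed action against the tool registry.

    The check lives outside the model, so a fluent but invented tool
    name or argument list is rejected before anything runs.
    """
    name = action.get("tool")
    if name not in tools:
        return False, f"unknown tool: {name!r}"
    args = set(action.get("args", {}))
    missing = tools[name].required_args - args
    if missing:
        return False, f"missing args: {sorted(missing)}"
    extra = args - tools[name].required_args
    if extra:
        return False, f"unexpected args: {sorted(extra)}"
    return True, "grounded"


def run_action(action: dict, tools: dict) -> dict:
    """Execute the action only if the external check accepts it."""
    ok, reason = check_action(action, tools)
    if not ok:
        return {"status": "rejected", "reason": reason}
    result = tools[action["tool"]].func(**action["args"])
    return {"status": "ok", "result": result}


# Illustrative registry with a single toy tool.
TOOLS = {
    "add": ToolSpec(func=lambda a, b: a + b,
                    required_args=frozenset({"a", "b"})),
}
```

A call such as `run_action({"tool": "teleport", "args": {}}, TOOLS)` is rejected by the harness rather than by the model, which is the sense in which validation has to come from outside the generator.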

A second camp works inside the weights. Self-adaptive models that compose task-specific expert vectors at inference, tuning only singular values so specialized skills mix without interfering, suggest competence might come from dynamic, modular activation rather than one monolithic forward pass (see "Can models dynamically activate expert skills at inference time?"). A different structural lever is the *training curriculum*: feeding a model reasoning tasks derived from knowledge-graph paths produced state-of-the-art domain performance, implying that how knowledge is composed matters more than raw scale (see "Can knowledge graphs teach models deep domain expertise?"). And work on reasoning chains shows models already internally rank tokens by functional importance, preferentially preserving symbolic-computation steps over filler, a hint that the substrate of functional reasoning is partly there and could be trained for directly (see "Which tokens in reasoning chains actually matter most?").
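The singular-value idea mentioned above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the published implementation: the function names (`decompose`, `apply_expert`, `mix_experts`), the toy 4x3 weight matrix, the two expert vectors, and the 0.7/0.3 mixing weights are all hypothetical. The only ideas taken from the notes are that the singular values alone are tuned and that expert vectors are mixed at inference.

```python
import numpy as np


def decompose(weight):
    """Factor a weight matrix once; U and Vt stay frozen afterwards."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return u, s, vt


def apply_expert(u, s, vt, z):
    """Rebuild the weight with each singular value rescaled by z.

    (u * (s * z)) @ vt equals u @ diag(s * z) @ vt, so the only
    per-expert parameters are the entries of z.
    """
    return (u * (s * z)) @ vt


def mix_experts(expert_vectors, mix_weights):
    """Linearly combine expert vectors (weights assumed to sum to 1)."""
    return np.asarray(mix_weights) @ np.stack(expert_vectors)


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))          # toy stand-in for one weight matrix
u, s, vt = decompose(W)

z_math = np.array([1.2, 1.0, 0.8])   # hypothetical "math" expert vector
z_code = np.array([0.9, 1.1, 1.0])   # hypothetical "code" expert vector

z = mix_experts([z_math, z_code], [0.7, 0.3])
W_adapted = apply_expert(u, s, vt, z)
```

Because each expert is just a vector the length of the singular-value spectrum, blending two skills is a weighted sum of small vectors rather than a merge of full weight matrices, which is one plausible reading of why the skills are said to mix without interfering.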

Here's what the reader might not expect: part of the deficit is *taught*, not innate. The 'grounding gap' work found LLMs perform 77.5% fewer grounding acts than humans (no clarifying questions, no understanding checks) because preference optimization actively strips those behaviors out, since raters prefer confident, complete answers (see "Why do language models sound fluent without grounding?"). The very training step meant to align models removes the interactive moves competence depends on. That suggests one architectural change is cheap in principle: stop optimizing the competence away. Relatedly, DPO training on explicit correct/incorrect examples beats plain fine-tuning precisely because negative examples target the format-and-execution failures where knowing-but-not-doing lives (see "Can small models match large models on function calling?").
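To show why explicit negative examples matter in DPO, here is a small sketch of the standard DPO loss for a single preference pair, written with only the standard library. The log-probabilities used to exercise it are made-up numbers, `beta=0.1` is an assumed value, and nothing here reproduces the teacher-model or function-calling setup described in the notes.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the summed log-probability of a full response
    under either the policy being trained or the frozen reference.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is the same quantity as softplus(-logits).
    return softplus(-logits)


# Made-up log-probabilities for a correct and a malformed function call.
neutral = dpo_loss(-12.0, -12.0, -12.0, -12.0)
improved = dpo_loss(-10.0, -15.0, -12.0, -12.0)
regressed = dpo_loss(-15.0, -10.0, -12.0, -12.0)
```

The loss falls only when the policy raises the correct call relative to the incorrect one, measured against the reference model. Plain supervised fine-tuning has no term for the rejected response, so it never receives that contrastive signal.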

The deepest tension is philosophical. One line of work argues LLMs operationalize Saussure's *langue*: they learn meaning as pure relational structure compressed from text, with no external referents, and do so successfully (see "Can language models learn meaning without engaging the world?"). If that's right, relational competence is real and grounding may be less necessary than the embodiment camp claims. Yet the same models still misidentify embedded clauses and degrade predictably as syntactic depth increases, which points to surface statistics rather than deep rules (see "Why do large language models fail at complex linguistic tasks?"), even while o1-style step-by-step reasoning lets them build valid syntactic trees and phonological generalizations they can't produce in one shot (see "Can language models actually analyze language structure?"). The convergent signal across the corpus: functional competence seems to need explicit reasoning steps, external verification, modular skill composition, and grounding behaviors, and the single biggest architectural change may be to stop training those last ones out.

Sources (12 notes)

Are language models developing real functional competence or just formal competence?

Neuroscience evidence shows next-token prediction produces formal linguistic competence but not functional competence, because functional understanding requires integration of diverse brain networks beyond language circuits that the prediction objective never activates.

Can language models understand without actually executing correctly?

Large language models can articulate correct principles but systematically fail to apply them because their instruction and execution pathways are dissociated. The 87% accuracy in explanations versus 64% in actions reveals that this is not a knowledge deficit but a structural disconnect.

Can you turn an LLM into an agent by just fine-tuning?

Converting LLMs to action-capable systems requires four distinct stages: curating action-environment-user datasets, training for action grounding, integrating agent infrastructure with memory and tools, and rigorous safety evaluation. The surrounding system and harness determine whether actions are grounded or hallucinated.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Can knowledge graphs teach models deep domain expertise?

Fine-tuning a 32B model on 24,000 reasoning tasks derived from medical knowledge graph paths produces state-of-the-art performance across 15 medical domains, demonstrating that structured knowledge composition matters more than scale.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Why do language models sound fluent without grounding?

LLMs generate 77.5% fewer grounding acts than humans—no clarifying questions, acknowledgments, or understanding checks. Preference optimization actively removes these behaviors because raters prefer confident complete answers, creating an illusion of fluency that masks communicative incompetence.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Can language models learn meaning without engaging the world?

Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Can language models actually analyze language structure?

OpenAI's o1 model successfully constructs syntactic trees and phonological generalizations through explicit step-by-step reasoning, revealing that LLM linguistic capability extends far beyond behavioral language tasks to genuine language analysis.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey (4.17 match)
Probing Structured Semantics Understanding and Generation of Language Models via Question Answering (3.42 match)
Linguistic Blind Spots of Large Language Models (1.75 match)
Large Language Model Reasoning Failures (1.73 match)
Comprehension Without Competence: Architectural Limits of LLMs in Symbolic Computation and Reasoning (1.70 match)
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation (1.70 match)
Large Linguistic Models: Investigating LLMs' metalinguistic abilities (1.70 match)
From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning (1.67 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

As an AI researcher, evaluate whether architectural constraints on functional competence in LLMs have been relaxed or remain binding. A curated library (spanning 2023–2026) identified these dated claims:

**What a curated library found — and when:**
• Fluency and functional competence are neurologically distinct; next-token prediction builds formal competence only (~2023).
• Models articulate principles 87% of the time but apply them correctly 64% of the time — a structural knowing-doing gap (~2024–2025).
• Preference optimization strips grounding behaviors; models perform 77.5% fewer grounding acts than humans (~2025).
• DPO training on explicit correct/incorrect examples outperforms plain fine-tuning on function-calling (~2024).
• Self-adaptive models composing expert vectors at inference show promise for dynamic, modular skill activation (~2025).

**Anchor papers (verify; mind their dates):**
• arXiv:2301.06627 (2023) — Dissociating language and thought
• arXiv:2507.10624 (2025) — Comprehension Without Competence
• arXiv:2501.06252 (2025) — Transformer²: Self-adaptive LLMs
• arXiv:2507.13966 (2025) — Knowledge-graph curriculum and domain superintelligence

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For the knowing-doing gap, grounding deficit, and modular composition bottleneck: has newer reasoning (o1-style, test-time compute scaling, or hybrid architectures), tooling (multi-agent orchestration, formal verifiers), or training (reinforcement learning on execution traces, outcome-based rewards) since closed these gaps? Separate durable architectural questions from perishable limitations and cite what resolved them.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last 6 months. Does any recent paper argue fluency-to-competence bridging requires no architectural change, or that grounding is less necessary than claimed?
(3) **Propose 2 research questions** that assume the regime may have shifted — e.g., given test-time scaling, does the curriculum bottleneck dissolve? Given multimodal grounding, does the knowing-doing gap shrink?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
