INQUIRING LINE

Can AI invent genuinely new working designs on its own, or does it need explicit scaffolding to go beyond looking right?

Can LLMs reliably generate novel working architectures without structured representations?

This explores whether LLMs can produce genuinely new, *functioning* solutions on their own — and what role explicit structured or symbolic scaffolding plays in whether that output actually works versus just looks right.

The corpus pulls in two directions at once, and the tension is the interesting part. On one hand, structure is not absent inside these models: networks spontaneously decompose compositional tasks into isolated modular subnetworks (Do neural networks naturally learn modular compositional structure?), and they encode syntax in surprisingly clean geometric form, with syntactic type and direction recoverable from the distance and angle between embeddings (How do language models encode syntactic relations geometrically?). So in some sense LLMs *do* grow structured representations without being told to. The catch is that this learned structure is implicit and statistical, not the explicit symbolic machinery the question implies.

And that distinction is exactly where reliability breaks down. When semantics are stripped away and a task demands actual symbolic manipulation, performance collapses even with the correct rules sitting in context: the models reason by association, not formal logic (Do large language models reason symbolically or semantically?). Asked to genuinely *execute* an iterative procedure, they don't; they recognize the problem as template-similar and emit plausible-but-wrong values, a failure that persists across model scale (Do large language models actually perform iterative optimization?). Most striking is the 'potemkin' pattern: a model can explain a concept correctly, fail to apply it, and even recognize its own failure, with explanation and execution running on disconnected tracks (Can LLMs understand concepts they cannot apply?). A novel architecture that the model can describe is not a novel architecture the model can make work.

The degradation is also predictable rather than random. Linguistic competence falls off as structural depth increases (embedded clauses, nested phrases), which suggests that surface patterns are captured but deep compositional rules are not (Why do large language models fail at complex linguistic tasks?). The same boundary shows up at the systems level: long-context models can match retrieval pipelines on semantic retrieval but still can't execute relational queries that require joining across structured tables, and more context doesn't fix it (Can long-context LLMs replace retrieval-augmented generation systems?). The wall is structural, not informational.
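To make concrete what "joining across structured tables" involves, here is a minimal sketch using Python's built-in `sqlite3` module. The tables, column names, and rows are invented for illustration and are not taken from the LOFT benchmark. The point is only that the answer sits in neither table alone: it has to be computed by matching keys across both and then aggregating.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two small, made-up tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),
            amount REAL
        );
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
        INSERT INTO orders VALUES
            (1, 1, 40.0), (2, 2, 25.0), (3, 1, 15.0), (4, 2, 50.0);
        """
    )
    return conn


def top_customer_by_spend(conn: sqlite3.Connection):
    """Return (name, total) for the customer with the largest order total.

    Answering this requires a join: names live in `customers`,
    amounts live in `orders`, and only the key links them.
    """
    return conn.execute(
        """
        SELECT c.name, SUM(o.amount) AS total
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.id, c.name
        ORDER BY total DESC
        LIMIT 1
        """
    ).fetchone()


if __name__ == "__main__":
    print(top_customer_by_spend(build_demo_db()))
```

A query engine gets this right by construction, because the join is an explicit operation. A model reading both tables as flat text has to reproduce the same key matching and summation implicitly, which is the step the note above reports as failing.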

The less obvious finding is that imposing structure *externally* is what recovers the capability. Wrapping reasoning operations as sandboxed, modular tool calls raised GPT-4.1 on AIME 2024 from 26.7% to 43.3% with no additional training: the modularity enforced an operation isolation that plain prompting couldn't guarantee, eliciting reasoning that was already latent (Can modular cognitive tools unlock reasoning without training?). The latent ability existed; the structured scaffold is what made it reliable. This also reframes 'novel': genuine creativity may need combinational, exploratory, and transformational modes that current methods don't address at all (Can LLMs reason creatively beyond conventional problem-solving?), and self-improvement hits a hard ceiling set by the gap between generating an idea and verifying it, so every reliable fix needs something external to validate it (What stops large language models from improving themselves?).
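The idea of operation isolation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the cited work's implementation: the tool names, prompt templates, and the `RecordingModel` stub are all hypothetical, and a real setup would replace the stub with an actual model call. What the sketch shows is that each tool invocation sees only its own template plus the payload the orchestrator hands it, and nothing else.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for a model call: one prompt string in, text out.
ModelFn = Callable[[str], str]


@dataclass(frozen=True)
class Tool:
    name: str
    template: str  # prompt template with a single {payload} slot

    def run(self, model: ModelFn, payload: str) -> str:
        # A fresh prompt is built per call from the template and payload only,
        # so no earlier conversation state can leak into this operation.
        return model(self.template.format(payload=payload))


def run_pipeline(
    model: ModelFn, tools: Dict[str, Tool], plan: List[str], question: str
) -> List[str]:
    """Run tools in plan order, passing each output on as the next payload."""
    trace = []
    payload = question
    for name in plan:
        payload = tools[name].run(model, payload)
        trace.append(f"{name}: {payload}")
    return trace


class RecordingModel:
    """Deterministic stub in place of an LLM; it logs every prompt it receives."""

    def __init__(self) -> None:
        self.prompts: List[str] = []

    def __call__(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return f"step{len(self.prompts)}"


# Hypothetical tool set; names and templates are illustrative only.
DEMO_TOOLS = {
    "understand": Tool("understand", "Restate the problem: {payload}"),
    "examine": Tool("examine", "Check this reasoning: {payload}"),
}

if __name__ == "__main__":
    model = RecordingModel()
    for line in run_pipeline(model, DEMO_TOOLS, ["understand", "examine"], "What is 2+2?"):
        print(line)
```

The design choice worth noting is that isolation is enforced by the harness, not requested in the prompt: the second tool cannot see the original question unless the orchestrator chooses to pass it along.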

So the honest answer is no, not reliably, but the reason is subtler than 'LLMs can't.' They carry rich implicit structure and real latent ability, yet the bridge from a plausible description to a *working* novel system keeps requiring external structure: a verifier, a modular harness, a symbolic check. A final unsettling note for anyone evaluating such systems: identical outputs can hide radically different internal mechanisms, and pushing one metric like accuracy reliably degrades others like faithfulness (What really happens inside a language model?). An architecture that 'works' on the surface may not work for the reasons you think.
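The role of an external verifier can be illustrated with a small generate-then-verify sketch in Python. Everything here is hypothetical: the three hand-written candidate functions stand in for model-generated proposals, and the check against Python's built-in `sorted` stands in for whatever independent test a real system would use. The sketch shows two things: a proposal that looks right can pass a weak check, and acceptance depends on the verifier, not on the generator.

```python
import bisect
from typing import Callable, Iterable, List, Optional

Candidate = Callable[[List[int]], List[int]]


def verify_sort(candidate: Candidate, cases: Iterable[List[int]]) -> bool:
    """External check: the candidate must agree with sorted() on every case."""
    for case in cases:
        try:
            # Pass a copy so a candidate that mutates its input cannot
            # corrupt the reference case.
            if candidate(list(case)) != sorted(case):
                return False
        except Exception:
            return False
    return True


def first_verified(
    candidates: Iterable[Candidate], cases: List[List[int]]
) -> Optional[Candidate]:
    """Return the first candidate that passes verification, or None."""
    for candidate in candidates:
        if verify_sort(candidate, cases):
            return candidate
    return None


# Hypothetical "generated" proposals, in the order they were produced.
def looks_right(xs: List[int]) -> List[int]:
    return xs  # correct only when the input happens to be sorted already


def drops_duplicates(xs: List[int]) -> List[int]:
    return sorted(set(xs))  # plausible, but loses repeated values


def insertion_sort(xs: List[int]) -> List[int]:
    out: List[int] = []
    for x in xs:
        bisect.insort(out, x)  # keep `out` sorted after each insertion
    return out


CANDIDATES = [looks_right, drops_duplicates, insertion_sort]
CASES = [[1, 2, 3], [3, 1, 2], [2, 2, 1], []]

if __name__ == "__main__":
    accepted = first_verified(CANDIDATES, CASES)
    print(accepted.__name__ if accepted else "no candidate verified")
```

With only the already-sorted case `[1, 2, 3]`, `looks_right` would be accepted, which is the "looks right versus works" gap in miniature: the quality of the result is bounded by the quality of the external check.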

Sources 11 notes

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Can long-context LLMs replace retrieval-augmented generation systems?

The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.

Can LLMs reason creatively beyond conventional problem-solving?

Research identifies combinational, exploratory, and transformational reasoning as distinct creative modes grounded in cognitive science. Existing LLM reasoning methods address only conventional problem-solving, leaving creative paradigms unaddressed and potentially explaining diversity collapse in ideation.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

What really happens inside a language model?

Research into mechanistic interpretability, cognitive models, and training dynamics shows that identical benchmark performance conceals radically different internal structures. Improving one capability (helpfulness, accuracy) reliably degrades others (faithfulness, calibration, diversity).

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability researcher evaluating whether LLMs can generate novel working architectures without external structural scaffolding. The question remains open: does internal implicit structure suffice, or is external symbolic enforcement necessary?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026 and expose a consistent tension:
• LLMs spontaneously develop modular subnetworks and encode syntax in clean geometric form (polar coordinates in activations), suggesting rich *implicit* structure (2023–2025).
• Yet when semantics are stripped away, models collapse into template-matching and fail at symbolic execution, iterative procedures, and relational queries — limitations that don't fade with scale or context length (2023–2024).
• 'Potemkin understanding' is a distinct failure mode: correct explanation + failed execution + self-recognized failure running on disconnected tracks (2025).
• External structure (modular tool calls, sandboxed operations) recovers capability: GPT-4.1 rose from 26.7% to 43.3% (roughly 27%→43%) on AIME2024 with no new training, revealing latent ability unlocked by modularity (2025).
• Creative reasoning requires three paradigms (combinational, exploratory, transformational) not currently addressed; self-improvement hits a ceiling set by the generate–verify gap (2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2305.14825 (2023-05): In-Context Semantic Reasoners rather than Symbolic Reasoners
• arXiv:2406.13121 (2024-06): Long-Context LLMs and Relational Reasoning
• arXiv:2506.12115 (2025-06): Cognitive Tools and Modular Tool Calls
• arXiv:2511.20471 (2025-11): Universe of Thoughts and Creative Reasoning

Your task:
(1) RE-TEST THE STRUCTURAL DEGRADATION CLAIM. Since mid-2026, have new training methods (scaffolded pre-training, structured objectives), architectural innovations (hybrid symbolic–neural), or evaluation harnesses (mechanistic verifiers) relaxed the boundary between latent ability and reliable execution? Check whether the 27%→43% jump generalizes; does modular tool-calling remain the bottleneck, or have language-only methods closed the gap?
(2) Surface the strongest contradicting work from the last ~6 months: any papers showing LLMs generating *reliable* novel systems without external structure, or refuting the potemkin-understanding mechanism.
(3) Propose 2 research questions assuming the regime has moved: (a) If implicit structure is now sufficient for symbolic tasks, what changed in the model or training? (b) If external structure remains necessary, can we build self-improving scaffolding that doesn't require human verification?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
