Can LLMs reliably generate novel working architectures without structured representations?
This explores whether LLMs can produce genuinely new, *functioning* solutions on their own — and what role explicit structured or symbolic scaffolding plays in whether that output actually works versus just looks right.
The corpus pulls in two directions at once, and the tension is the interesting part. On one hand, structure is not absent inside these models: networks spontaneously decompose compositional tasks into isolated modular subnetworks (Do neural networks naturally learn modular compositional structure?), and they encode syntax in surprisingly clean geometric form, with type and direction laid out in polar coordinates within their activations (How do language models encode syntactic relations geometrically?). So in some sense LLMs *do* grow structured representations without being told to. The catch is that this learned structure is implicit and statistical, not the explicit symbolic machinery the question implies.
And that distinction is exactly where reliability breaks down. When semantics are stripped away and a task demands actual symbolic manipulation, performance collapses even with the correct rules sitting in context: the models reason by association, not formal logic (Do large language models reason symbolically or semantically?). Ask them to genuinely *execute* an iterative procedure and they don't; they recognize the problem as template-similar and emit plausible-but-wrong values, a failure that doesn't go away with scale (Do large language models actually perform iterative optimization?). Most striking is the 'potemkin' pattern: a model can explain a concept correctly, fail to apply it, and even recognize its own failure, with explanation and execution running on disconnected tracks (Can LLMs understand concepts they cannot apply?). A novel architecture that the model can describe is not a novel architecture the model can make work.
The degradation is also predictable rather than random. Linguistic competence falls off as structural depth increases (embedded clauses, nested phrases), which suggests the models capture surface patterns but not deep compositional rules (Why do large language models fail at complex linguistic tasks?). The same boundary shows up at the systems level: long-context models can absorb retrieval tasks but still can't execute relational queries that require joining across structured tables, and more context doesn't fix it (Can long-context LLMs replace retrieval-augmented generation systems?). The wall is structural, not informational.
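To make "joining across structured tables" concrete, here is a minimal Python sketch of such a relational query, using the standard-library `sqlite3` module. The tables, columns, and rows (`customers`, `orders`, the city names) are hypothetical and are not taken from the LOFT benchmark. The point is only that the answer appears nowhere in the raw rows; it exists only after the join is actually executed.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT)")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "Oslo"), (2, "Lima"), (3, "Oslo")],
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(10, 1), (11, 3), (12, 2), (13, 3)],
    )
    conn.commit()
    return conn


def orders_per_city(conn):
    """Count orders per city. No single table holds this answer:
    each order must be matched to its customer before grouping."""
    return conn.execute(
        """
        SELECT c.city, COUNT(*) AS n_orders
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        GROUP BY c.city
        ORDER BY c.city
        """
    ).fetchall()


if __name__ == "__main__":
    print(orders_per_city(build_demo_db()))
```

Retrieving either table into context is the easy part. The join is a procedure that has to be carried out row by row, which is the kind of work the paragraph above says more context does not supply.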
Here's the part you might not expect: imposing structure *externally* is what recovers the capability. Wrapping reasoning operations as sandboxed, modular tool calls lifted GPT-4.1 on AIME2024, a hard math benchmark, from 26.7% to 43.3% with no additional training. The modularity enforced an operation isolation that plain prompting couldn't guarantee, eliciting reasoning that was already latent (Can modular cognitive tools unlock reasoning without training?). The latent ability existed; the structured scaffold is what made it reliable. This also reframes 'novel': genuine creativity may need combinational, exploratory, and transformational modes that current methods don't address at all (Can LLMs reason creatively beyond conventional problem-solving?), and self-improvement hits a hard ceiling set by the gap between generating an idea and verifying it, so every reliable fix needs something external to validate it (What stops large language models from improving themselves?).
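One way to picture "operation isolation" is a harness in which each reasoning step is a separate call that sees only its own instructions and an explicit payload, never the full conversation. The sketch below is a rough illustration under stated assumptions, not the setup from the cited work: `ModelFn`, `Tool`, `run_pipeline`, and the tool names in the usage example are all hypothetical, and the model is a stand-in function rather than a real API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical model interface: prompt string in, completion string out.
ModelFn = Callable[[str], str]


@dataclass(frozen=True)
class Tool:
    """One isolated reasoning operation (names and instructions are illustrative)."""
    name: str
    instructions: str

    def run(self, model: ModelFn, payload: str) -> str:
        # A fresh prompt per call: the tool sees only its own instructions
        # and the payload it is handed, not any accumulated history.
        prompt = f"[{self.name}]\n{self.instructions}\n\nINPUT:\n{payload}"
        return model(prompt)


def run_pipeline(model: ModelFn, tools: List[Tool], question: str) -> Dict[str, str]:
    """Run tools in order, passing each one the question plus only the
    immediately preceding tool's output."""
    results: Dict[str, str] = {}
    payload = question
    for tool in tools:
        output = tool.run(model, payload)
        results[tool.name] = output
        payload = f"{question}\n\nPrevious step ({tool.name}):\n{output}"
    return results


if __name__ == "__main__":
    def echo_model(prompt: str) -> str:
        # Stand-in for a real model call; returns a short marker string.
        return f"<{len(prompt)} chars seen>"

    steps = [
        Tool("restate", "Restate the problem in your own words."),
        Tool("check", "Check the previous step for errors."),
    ]
    print(run_pipeline(echo_model, steps, "What is 2 + 2?"))
```

The design choice worth noticing is what each call cannot see. Because context is passed explicitly, an error or distraction in one step cannot leak into a later one through shared history, which is one plausible reading of why modular calls behave differently from a single long prompt.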
So the honest answer is no, not reliably, but the reason is subtler than 'LLMs can't.' They carry rich implicit structure and real latent ability, yet the bridge from a plausible description to a *working* novel system keeps requiring external structure: a verifier, a modular harness, a symbolic check. And a final unsettling note for anyone evaluating such systems: identical outputs can hide radically different internal mechanisms, and pushing one metric like accuracy reliably degrades others like faithfulness (What actually happens inside a language model?). An architecture that 'works' on the surface may not work for the reasons you think.
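The "external structure" named above can be sketched as a generate-then-verify loop, in which nothing a generator proposes is accepted until an independent check passes. This is a minimal illustration under assumptions, not a method from the cited notes: `first_verified`, the two candidate functions, and `verify_sort` are hypothetical names, and the candidates stand in for model-proposed solutions.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def first_verified(
    candidates: Iterable[T],
    verify: Callable[[T], bool],
    budget: int = 10,
) -> Optional[T]:
    """Return the first candidate that passes the external check,
    or None if the attempt budget runs out first."""
    for attempt, candidate in enumerate(candidates):
        if attempt >= budget:
            break
        if verify(candidate):
            return candidate
    return None


# Two stand-in "proposals" for a sorting routine.
def plausible_but_wrong(xs):
    # Looks right on already-sorted input, but never reorders anything.
    return list(xs)


def working(xs):
    return sorted(xs)


def verify_sort(fn) -> bool:
    """External verifier: run the candidate on concrete cases."""
    cases = [[3, 1, 2], [], [2, 2, 1]]
    return all(fn(case) == sorted(case) for case in cases)


if __name__ == "__main__":
    chosen = first_verified([plausible_but_wrong, working], verify_sort)
    print(chosen.__name__ if chosen else "no candidate verified")
```

The verifier here is deliberately dumb and deterministic. It does not need to know how a candidate was produced or whether its description sounded convincing; it only checks whether the thing works.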
Sources (11 notes)
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.
Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.
Research identifies combinational, exploratory, and transformational reasoning as distinct creative modes grounded in cognitive science. Existing LLM reasoning methods address only conventional problem-solving, leaving creative paradigms unaddressed and potentially explaining diversity collapse in ideation.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
Research shows that LLMs can achieve the same output through different internal mechanisms, and improvements in one dimension like accuracy reliably degrade others like faithfulness and calibration. Internal structure matters even when behavior appears identical.