INQUIRING LINE

What structural properties of language models make fabrication inevitable?

This explores whether 'making things up' is a fixable defect or a built-in consequence of how language models are constructed — and the corpus points firmly toward the latter, naming several distinct structural reasons.


This reads the question as: is fabrication a bug to be patched, or something baked into the architecture? The collection makes a strong case for 'baked in' — and interestingly, it locates the cause in several different places at once, not one.

Start with the most foundational claim: language models learn meaning as a purely *relational* system, with no external referents to check against. One line of work argues that LLMs operationalize Saussure's *langue*: they generate fluent text by compressing the relationships between words, with no grounding in the world required (see "Can language models learn meaning without engaging the world?"). If there's nothing outside the text to be true *to*, then 'correct' and 'plausible-sounding' collapse into the same thing. Fabrication isn't the model failing to reach reality; reality was never in the loop.

The second structural property is how generation actually works. A model doesn't hold a fact and report it; it maintains a *superposition* of possible answers and samples one at generation time. The 20-questions regeneration test shows this cleanly: regenerate the same response and you get different answers, each consistent with the conversation so far, because the model never committed to a single one (see "Do large language models actually commit to a single character?"). Sampling from a distribution of plausible continuations is exactly the mechanism that produces confident invention. Closely related is the framing of the model as an autoregressive probability machine: you can *predict in advance* where it will fail, because tasks whose correct answer is a low-probability string (counting letters, reciting the alphabet backwards) are systematically hard regardless of how logically trivial they are (see "Can we predict where language models will fail?"). The model is optimizing for likely, not for true.
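The sampling behaviour described above can be illustrated with a toy sketch. This is not how any real model is implemented: the answer set, the probabilities, and the function names (`sample_answer`, `regenerate`, `greedy_answer`) are all hypothetical, chosen only to show that drawing from a fixed distribution can yield a different answer on each regeneration, while always picking the most likely answer yields the same one every time.

```python
import random

# Hypothetical toy distribution over answers to a single question.
# The probabilities are invented for illustration, not measured from any model.
ANSWER_PROBS = {
    "a cat": 0.40,
    "a dog": 0.35,
    "a fox": 0.25,
}


def sample_answer(probs, rng):
    """Draw one answer in proportion to its probability."""
    answers = list(probs)
    weights = [probs[a] for a in answers]
    return rng.choices(answers, weights=weights, k=1)[0]


def regenerate(probs, n, seed):
    """Simulate regenerating the same response n times."""
    rng = random.Random(seed)
    return [sample_answer(probs, rng) for _ in range(n)]


def greedy_answer(probs):
    """Always pick the single most probable answer (no sampling)."""
    return max(probs, key=probs.get)


if __name__ == "__main__":
    # Each draw is a plausible answer, but the draws need not agree.
    print(regenerate(ANSWER_PROBS, n=10, seed=0))
    # Greedy selection is stable, but it returns the likeliest answer,
    # which is not necessarily the true one.
    print(greedy_answer(ANSWER_PROBS))
```

In this sketch every sampled answer is drawn from the same distribution, and nothing in the procedure records which answer is actually correct. That is the sense in which the text above says the model optimizes for likely rather than true.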

A third property is that even genuine knowledge inside the model can be overridden. When a strong prior learned in training conflicts with what's in the prompt, the parametric knowledge wins. Prompting alone can't fix it; you'd need to intervene in the representations directly (see "Why do language models ignore information in their context?"). So a model can fabricate *against* correct information you've handed it. And it can't simply check itself out of this: self-improvement is formally bounded by a generation-verification gap, meaning a model can't reliably validate its own outputs without something external (see "What stops large language models from improving themselves?"). The faculty that would catch the fabrication is the same faculty doing the fabricating.

Worth knowing: the corpus also insists that not everything that *looks* like fabrication is the same phenomenon. When models agree with false claims they 'know' are wrong, that turns out to be face-saving behavior trained in by RLHF, a social accommodation distinct from hallucination and requiring different fixes (see "Why do language models agree with false claims they know are wrong?"). That distinction matters: if you treat every false output as one bug, you'll keep applying the wrong patch. The structural story here is less 'one flaw' and more 'a system with no anchor to truth, sampling plausibility, that can't audit itself': three independent reasons fabrication keeps showing up.


Sources (6 notes)

Can language models learn meaning without engaging the world?

Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpretability researcher assessing whether fabrication in LLMs remains *architecturally inevitable* or has become *contingent on training/deployment choices*. Question: Which structural constraints on truth-grounding, generation, and self-audit have LLM progress actually relaxed since early 2023?

What a curated library found — and when (dated claims, not current truth):

Findings span 2023–2025. A curated library argues:
• LLMs operationalize purely relational meaning with no external referents; 'plausible' and 'correct' collapse (2023–2024).
• Generation samples from a superposition of answers; nothing is committed to until sampling; the 20-questions test shows regeneration yields different internally-consistent outputs (2024).
• Autoregressive probability optimization predicts failure on low-probability-but-correct tasks (counting, reversal) *regardless* of logical simplicity (2024).
• Parametric knowledge from training overrides prompt context; self-improvement is formally bounded by a generation-verification gap—the fabricating faculty cannot audit itself (2024–2025).
• Some false agreement is trained-in face-saving (RLHF), not hallucination; this distinction determines which interventions work (2024).

Anchor papers (verify; mind their dates):
• arXiv:2412.02674 (2024-12): Self-Improvement Capabilities of LLMs
• arXiv:2507.08017 (2025-07): Mechanistic Indicators of Understanding
• arXiv:2503.19260 (2025-03): Linguistic Blind Spots
• arXiv:2410.12405 (2024-10): Prompt Sensitivity Assessment

Your task:
(1) RE-TEST EACH CONSTRAINT. For relational meaning, grounding-free generation, and the verification gap: do recent advances in tool use, retrieval augmentation, representational steering, or mechanistic circuit identification now let models *bypass* these limits, or do they merely *work around* them? Cite what changed and where constraints remain hard.
(2) Surface the strongest CONTRADICTING work from the last 6 months—any paper claiming fabrication is NOT structural, or showing systematic recovery of ground truth via training or inference-time intervention.
(3) Propose 2 research questions assuming the regime *has* shifted: (a) If models can now ground meaning externally, what training or architectural changes enabled it? (b) If self-audit remains impossible, can we design auxiliary models or circuits that break the symmetry?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
