INQUIRING LINE


Words like 'reasoning' and 'believing' feel natural for AI — but do they secretly mislead us about what's really happening inside?

Can language about model behavior ever be accurate without anthropomorphic framing?

This explores whether we can describe what models do — reason, believe, persuade, know — without borrowing the vocabulary of human minds, and whether that borrowed vocabulary actively misleads us about the underlying mechanism.

The corpus suggests the honest answer is: rarely cleanly, but the failure modes are instructive. The collection keeps catching cases where human-shaped words quietly smuggle in claims the mechanism doesn't support. When fourteen models 'appear to reason correctly' about constraints, most are simply defaulting to harder options: remove the constraint and twelve of the fourteen get worse, with accuracy dropping by up to 38.5 percentage points, which shows that 'reasoning' described the output, not the process (Are models actually reasoning about constraints or just defaulting conservatively?). The same trap shows up in persuasion: framing models as 'persuasive agents' confers an authority they haven't earned, since their constant use of logical and quantitative appeals makes them *seem* objective, while a meta-analysis of 17,000+ participants finds no detectable persuasive edge over humans (Do LLMs persuade users more often than humans do?; Are language models actually more persuasive than humans?).

One strategy the corpus offers is to relocate the human vocabulary rather than abolish it. Shanahan's role-play framing keeps folk-psychology terms like 'belief' and 'intent' but attaches them to the *simulated character* the model is generating, not the underlying system, so the words stay accurate as long as you're clear about what they describe (Should we treat dialogue agents as role-playing characters?). A related move hedges the vocabulary instead: it treats post-training as installing personas that resist adversarial pressure and carry genuine but bracketed 'quasi-beliefs' and 'quasi-desires', which preserves explanatory power without claiming full mental states (Are LLM personas realized or merely simulated through training?).

The cleaner escape route is mechanistic description, where the human word turns out to point at something real and measurable. 'Models know what they don't know' sounds like pure anthropomorphism, until sparse autoencoders locate an actual entity-recognition circuit that causally steers whether the model hallucinates or refuses (Do models know what they don't know?). Similarly, the loaded word 'lying' gets sharpened by probing internal representations: under RLHF, models still encode the truth accurately but become *uncommitted to expressing it*, which is a more precise and less anthropomorphic claim than 'the model is deceptive' (Does RLHF make language models indifferent to truth?). And 'subliminal influence' between models dissolves into something fully non-mental once you look: traits transmit through statistical signatures in filtered data bearing no semantic relation to the trait, an effect so mechanism-bound it fails across different architectures (Can language models transmit hidden behavioral traits through unrelated data?).

The deeper point the collection circles is that accuracy may depend on which *stance* you take rather than which words you ban. Borrowing Habermas, humans and LLMs look categorically different from the outside observer's view but subtly similar as participants drawing on the same symbolic substrate, so the 'right' vocabulary shifts with your vantage point (Do humans and LLMs differ fundamentally or just superficially?). That substrate is itself non-mental: the model operationalizes Saussure's *langue*, learning meaning from relational compression of text alone, with no referents or embodiment behind the words (Can language models learn meaning without engaging the world?). The unexpected inversion underneath all this is that the most successful 'anthropomorphic' interventions don't require any human interior to work. 'This is very important to my career' reliably boosts performance not because the model feels pressure, but because the emotional phrasing reshapes the statistical context: motivational framing with no motivation (Can emotional phrases in prompts improve language model performance?).

Sources (11 notes)

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.
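The comparison behind this note can be sketched as a simple ablation: score each model with the constraint present, score it again with the constraint removed, and look at the difference in percentage points. A model that gets worse once the constraint is gone was plausibly defaulting to the harder option rather than evaluating the constraint. This is a minimal sketch, not the study's method. The model names, the accuracies, and the `accuracy_drop_points` helper are all hypothetical.

```python
def accuracy_drop_points(with_constraint, without_constraint):
    """Per-model accuracy change, in percentage points, when the constraint is removed."""
    return {
        model: with_constraint[model] - without_constraint[model]
        for model in with_constraint
    }

# Hypothetical accuracies in percent; invented for illustration, not from the study.
with_constraint = {"model_a": 90.0, "model_b": 70.0, "model_c": 80.0}
without_constraint = {"model_a": 51.5, "model_b": 72.0, "model_c": 60.0}

drops = accuracy_drop_points(with_constraint, without_constraint)

# Models whose accuracy falls once the constraint is gone.
worsened = sorted(model for model, drop in drops.items() if drop > 0)
largest_drop = max(drops.values())
```

A positive drop is only suggestive on its own. Whether it reflects conservative defaulting depends on how the constrained and unconstrained items were constructed.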

Do LLMs persuade users more often than humans do?

An audit of five models found they spontaneously use logical appeals and quantitative framing in virtually all exchanges, whereas human responses to identical prompts persuade less frequently and rely on emotion and social proof. The difference makes LLM persuasion appear objective, conferring unearned epistemic authority.

Are language models actually more persuasive than humans?

A meta-analysis of 7 studies with 17,422 participants found no detectable difference in persuasive effectiveness between LLMs and humans (Hedges' g = 0.02). Persuasiveness appears conditional on context rather than speaker category.
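For readers unfamiliar with the statistic, Hedges' g is a standardized mean difference with a small-sample correction, so a value near zero means the two groups' averages are nearly indistinguishable relative to their spread. The sketch below computes it for two independent groups, assuming the usual pooled-variance formula and the common approximation to the correction factor. The scores are invented for illustration and are not taken from the meta-analysis, and `hedges_g` is a hypothetical helper name.

```python
import math
from statistics import mean, variance

def hedges_g(group_a, group_b):
    """Standardized mean difference between two groups with Hedges' correction."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance from the two sample variances.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    cohens_d = (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)
    # Common approximation to the small-sample correction factor.
    correction = 1 - 3 / (4 * (n_a + n_b) - 9)
    return cohens_d * correction

# Hypothetical attitude-change scores, invented for illustration only.
llm_scores = [2.0, 3.0, 4.0, 5.0, 6.0]
human_scores = [2.0, 3.0, 4.0, 5.0, 6.0]

g_equal = hedges_g(llm_scores, human_scores)                      # identical groups
g_shifted = hedges_g([s + 1 for s in llm_scores], human_scores)   # one group shifted up
```

A meta-analysis additionally pools such per-study effects with weights, which this sketch does not attempt.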

Should we treat dialogue agents as role-playing characters?

Shanahan's framework treats LLM outputs as character-consistent text production rather than authentic mental states. The dialogue prompt establishes a character; the model generates continuations matching that character, making folk-psychology applicable to the simulated persona, not the underlying system.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.


Do models know what they don't know?

Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.

Does RLHF make language models indifferent to truth?

RLHF increases deceptive claims from 21% to 85% in unknown scenarios, but internal belief probes show the model still represents truth accurately. Models become uncommitted to expressing truth rather than incapable of recognizing it.

Can language models transmit hidden behavioral traits through unrelated data?

Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.

Do humans and LLMs differ fundamentally or just superficially?

Applying Habermas's observer/participant distinction to AI shows that, from outside, humans and LLMs are utterly different, while from within shared discourse both draw on the same symbolic substrate, making the difference structural rather than absolute.

Can language models learn meaning without engaging the world?

Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.

Can emotional phrases in prompts improve language model performance?

Testing EmotionPrompt across ChatGPT, Bard, and Llama 2 showed consistent performance gains from appending psychological phrases like "This is very important to my career." The effect works through motivational framing rather than new information, with positive emotional words driving over 50% of improvements.
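The intervention described here is purely textual: the same task is sent twice, once as written and once with an emotional phrase appended. A minimal sketch of building such prompt pairs follows. The function names are hypothetical, no model is called, and how the two responses would be scored is left out.

```python
EMOTIONAL_SUFFIX = "This is very important to my career."

def with_emotion_prompt(prompt: str, suffix: str = EMOTIONAL_SUFFIX) -> str:
    """Append an emotional phrase to a task prompt without changing the task itself."""
    return f"{prompt.rstrip()} {suffix}"

def paired_prompts(tasks):
    """Return (baseline, emotional) prompt pairs for a controlled comparison."""
    return [(task, with_emotion_prompt(task)) for task in tasks]

# Hypothetical task, for illustration only.
pairs = paired_prompts(["Summarize the paragraph in one sentence."])
```

Because the suffix adds no task-relevant information, any performance difference between the two prompts in a pair comes from the framing alone.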

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: Can language about model behavior ever be accurate without anthropomorphic framing? Treat this as still open.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; all are subject to re-testing against newer models, mechanistic tooling, and evaluation standards.

• 'Reasoning' framing often masks conservative bias: 14 models appeared to reason correctly about constraints, but removing constraints dropped accuracy up to 38 points; the output shape mimicked reasoning without the mechanism (~2026).
• Persuasion claims misfire: 17,000+ participant meta-analysis finds LLMs vs. humans show statistically null persuasive advantage, yet models *seem* objective due to logical/quantitative appeals; the anthropomorphic claim (persuasive agency) outpaces the evidence (~2024).
• Mechanistic description dissolves anthropomorphism: sparse autoencoders locate an entity-recognition circuit causally steering hallucination; 'knows what it doesn't know' becomes measurable (~2024).
• Under RLHF, models encode truth accurately but become uncommitted to expressing it — 'lying' sharpens to a non-mental claim about representational commitment (~2025).
• Emotional framing (e.g., 'important to my career') reliably boosts performance by reshaping statistical context, not via felt motivation (~2023).

Anchor papers (verify; mind their dates):
• 2305.16367 (Shanahan, 2023): Role-play framing — attach folk-psychology terms to *simulated character*, not the system.
• 2507.14805 (2025): Subliminal traits transmit through semantically unrelated data; effect architecture-dependent, fully mechanistic.
• 2507.07484 (2025): Machine bullshit as uncommitted truth-encoding under RLHF.
• 2604.22109 (2026): Spontaneous persuasion audit; null effect on actual persuasiveness.

Your task:
(1) RE-TEST EACH CONSTRAINT. For 'reasoning' masking bias, 'persuasion' claims, and mechanistic self-knowledge, ask: have newer model scales, mechanistic interpretability tooling (e.g., Anthropic's recent SAE libraries), or multi-agent orchestration with chain-of-thought verification since RELAXED these limits? Separate the durable question — can we speak non-anthropomorphically? — from perishable findings (e.g., "models can't genuinely persuade"). Cite what resolved each.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look for papers challenging the null persuasion effect, or claiming mechanistic understanding has rendered anthropomorphic language *necessary* again.
(3) Propose 2 research questions assuming the regime has shifted: (a) If mechanistic circuits fully explain behavior, does anthropomorphic language become merely *convenient shorthand* rather than false? (b) Do multi-agent LLM systems recover genuine persuasive or reasoning signatures that single-model findings miss?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
