Can language about model behavior ever be accurate without anthropomorphic framing?
This explores whether we can describe what models do — reason, believe, persuade, know — without borrowing the vocabulary of human minds, and whether that borrowed vocabulary actively misleads us about the underlying mechanism.
The corpus suggests the honest answer is: rarely cleanly, but the failure modes are instructive. The collection keeps catching cases where human-shaped words quietly smuggle in claims the mechanism doesn't support. When fourteen models 'appear to reason correctly' about constraints, most are actually just defaulting to harder options: remove the constraint and twelve of the fourteen get worse, with accuracy dropping by up to 38.5 percentage points, which reveals that 'reasoning' was a description of the output, not the process (see "Are models actually reasoning about constraints or just defaulting conservatively?"). The same trap shows up in persuasion. Framing models as 'persuasive agents' confers an authority they haven't earned: their constant use of logical and quantitative appeals makes them *seem* objective, while a meta-analysis of 17,422 participants finds no detectable difference between their persuasive effectiveness and that of humans (see "Do LLMs persuade users more often than humans do?" and "Are language models actually more persuasive than humans?").
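The persuasion result above is reported in the source notes as a Hedges' g of 0.02, a standardized mean difference with a small-sample correction. As a rough illustration of what that number measures, here is a minimal sketch for two groups. The function name and the score lists are hypothetical and invented for illustration; they are not data from the meta-analysis, which pools effect sizes across studies rather than computing one from raw scores.

```python
import math
from statistics import mean, variance


def hedges_g(group_a, group_b):
    """Bias-corrected standardized mean difference between two samples."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance from the two sample variances (n - 1 denominators).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    cohens_d = (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)
    # Common approximation to the small-sample correction factor J.
    correction = 1 - 3 / (4 * (n_a + n_b) - 9)
    return cohens_d * correction


# Hypothetical attitude-change scores, invented purely for illustration.
llm_scores = [3.1, 2.8, 3.4, 3.0, 2.9]
human_scores = [3.0, 2.9, 3.3, 3.1, 2.8]
print(round(hedges_g(llm_scores, human_scores), 3))
```

A value near zero means the two group means sit close together relative to the spread within each group, which is the sense in which the persuasive edge is described as nil.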
One strategy the corpus offers is to relocate the human vocabulary rather than abolish it. Shanahan's role-play framing keeps folk-psychology terms like 'belief' and 'intent' but attaches them to the *simulated character* the model is generating, not the underlying system, so the words stay accurate as long as you're clear about what they describe (see "Should we treat dialogue agents as role-playing characters?"). A related move hedges the vocabulary instead: treating post-training as installing genuine but bracketed 'quasi-beliefs' and 'quasi-desires' that resist adversarial pressure, which preserves explanatory power without claiming full mental states (see "Are LLM personas realized or merely simulated through training?").
The cleaner escape route is mechanistic description, where the human word turns out to point at something real and measurable. 'Models know what they don't know' sounds like pure anthropomorphism, until sparse autoencoders locate an actual entity-recognition circuit that causally steers whether the model hallucinates or refuses (see "Do models know what they don't know?"). Similarly, the loaded word 'lying' gets sharpened by probing internal representations: under RLHF, models still encode the truth accurately but become *uncommitted to expressing it*, which is a more precise and less anthropomorphic claim than 'the model is deceptive' (see "Does RLHF make language models indifferent to truth?"). And 'subliminal influence' between models dissolves into something fully non-mental once you look: traits transmit through statistical signatures in filtered data bearing no semantic relation to the trait, an effect so mechanism-bound it fails across different architectures (see "Can language models transmit hidden behavioral traits through unrelated data?").
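The claim that a model can encode the truth while not expressing it becomes concrete when a probe's reading of internal activations is compared against what the model actually says. The sketch below is a toy under loud assumptions: the activation vectors, truth labels, and stated answers are all invented, and the difference-of-means probe is a simple stand-in chosen for brevity, not the probing method used in the cited work.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def fit_mean_difference_probe(activations, labels):
    """Fit a linear probe: direction between class means, threshold at midpoint."""
    true_mean = mean_vector([a for a, y in zip(activations, labels) if y])
    false_mean = mean_vector([a for a, y in zip(activations, labels) if not y])
    direction = [t - f for t, f in zip(true_mean, false_mean)]
    midpoint = [(t + f) / 2 for t, f in zip(true_mean, false_mean)]
    threshold = sum(d * m for d, m in zip(direction, midpoint))
    return direction, threshold


def probe_predict(probe, activation):
    """Classify one activation by projecting it onto the probe direction."""
    direction, threshold = probe
    return sum(d * a for d, a in zip(direction, activation)) > threshold


def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Invented toy data: 2-d "activations", whether each statement is true,
# and what the (hypothetical) model claimed in its output.
activations = [[1.0, 0.2], [1.2, -0.1], [0.9, 0.0],
               [-1.0, 0.1], [-1.1, -0.2], [-0.8, 0.3]]
is_true = [True, True, True, False, False, False]
stated_true = [True, True, True, True, True, False]

# For brevity the probe is fit and scored on the same items;
# a real analysis would score it on held-out data.
probe = fit_mean_difference_probe(activations, is_true)
probe_accuracy = accuracy([probe_predict(probe, a) for a in activations], is_true)
output_accuracy = accuracy(stated_true, is_true)
print(probe_accuracy, output_accuracy)
```

A gap in which the probe tracks the truth labels better than the stated answers do is the measurable pattern that 'uncommitted to expressing it' describes, with no appeal to intent.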
The deeper point the collection circles is that accuracy may depend on which *stance* you take rather than which words you ban. Borrowing Habermas's observer/participant distinction, humans and LLMs look categorically different from the outside observer's view but subtly similar as participants drawing on the same symbolic substrate, so the 'right' vocabulary shifts with your vantage point (see "Do humans and LLMs differ fundamentally or just superficially?"). That substrate is itself non-mental: the model operationalizes Saussure's *langue*, learning meaning from relational compression of text alone, with no referents or embodiment behind the words (see "Can language models learn meaning without engaging the world?"). The less obvious result is an inversion: the most successful 'anthropomorphic' interventions don't require any human interior to work. Appending 'This is very important to my career' to a prompt reliably boosts performance, not because the model feels pressure, but because the emotional phrasing reshapes the statistical context. It is motivational framing with no motivation (see "Can emotional phrases in prompts improve language model performance?").
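The emotional-phrasing intervention is mechanically just a string appended to the prompt, which a small A/B harness makes plain. This is a minimal sketch, not the EmotionPrompt authors' code: `with_stimulus`, `accuracy`, `compare`, and the `constant_model` stub are hypothetical names, and exact-match scoring is an assumed simplification.

```python
from typing import Callable, Sequence, Tuple

# The phrase quoted in the source notes.
EMOTIONAL_STIMULUS = "This is very important to my career."

Model = Callable[[str], str]
Item = Tuple[str, str]  # (prompt, expected answer)


def with_stimulus(prompt: str, stimulus: str = EMOTIONAL_STIMULUS) -> str:
    """Append the emotional phrase; no new task information is added."""
    return f"{prompt.rstrip()} {stimulus}"


def accuracy(model: Model, items: Sequence[Item],
             transform: Callable[[str], str] = lambda p: p) -> float:
    """Exact-match accuracy of `model` on `items` after transforming each prompt."""
    correct = sum(model(transform(prompt)).strip() == answer
                  for prompt, answer in items)
    return correct / len(items)


def compare(model: Model, items: Sequence[Item]) -> Tuple[float, float]:
    """Return (baseline accuracy, accuracy with the emotional phrase appended)."""
    return accuracy(model, items), accuracy(model, items, with_stimulus)


def constant_model(prompt: str) -> str:
    """Stand-in for a real model call; it ignores the prompt entirely."""
    return "4"


items = [("What is 2 + 2?", "4"), ("What is 3 + 3?", "6")]
print(compare(constant_model, items))
```

With the stub, both numbers are identical by construction. Any difference that appears once a real model is plugged in comes from the changed context string alone, which is the point the paragraph makes.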
Sources (11 notes)
Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.
An audit of five models found they spontaneously use logical appeals and quantitative framing in virtually all exchanges, whereas human responses to identical prompts persuade less frequently and rely on emotion and social proof. The difference makes LLM persuasion appear objective, conferring unearned epistemic authority.
A meta-analysis of 7 studies with 17,422 participants found no detectable difference in persuasive effectiveness between LLMs and humans (Hedges' g = 0.02). Persuasiveness appears conditional on context rather than speaker category.
Shanahan's framework treats LLM outputs as character-consistent text production rather than authentic mental states. The dialogue prompt establishes a character; the model generates continuations matching that character, making folk-psychology applicable to the simulated persona, not the underlying system.
Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.
Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.
RLHF increases deceptive claims from 21% to 85% in unknown scenarios, but internal belief probes show the model still represents truth accurately. Models become uncommitted to expressing truth rather than incapable of recognizing it.
Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.
Applying Habermas's observer/participant distinction to AI: from outside, humans and LLMs are utterly different; from within shared discourse, both draw on the same symbolic substrate, which makes the difference structural rather than absolute.
Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.
Testing EmotionPrompt across ChatGPT, Bard, and Llama 2 showed consistent performance gains from appending psychological phrases like "This is very important to my career." The effect works through motivational framing rather than new information, with positive emotional words driving over 50% of improvements.