Why does training data saliency distort how models judge meaning?
This explores how the sheer statistical weight of frequent or strongly-associated training patterns can override actual meaning when a model decides what a text 'says' — and what that reveals about whether models judge meaning at all.
The statistical mass of training data, meaning how often a phrasing or association appeared, can crowd out meaning when a model evaluates text. The clearest evidence is direct: models systematically prefer high-frequency surface forms over rarer but semantically identical paraphrases, and this bias holds across math, translation, commonsense reasoning, and tool calling (see: Do language models really understand meaning or just surface frequency?). That consistency is the tell: it suggests the model is tracking how much statistical weight a form carries from pretraining, not recognizing what it means.
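One way to make the frequency preference concrete is to count how often a model scores the frequent form above its rare paraphrase. The sketch below is illustrative only: the `ParaphrasePair` type, the use of log-probabilities as the score, and the function name are assumptions, not the method used in the cited work.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ParaphrasePair:
    """Two semantically equivalent phrasings plus hypothetical model scores."""
    frequent_form: str
    rare_form: str
    frequent_logprob: float  # assumed: model log-probability of the frequent form
    rare_logprob: float      # assumed: model log-probability of the rare form


def frequent_preference_rate(pairs: Sequence[ParaphrasePair]) -> float:
    """Fraction of pairs where the frequent form scores strictly higher."""
    if not pairs:
        raise ValueError("need at least one paraphrase pair")
    wins = sum(1 for p in pairs if p.frequent_logprob > p.rare_logprob)
    return wins / len(pairs)
```

If meaning alone drove the scores, equivalent paraphrases would split roughly evenly, so a rate well above 0.5 across unrelated tasks would be consistent with the bias described above. The scores themselves would have to come from a real model; none are supplied here.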
The same dynamic shows up as a tug-of-war between what's in front of the model and what it learned. When prior training associations are strong enough, models generate outputs that contradict their own context, and you can't fix this by prompting harder; it takes intervening directly in the model's internal representations (see: Why do language models ignore information in their context?). Saliency, in other words, isn't a surface quirk you can argue the model out of. There's even a measurable threshold to it: how strongly a keyword gets primed after training is predictable from its probability beforehand, with a sharp cutoff around one in a thousand (about 10^-3) separating cases where priming occurs from those where it doesn't, and as few as three training exposures are enough to establish the effect (see: Can we predict keyword priming before learning happens?).
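The threshold idea can be written as a simple check on a keyword's pre-learning probability. This is a minimal sketch under stated assumptions: the text gives only the approximate cutoff, not which side of it primes, so the direction used here (priming on the low-probability side) is an assumption, and the function name is hypothetical.

```python
PRIMING_THRESHOLD = 1e-3  # approximate cutoff reported in the text


def priming_expected(pre_learning_prob: float,
                     threshold: float = PRIMING_THRESHOLD) -> bool:
    """Guess whether a keyword falls on the priming side of the cutoff.

    ASSUMPTION: priming occurs when the keyword's probability before
    learning is below the threshold. The source text does not state the
    direction, so swap the comparison if the opposite holds.
    """
    if not 0.0 <= pre_learning_prob <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return pre_learning_prob < threshold
```

The useful point is that the check runs before any training happens: the only input is the probability the model already assigns to the keyword.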
Why would frequency dominate meaning in the first place? One camp argues it's structural: meaning requires linking expressions to communicative intent, and a system trained only on form-to-form prediction never has access to that, so it can only ever reconstruct statistical regularity (see: Can language models learn meaning from text patterns alone?). But the corpus doesn't let that conclusion sit unchallenged. Other work shows LLMs operationalize Saussure's *langue*: they compress the relational structure of language so well that fluent, situated generation needs no external referent (see: Can language models learn meaning without engaging the world?). Static embeddings, before attention runs, also encode genuine semantic content such as valence and concreteness (see: Do transformer static embeddings actually encode semantic meaning?). So the distortion may be less 'no meaning' and more 'real semantic signal getting drowned out by louder frequency signal.'
Where saliency does the most damage is in cases that demand holding more than one reading at once. Models fail badly at recognizing deliberate ambiguity: GPT-4 correctly disambiguates only 32% of cases, against 90% for humans, because it collapses to a single dominant interpretation instead of entertaining the alternatives (see: Can language models recognize when text is deliberately ambiguous?). That's saliency as a failure of plurality: the most-trained reading wins by default. It's worth contrasting with how humans interpret, where disagreement across social positions is irreducible and meaningful rather than noise to be averaged away (see: Why do readers interpret the same sentence so differently?). The less obvious point is that the distortion isn't only that models pick the frequent reading; it's that they can't even see that the other readings exist.
Sources (8 notes)
LLMs show a consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests that their primary mechanism is tracking statistical mass from pretraining rather than recognizing meaning.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.
Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.
Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.
Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.
AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.
Interpretation Modeling research shows that disagreement on socially embedded sentences reflects valid differences in reader perspective, not annotation failure. Structured human disagreement in NLI benchmarks confirms that interpretation distributions carry meaningful information.