INQUIRING LINE

When people rate items in sequence, does each score get quietly pulled by whatever they just rated before?

What anchoring effects shape how users rate items in sequence?

This reads the question as being about sequence-dependent rating bias — the way an item's score gets pulled by what came just before it, or by prior expectations the reader brings in — rather than about ranking algorithms per se.

This explores anchoring in the human sense: when people rate items one after another, does the order itself bend the numbers? The corpus doesn't have a single paper that runs the classic anchoring experiment, but it circles the same territory from several angles worth stitching together. The sharpest entry point is the finding that not every rating measures a stable inner preference at all. One line of work shows that annotation responses decompose into three different things (genuine preferences, non-attitudes, and preferences *constructed on the spot*), and you can tell them apart by whether they stay consistent when the measurement conditions change (see "Do all annotation responses measure the same underlying thing?"). That constructed-in-the-moment category is exactly where anchoring lives: if a rating is built fresh each time, the surrounding context, including the previous item, is part of what builds it.

The second thread is about priors. Whether connected products converge to similar ratings or diverge depends on the *type* of recommendation link between them: "frequently bought together" and "co-viewed" networks pull ratings in different directions, because each surfaces products to a different audience carrying different expectations (see "Do different recommender types shape opinion convergence differently?"). So the anchor isn't only the last thing you saw; it's the expectation the system primed you with before you even arrived. Scaled up, feeds become persuasion infrastructure where these priming and contamination effects compound across a whole population (see "How do recommendation feeds shape what people see and believe?").

There's also a clean example of a non-content cue hijacking judgment: people rate AI responses higher when there are simply *more* citations, even when those citations are irrelevant. Citation count works as a decoupled trust heuristic, almost as strong when the sources are useless as when they're real (see "Do users trust citations more when there are simply more of them?"). That's anchoring by a surface feature rather than by position, and it's a reminder that ratings latch onto whatever salient signal is cheapest to read.
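To make "almost as strong" concrete, here is a small arithmetic sketch using the two coefficients quoted in the source note for this finding (β=0.273 for irrelevant citations, β=0.285 for relevant ones). The ratio is plain arithmetic. The odds-multiplier reading is an assumption: it only holds if the coefficients are log-odds from a logistic-style preference model, which the note does not state.

```python
import math

# Coefficients as quoted in the source note.
beta_irrelevant = 0.273
beta_relevant = 0.285

# How much of the "relevant citation" effect an irrelevant citation retains.
ratio = beta_irrelevant / beta_relevant

# ASSUMPTION: the betas are log-odds coefficients. Under that reading,
# exp(beta) is the multiplicative change in the odds of being preferred.
odds_irrelevant = math.exp(beta_irrelevant)
odds_relevant = math.exp(beta_relevant)

print(f"irrelevant/relevant effect ratio: {ratio:.3f}")
print(f"odds multiplier (irrelevant): {odds_irrelevant:.3f}")
print(f"odds multiplier (relevant):   {odds_relevant:.3f}")
```

On either reading the two effects are close, which is what makes the cue "decoupled" from content quality.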

On the machine side, sequence order turns out to be a latent variable that's easy to ignore and easy to recover. Language models doing ranking disregard the temporal order of a user's history by default, but recency-focused prompts switch the sensitivity back on (see "Why do language models ignore temporal order in ranking?"), and recency-weighted recall beats similarity-weighted recall when summarizing what a user actually prefers (see "Does abstract preference knowledge outperform specific interaction recall?"). Recency is the algorithmic cousin of a recency anchor: the most recent item gets disproportionate weight unless you deliberately correct for it. Even in conversational recommendation, the *order* in which items get mentioned carries dependency information that bag-of-mentions models throw away (see "Does conversation order matter for recommending items in dialogue?").
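As a rough illustration of what "recency-weighted recall" can mean, the sketch below assigns exponentially decaying weights to a user history and keeps the most heavily weighted items. This is not the method of any paper cited here; the function names, the `half_life` parameter, and the toy history are all hypothetical. A similarity-weighted variant would swap the decay weights for similarity scores against the current query.

```python
def recency_weights(n_items, half_life=3.0):
    """Exponential-decay weights for a history ordered oldest -> newest.

    half_life is a hypothetical tuning knob: an item that is half_life
    steps older than the newest one gets half the newest item's weight.
    Weights are normalized to sum to 1.
    """
    raw = [0.5 ** ((n_items - 1 - i) / half_life) for i in range(n_items)]
    total = sum(raw)
    return [w / total for w in raw]


def top_k_by_weight(history, weights, k):
    """Return the k highest-weighted history items, kept in original order."""
    ranked = sorted(range(len(history)), key=lambda i: weights[i], reverse=True)[:k]
    return [history[i] for i in sorted(ranked)]


# Toy history, oldest first (illustrative data only).
history = ["western", "noir", "noir", "sci-fi", "sci-fi", "documentary"]
weights = recency_weights(len(history))
recent = top_k_by_weight(history, weights, k=3)
print(recent)
```

The same weights also show the anchoring risk: with a short half-life, the last item dominates whatever summary is built from the history.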

The honest synthesis: the collection has strong material on the *ingredients* of anchoring — constructed preferences, priming by prior expectation, surface-cue heuristics, recency weighting, and order-dependence — but no study that isolates a numeric anchoring effect in human sequential ratings directly. If that exact effect is what you're chasing, the constructed-preference and recommender-convergence pieces are the closest doorways, and they suggest the more useful question isn't "is there an anchor" but "which signal is the anchor borrowing its weight from."
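For readers who want to check for the numeric effect directly, one minimal sketch is a lag-1 regression: subtract each item's baseline score from a rater's score, then regress each residual on the residual of the item rated just before it. This is an illustration under stated assumptions, not a method from the corpus. It assumes presentation order was randomized across raters (otherwise item order confounds the slope), and a nonzero slope can also reflect rater drift or mood rather than anchoring. The function name and data layout are hypothetical.

```python
from statistics import mean


def lag1_slope(ratings, item_means):
    """OLS slope of the current residual on the previous residual.

    ratings:    list of (item_id, score) in the order one rater saw them.
    item_means: dict item_id -> baseline mean score across raters.
    A positive slope suggests assimilation toward the previous rating,
    a negative slope suggests contrast away from it.
    """
    resid = [score - item_means[item] for item, score in ratings]
    prev, curr = resid[:-1], resid[1:]
    mean_prev, mean_curr = mean(prev), mean(curr)
    var_prev = sum((p - mean_prev) ** 2 for p in prev)
    if var_prev == 0:
        raise ValueError("previous residuals have no variance")
    cov = sum((p - mean_prev) * (c - mean_curr) for p, c in zip(prev, curr))
    return cov / var_prev


# Synthetic example: every item has baseline 3.0, and each residual is
# constructed as exactly half of the previous one.
item_means = {"a": 3.0, "b": 3.0, "c": 3.0, "d": 3.0}
ratings = [("a", 4.0), ("b", 3.5), ("c", 3.25), ("d", 3.125)]
slope = lag1_slope(ratings, item_means)
print(slope)  # 0.5 by construction
```

In practice the slope would be pooled across many raters, and a mixed model with rater-level effects would be a more defensible choice than this single-sequence estimate.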

Sources (7 notes)

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Do different recommender types shape opinion convergence differently?

Research shows that frequently-bought-together and co-viewed recommendation networks produce different opinion convergence patterns. The mechanism: each recommender type attracts different audience segments with different prior expectations, shaping both who sees products together and how they rate them.

How do recommendation feeds shape what people see and believe?

Research shows recommendation systems operate as political actors: feed weights influence producer behavior, network topology drives opinion convergence, and automation enables targeted persuasion at population scale. These effects compound through rating contamination and selection biases.

Do users trust citations more when there are simply more of them?

Analysis of 24,000 Search Arena interactions shows irrelevant citations boost user preference (β=0.273) nearly as much as relevant citations (β=0.285), indicating citation count functions as a decoupled trust heuristic.

Why do language models ignore temporal order in ranking?

LLMs can extract preferences from interaction histories but disregard temporal order by default. Recency-focused prompts and in-context examples activate latent order-sensitivity, improving ranking without retraining.

Does abstract preference knowledge outperform specific interaction recall?

PRIME framework shows semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning exceeds preference tuning methods.

Does conversation order matter for recommending items in dialogue?

TSCR models items and entities in the order they appear in CRS dialogue, using transformers to learn dependencies between sequential mentions. This recovers information that bag-of-mentions approaches discard, improving recommendation accuracy on standard benchmarks.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Large Language Models are Zero-Shot Rankers for Recommender Systems (1.70 match · arXiv)
Preference Discerning with LLM-Enhanced Generative Retrieval (1.63 match · arXiv)
Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models (1.61 match · arXiv)
Calibrated Recommendations (1.59 match · arXiv)
A Probabilistic Model for Using Social Networks in Personalized Item Recommendation (1.58 match · arXiv)
Collaborative Filtering with Temporal Dynamics (1.58 match · arXiv)
From speaking like a person to being personal: The effects of personalized, regular interactions with conversational agents (1.56 match · arXiv)
Improving Conversational Recommender Systems via Transformer-based Sequential Modelling (0.93 match · arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing dated claims about anchoring in sequential rating tasks. The question remains open: *which contextual signals (recency, priming, surface cues, constructed preferences) actually bend human ratings, and how do LLMs replicate or diverge from those biases?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2020–2026; treat them as perishable constraints:

• Annotation responses decompose into three types: genuine preferences, non-attitudes, and preferences constructed on-the-spot. Constructed ratings are context-sensitive and vulnerable to anchoring (~2023–2025).
• Recommendation graph structure ("frequently bought together" vs. "co-viewed") shapes opinion convergence differently, priming users *before* they rate (~2023).
• Users rate AI responses higher when citation count is high, even if citations are irrelevant — a surface heuristic anchor decoupled from content (~2025).
• LLMs as zero-shot rankers ignore temporal order of user history by default; recency-focused prompts re-enable sequence sensitivity (~2023).
• Conversational recommendation systems discard order-dependence between items mentioned; sequential structure carries preference signal (~2023–2025).

Anchor papers (verify; mind their dates):
• arXiv:2305.08845 (2023-05): Large Language Models are Zero-Shot Rankers for Recommender Systems
• arXiv:2604.03238 (2026-01): Measuring Human Preferences in RLHF is a Social Science Problem
• arXiv:2507.13705 (2025-07): Consistent Explainers or Unreliable Narrators? Understanding LLM-generated Group Recommend
• arXiv:2507.04607 (2025-07): PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For each finding above, ask: have newer LLMs, in-context learning techniques (chain-of-thought, explicit recency weighting, memory architectures like PRIME), or multi-turn evaluation harnesses since RELAXED or OVERTURNED it? Separate the durable question ("do sequential signals shape preference formation?") from the perishable limitation ("LLMs default-ignore recency"). Cite what resolved it; flag constraints that still hold.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months. Does arXiv:2604.03238 or arXiv:2507.04607 challenge the "constructed preference" thesis? Does arXiv:2506.05334 (Search Arena) reveal new anchoring mechanisms via search-augmented LLMs?
(3) **Propose 2 research questions** that assume the regime may have moved: (a) Can explicit memory systems (e.g., PRIME's cognitive memory) *suppress* anchoring by decoupling episodic order from preference aggregation? (b) Do multi-agent or agentic recommendation loops *amplify* or *dampen* anchoring as they iterate?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
