INQUIRING LINE

Why are documents read but not cited harder distractors than random samples?

This explores why 'hard negatives' built from papers an author actually read but chose not to cite teach a model more than randomly drawn papers — and why that same property makes them a tougher discrimination problem.


The short version: a random sample is almost always off-topic, so a model can reject it on surface features alone; it learns *topic*, not *judgment*. A read-but-not-cited document is on-topic by construction: someone with expertise looked at it and decided it didn't belong. To separate it from the cited papers, a model has to learn the thing you actually care about, which is the fine-grained quality and relevance signal that survives after topical similarity is held constant. That's exactly the signal behind learning 'scientific taste' from community citation feedback, where models trained on citation-matched paper pairs learn to predict impact as a capability distinct from execution skill (Can models learn what makes research worth doing?).
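To make the "random negatives teach almost nothing" point concrete, here is a minimal sketch using a standard contrastive (InfoNCE) loss. It is an illustration, not the training setup of any cited note: the function name `info_nce`, the temperature of 0.1, and every similarity value are assumptions chosen for the example.

```python
import math

def info_nce(pos_sim, neg_sims, temperature=0.1):
    """InfoNCE loss for one query, given cosine similarities to candidates.

    pos_sim: similarity to the positive (here: a cited paper).
    neg_sims: similarities to the negatives (distractors).
    """
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability assigned to the positive.
    return log_denominator - logits[0]

# Hypothetical similarities between a citing paper and its candidates.
cited_sim = 0.80
random_negs = [0.05, 0.10, 0.00]          # off-topic: far from the query
read_not_cited_negs = [0.75, 0.70, 0.78]  # on-topic: close to the cited paper

loss_random = info_nce(cited_sim, random_negs)
loss_hard = info_nce(cited_sim, read_not_cited_negs)

print(f"random negatives:         {loss_random:.4f}")
print(f"read-but-not-cited negs:  {loss_hard:.4f}")
```

With these made-up numbers the loss on random negatives is close to zero, so there is almost no gradient left to learn from, while the read-but-not-cited negatives produce a loss hundreds of times larger. The training signal comes almost entirely from the candidates that sit near the positive.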

The catch is that the harder the negatives are, the more they stress the geometry of the model doing the discriminating. Training dense retrievers with structure-targeted hard negatives consistently *degrades* zero-shot generalization (an 8-40% drop in nDCG@10) while only partially improving the targeted discrimination, because in high-dimensional cosine space, sharpening one fine distinction warps the rest of the space (Does training for compositional sensitivity hurt dense retrieval?). So 'read-but-not-cited' negatives are valuable precisely because they sit near the decision boundary, but that same nearness is what compressed vector representations struggle to honor. The difficulty isn't a bug in the data; it's the whole point, and it pushes against the limits of the representation.

This is the same wall that shows up in matching tasks: a pooled-cosine recall stage cannot reliably reject 'structural near-misses' — candidates that look right in aggregate but fail on the details — and you have to add a separate verifier that operates on full token-to-token interaction patterns rather than a single squashed vector (Can verification separate structural near-misses from topical matches?). Read-but-not-cited documents are the near-misses of the citation world. Telling them apart is a downstream verification problem, not a similarity problem, which is why throwing them in as negatives is both more informative and more demanding.
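A toy sketch of why a single pooled vector cannot see a structural near-miss, while the token-to-token similarity map still can. Everything here is an assumption for illustration: the three-dimensional one-hot embeddings, the example phrases, and in particular `aligned_score`, which is a hand-written stand-in for the small learned verifier described in the note, not that verifier itself.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mean_pool(tokens):
    """Collapse a token sequence into one vector (order is lost here)."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]

def similarity_map(a_tokens, b_tokens):
    """Full token-to-token cosine matrix between two sequences."""
    return [[cosine(a, b) for b in b_tokens] for a in a_tokens]

# Hypothetical one-hot token embeddings.
EMB = {"dog": [1.0, 0.0, 0.0], "bites": [0.0, 1.0, 0.0], "man": [0.0, 0.0, 1.0]}
query = [EMB[w] for w in ["dog", "bites", "man"]]
near_miss = [EMB[w] for w in ["man", "bites", "dog"]]  # same words, roles swapped

# Stage 1: pooled cosine. Both sequences pool to the same vector.
pooled_score = cosine(mean_pool(query), mean_pool(near_miss))

# MaxSim-style late interaction: best match per query token, averaged.
sim_map = similarity_map(query, near_miss)
maxsim_score = sum(max(row) for row in sim_map) / len(sim_map)

# Stand-in verifier: does each token match the token in the SAME position?
aligned_score = sum(sim_map[i][i] for i in range(len(query))) / len(query)

print(pooled_score, maxsim_score, aligned_score)
```

In this toy case the pooled score and the MaxSim-style score both rate the near-miss as a perfect match, because each only asks whether the same tokens are present somewhere. The position-aware score drops to one third, since only the middle token lines up. A learned verifier would pick up far subtler patterns than a diagonal, but the information it needs lives in the map, not in the pooled vector.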

There's a learner's twist worth surfacing here, because the difficulty cuts both ways depending on who's judging. The reason on-topic-but-rejected items are hard for a *model* is the same reason they're slippery for *human and LLM evaluators*: when topical content is matched, judgment collapses onto shallow cues. LLM judges fall for authority signals and rich formatting independent of substance (Can LLM judges be fooled by fake credentials and formatting?), and users prefer responses with more citations even when those citations are irrelevant, so citation count acts as a trust heuristic decoupled from relevance (Do users trust citations more when there are simply more of them?). The expert who read a paper and declined to cite it is doing the hard thing these systems fail at: discriminating on relevance after surface plausibility is already satisfied.

The payoff for using these negatives, when it works, mirrors what selective retrieval buys elsewhere: a system that learns *when* a piece of evidence actually earns its place — modeled as a per-step decision to admit or reject knowledge — eliminates noise from unnecessary inclusions and gains over indiscriminate retrieval (When should language models retrieve external knowledge versus use internal knowledge?). A random negative never teaches that admit/reject judgment because the answer is obvious. A read-but-not-cited negative teaches it because the answer was hard enough that a human had to make a call.


Sources (6 notes)

Can models learn what makes research worth doing?

Reinforcement learning trained on 700K citation-matched paper pairs successfully teaches models to predict research impact better than GPT-5.2 and generate higher-impact research ideas. Scientific taste emerges as a community-aligned capability distinct from execution skills.

Does training for compositional sensitivity hurt dense retrieval?

Adding structure-targeted negatives to dense retrieval training consistently degrades zero-shot performance (8-40% nDCG@10 drop) while only partially improving compositional discrimination. This is a geometric trade-off in high-dimensional cosine spaces, not a tuning problem.

Can verification separate structural near-misses from topical matches?

A two-stage pipeline—pooled-cosine recall followed by a small Transformer verifier operating on token-token similarity maps—reliably rejects structural near-misses that MaxSim-style late interaction cannot. The verifier succeeds because it operates on full token interaction patterns rather than compressed vectors.

Can LLM judges be fooled by fake credentials and formatting?

Research identified four evaluation biases in LLM judges, with authority and beauty biases being semantics-agnostic and trivially exploitable through fake references and formatting—zero-shot attacks requiring no model access or optimization.

Do users trust citations more when there are simply more of them?

Analysis of 24,000 Search Arena interactions shows irrelevant citations boost user preference (β=0.273) nearly as much as relevant citations (β=0.285), indicating citation count functions as a decoupled trust heuristic.

When should language models retrieve external knowledge versus use internal knowledge?

DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM research analyst. The question: Why do documents that were read but not cited function as harder, more informative training distractors than random negative samples—and does this constraint still hold under current retriever and verifier architectures?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable snapshots:
• Random negatives are off-topic, so models learn only surface-level topic rejection; read-but-not-cited documents force fine-grained relevance discrimination after topical match is already satisfied (~2026).
• Training dense retrievers on hard (near-boundary) negatives causes 8–40% zero-shot generalization degradation in high-dimensional cosine space, because sharpening one distinction warps the rest (~2026).
• LLM judges and human evaluators collapse onto shallow cues (authority, citation count, formatting) when topical plausibility is matched, just as models do (~2024).
• Selective retrieval—modeling per-step admit/reject decisions—eliminates noise from indiscriminate inclusion; random negatives teach no such judgment because rejection is obvious (~2024).
• Verifier-style architectures operating on full token-to-token patterns outperform single squashed vectors for structural near-miss rejection (~2024).

Anchor papers (verify; mind their dates):
• arXiv:2604.16351 (2026-03): Training for Compositional Sensitivity Reduces Dense Retrieval Generalization
• arXiv:2603.14473 (2026-03): AI Can Learn Scientific Taste
• arXiv:2402.10669 (2024-02): Humans or LLMs as the Judge?
• arXiv:2511.18659 (2025-11): CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning

Your task:
(1) RE-TEST each constraint. For dense retriever geometry: has multi-stage ranking (retriever + verifier + ranker), retriever scaling, or modern sparse-dense hybrids (BM25 + learned fusion, Hybrid RAG, multi-vector indexing) since 2026 relaxed the 8–40% generalization drop? Does selective retrieval (per-step reasoning) now reliably learn admit/reject on hard negatives, or does it still collapse on shallow signals? Separate the durable insight (hard negatives *should* teach finer discrimination) from the perishable limitation (cosine space cannot honor near-misses).
(2) Surface the strongest recent work (last ~6 months post-2026-03) that either contradicts the geometry argument or shows verifiers/rerankers have overcome it. Cite arXiv IDs.
(3) Propose 2 research questions assuming the regime may have shifted: (a) Can modern multi-stage pipelines (e.g., hybrid retrieval + neural reranker + LLM verifier) learn to exploit read-but-not-cited negatives without the generalization penalty? (b) Do contrastive fine-tuning methods on read-but-not-cited pairs now outperform or underperform newer approaches like in-context learning or chain-of-thought ranking?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
