Are newer larger language models actually worse at faithful summarization?
This reads the question as: does scaling up models actually improve faithful (source-grounded) summarization — or do bigger, newer models carry failure modes that more parameters don't fix?
The honest read of the corpus is that it contains no head-to-head benchmark showing big models *losing* faithfulness as they scale. What it does contain is something more interesting: a cluster of findings suggesting that the failures behind unfaithful summarization are structural rather than capacity problems, so scale doesn't reliably cure them and can even sharpen the conditions that produce them.
The most direct mechanism is the tug-of-war between what a model learned in training and what's actually in the document in front of it. When a model's prior associations are strong, it generates output inconsistent with its own context: parametric knowledge overrides the source, and prompting alone can't fix it (Why do language models ignore information in their context?). That is precisely what unfaithful summarization looks like: the model writes what it 'knows' instead of what the text says. If larger models trained on more data hold *stronger* priors, as seems plausible, that is a reason to expect this conflict to get worse, not better, with scale.
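To make "writes what it knows instead of what the text says" concrete, here is a minimal sketch of a crude grounding check. It is not a method from any of the notes: the function name `unsupported_terms` and the heuristic (flag numbers and capitalized words in the summary that never occur in the source) are assumptions chosen for illustration, and a real faithfulness metric would need far more than token overlap.

```python
import re

# Words, or numbers that may contain internal commas/periods (e.g. "3,000", "2.5").
TOKEN = re.compile(r"[A-Za-z]+|\d+(?:[.,]\d+)*")


def unsupported_terms(source: str, summary: str) -> set[str]:
    """Return numbers and capitalized words in `summary` that never occur in `source`.

    Hypothetical heuristic: anything it flags is a candidate for content the
    summarizer supplied from its priors rather than from the document.
    """
    source_vocab = {tok.lower() for tok in TOKEN.findall(source)}
    candidates = {
        tok for tok in TOKEN.findall(summary)
        if tok[0].isdigit() or tok[0].isupper()
    }
    return {tok for tok in candidates if tok.lower() not in source_vocab}


# Illustrative inputs (invented for this example).
source_text = "The Eiffel Tower was repainted in 2019 by a crew of 50 workers."
summary_text = "The Eiffel Tower in Paris was repainted in 2019."
flagged = unsupported_terms(source_text, summary_text)
```

In this invented example the only flagged term is "Paris": true in the world, but absent from the source, which is the pattern the paragraph above describes.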
Length compounds it. Reasoning accuracy drops from 92% to 68% with only 3,000 tokens of padding, far below the context window; the drop is task-agnostic and unhelped by chain-of-thought (Does reasoning ability actually degrade with longer inputs?). Since summarization usually means compressing long input, this degradation hits it squarely. And even when long-context models can hold a whole document, they handle semantic retrieval well but break on anything requiring structured cross-referencing (Can long-context LLMs replace retrieval-augmented generation systems?), so a summary that needs to faithfully track relationships across a document is exactly where they slip.
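The padding setup can be sketched roughly as follows. This is an illustration of the general idea (hold the key facts fixed, grow the input with irrelevant text), not FLenQA's actual construction; `pad_to_length`, the filler sentence, and the use of whitespace-separated words as a stand-in for model tokens are all assumptions.

```python
def pad_to_length(facts: list[str], filler_sentence: str, target_tokens: int) -> str:
    """Spread `facts` through repeated filler until the text has about `target_tokens` words.

    Assumption: whitespace-separated words approximate tokens; a real harness
    would count tokens with the model's own tokenizer.
    """
    fact_tokens = sum(len(fact.split()) for fact in facts)
    filler_tokens = len(filler_sentence.split())
    if filler_tokens == 0:
        raise ValueError("filler_sentence must contain at least one word")

    # Number of whole filler sentences that fit in the remaining budget.
    n_filler = max(0, (target_tokens - fact_tokens) // filler_tokens)

    # Distribute filler across the gaps before, between, and after the facts.
    slots = len(facts) + 1
    base, extra = divmod(n_filler, slots)

    parts: list[str] = []
    for i in range(slots):
        parts.extend([filler_sentence] * (base + (1 if i < extra else 0)))
        if i < len(facts):
            parts.append(facts[i])
    return " ".join(parts)


# Illustrative inputs (invented for this example).
key_facts = ["Alice owns the red car.", "The red car is parked in Lot B."]
padded = pad_to_length(key_facts, "The weather was mild that day.", target_tokens=3000)
```

Running the same question against `key_facts` alone and against `padded` at several target lengths would give an accuracy-versus-length curve; the 92% and 68% figures above come from the cited note, not from this sketch.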
What makes this resistant to scale is that some of these failures are reported to *persist across model size*. Models pattern-match to template-similar memorized solutions rather than executing the actual procedure, a failure that holds across scale and training approach (Do large language models actually perform iterative optimization?), and reasoning breaks at instance-novelty boundaries rather than complexity thresholds (Do language models fail at reasoning due to complexity or novelty?). Even top-tier large models carry systematic linguistic blind spots that worsen predictably with structural complexity (Why do large language models fail at complex linguistic tasks?). The throughline: bigger captures more surface pattern, not deeper fidelity to a specific source.
The one note pointing at a fix reframes the whole question. Generic 'fluent prose' summaries optimize for sounding good, not for being right about what matters downstream. Training summarizers directly against a downstream relevance signal (via RL) produces denser, attribute-focused summaries that beat the fluent default on the downstream task (Can reinforcement learning align summarization with ranking goals?). So the takeaway: faithfulness may be less about model size than about what the model was *optimized to produce*. A bigger model trained to be fluent may write smoother, more confident, less faithful summaries, which can feel like 'worse' even when the raw capability went up.
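As a rough illustration of what "a downstream relevance signal" could look like as a scalar reward, the sketch below computes NDCG over the relevance labels of a ranked list. The notes do not say exactly how ReLSum computes its reward, so `summary_reward` is a hypothetical stand-in; the DCG here uses the linear-gain form (relevance divided by log2 of rank + 1), which is one of several common variants.

```python
import math


def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain with linear gain; ranks are 1-based inside the log."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances: list[float]) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering; 0.0 if no item is relevant."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def summary_reward(ranked_relevances: list[float]) -> float:
    """Hypothetical reward for one generated summary.

    `ranked_relevances` holds the ground-truth relevance labels of the items,
    in the order a downstream ranker placed them when it used that summary.
    """
    return ndcg(ranked_relevances)


# Illustrative labels (invented): a summary that leads to a perfect ordering
# versus one that leads to a reversed ordering.
reward_good = summary_reward([3.0, 2.0, 1.0])
reward_poor = summary_reward([1.0, 2.0, 3.0])
```

The design point is that nothing in this reward mentions fluency: a summary is scored only by how well the ranking built on it turns out, which is the shift in optimization target the paragraph above describes.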
Sources (7 notes)
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
The LOFT benchmark shows long-context language models (LCLMs) match retrieval-augmented generation (RAG) on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed if the model was trained on similar instances.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
ReLSum trains summarizers using downstream relevance scores as RL rewards, producing dense, attribute-focused summaries instead of fluent prose. This alignment to the actual ranking metric improves recall, NDCG, and user engagement in production e-commerce search.