SYNTHESIS NOTE
Reasoning, Retrieval, and Evaluation

Does reasoning ability actually degrade with longer inputs?

Explores whether modern language models can maintain reasoning performance when processing long contexts, and whether technical capacity translates to practical reasoning capability over extended text.

Synthesis note · 2026-02-22 · sourced from Reasoning Logic Internal Rules

The FLenQA benchmark exposes a critical gap between technical context window capacity and actual reasoning capacity over long inputs. By embedding simple reasoning tasks (True/False questions requiring integration of two information pieces) within irrelevant padding text of varying lengths, the paper shows that reasoning accuracy drops from 0.92 to 0.68 at just 3000 tokens — far below any modern model's context window.
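The construction described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the function name `build_padded_prompt`, its parameters, the use of whitespace-separated words as a stand-in for tokens, and the prompt wording are all assumptions made for the example.

```python
import random


def build_padded_prompt(fact_a, fact_b, question, padding_sentences,
                        target_words, seed=0):
    """Embed two key facts inside irrelevant padding of a chosen length.

    Hypothetical helper: word counts approximate token counts, and the
    padding sentences are cycled until the target length is reached.
    """
    rng = random.Random(seed)

    # Accumulate padding until the requested (approximate) length is reached.
    padding = []
    words = 0
    i = 0
    while words < target_words and padding_sentences:
        sentence = padding_sentences[i % len(padding_sentences)]
        padding.append(sentence)
        words += len(sentence.split())
        i += 1

    # Pick two insertion points, keeping fact_a before fact_b.
    pos_a = rng.randint(0, len(padding))
    pos_b = rng.randint(pos_a, len(padding))

    body = (padding[:pos_a] + [fact_a]
            + padding[pos_a:pos_b] + [fact_b]
            + padding[pos_b:])
    return " ".join(body) + "\n\nQuestion: " + question + " Answer True or False."
```

Calling the helper with the same two facts and question while varying only `target_words` gives a set of prompts that differ in length but not in the reasoning required, which is the comparison the benchmark relies on.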

Three findings make this particularly concerning:

1. The degradation is robust to padding type and fact position. Whether the padding text is similar or dissimilar to the reasoning content, and wherever the information pieces are embedded within the context, similar degradation trends appear. This suggests the failure stems less from content interference than from attention dilution over length.

2. Next-word prediction performance is uncorrelated with reasoning performance. Models that maintain strong perplexity on long inputs still fail at reasoning over those inputs. This means language modeling benchmarks on long contexts are misleading indicators of actual long-context utility — a model can "understand" the text (predict tokens well) while failing to reason over it.

3. CoT does not mitigate proportionally. Chain-of-thought prompting increases accuracy roughly uniformly across context lengths but does not close the length-induced gap. The degradation persists under CoT because the bottleneck is in information retrieval from context, not in reasoning over retrieved information.
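One way to check the third finding on one's own evaluation logs is to compare the length-induced gap with and without CoT. The sketch below is a hedged illustration: the record layout `(input_tokens, used_cot, is_correct)` and the function names `accuracy_by_length` and `length_gap` are hypothetical, not taken from the paper.

```python
from collections import defaultdict


def accuracy_by_length(records):
    """Group results by (input length, CoT on/off) and return accuracy per cell.

    `records` is an iterable of (input_tokens, used_cot, is_correct) tuples;
    this layout is an assumption made for the example.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [correct, seen]
    for input_tokens, used_cot, is_correct in records:
        cell = totals[(input_tokens, used_cot)]
        cell[0] += int(is_correct)
        cell[1] += 1
    return {key: correct / seen for key, (correct, seen) in totals.items()}


def length_gap(acc, short, long, used_cot):
    """Accuracy lost when moving from the short to the long input length."""
    return acc[(short, used_cot)] - acc[(long, used_cot)]
```

If CoT raises accuracy roughly uniformly, `length_gap(acc, short, long, True)` stays close to `length_gap(acc, short, long, False)` even though every individual cell is higher with CoT.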

This mechanism complements the one described in "Why do language models fail at temporal reasoning in complex tasks?". That failure is about task complexity; this one is about input noise. Together they define a two-dimensional reliability surface: reasoning degrades with both task complexity and input length, and the two dimensions are independent.

The implication for RAG systems is direct: retrieved documents add to input length, and if that length includes irrelevant passages (as it typically does), reasoning over the retrieved content degrades even when the relevant information is present. In connection with "Why does vanilla RAG produce shallow and redundant results?", the length degradation explains part of why static retrieval fails: more retrieved documents mean more padding, which means worse reasoning.
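To make the "retrieved documents as padding" point concrete, one can measure how much of a retrieved context is irrelevant to the question. The sketch below assumes relevance labels are available for each passage (for example from a labeled evaluation set) and again uses word counts as a rough proxy for tokens; the function name `irrelevant_fraction` is hypothetical.

```python
def irrelevant_fraction(passages):
    """Share of the retrieved context made up of irrelevant passages.

    `passages` is a list of (text, is_relevant) pairs; the relevance labels
    are assumed to come from an external annotation, not from this function.
    """
    total = sum(len(text.split()) for text, _ in passages)
    if total == 0:
        return 0.0
    irrelevant = sum(len(text.split())
                     for text, is_relevant in passages if not is_relevant)
    return irrelevant / total
```

Tracking this fraction as the number of retrieved passages grows shows how quickly a fixed amount of relevant evidence is surrounded by padding, which is the condition under which the note expects reasoning to degrade.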

A complementary training-time finding complicates this picture. "Longer Context, Deeper Thinking" (2025) shows that models with stronger long-context capacity (128k vs 32k) consistently achieve higher accuracy on mathematical reasoning benchmarks (MATH500 and AIME), even when test-time inputs are short. Long-context training benefits reasoning as a foundation, not just for processing long inputs. The inference-time degradation documented in this note therefore coexists with a training-time benefit, and the two findings are compatible: long-context training may improve base reasoning capability, while inference-time input length introduces the noise and distraction effects that degrade it. Source: Arxiv/Evaluations.

Original note title

reasoning performance degrades with input length even far below context window limits