Does reasoning ability actually degrade with longer inputs?
Explores whether modern language models can maintain reasoning performance when processing long contexts, and whether technical capacity translates to practical reasoning capability over extended text.
The FLenQA benchmark exposes a critical gap between technical context window capacity and actual reasoning capacity over long inputs. By embedding simple reasoning tasks (True/False questions that require integrating two pieces of information) within irrelevant padding text of varying lengths, the paper (Same Task, More Tokens) shows that reasoning accuracy drops from 0.92 to 0.68 at just 3000 tokens, far below any modern model's context window.
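The padding setup described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual construction code: the function name `build_padded_prompt`, the filler sentences, and the use of whitespace-separated words as a stand-in for tokens are all assumptions made for the example.

```python
import random

def build_padded_prompt(fact_a, fact_b, question, padding_sentences,
                        target_tokens, seed=0):
    """Embed two key facts inside irrelevant padding of a chosen length.

    Token counts are approximated by whitespace-separated words. That is an
    assumption of this sketch, not the tokenizer used by the benchmark.
    """
    sentences = [s for s in padding_sentences if s.split()]
    if not sentences:
        raise ValueError("need at least one non-empty padding sentence")
    rng = random.Random(seed)
    used = len(fact_a.split()) + len(fact_b.split())
    padding = []
    i = 0
    # Add filler sentences until the context reaches the target length.
    while used < target_tokens:
        sentence = sentences[i % len(sentences)]
        padding.append(sentence)
        used += len(sentence.split())
        i += 1
    # Pick two insertion points; sorting keeps fact_a ahead of fact_b.
    pos_a, pos_b = sorted(rng.randint(0, len(padding)) for _ in range(2))
    context = (padding[:pos_a] + [fact_a] + padding[pos_a:pos_b]
               + [fact_b] + padding[pos_b:])
    return " ".join(context) + "\n\nTrue or False: " + question

# Hypothetical filler and facts, for illustration only.
filler = ["The weather report mentioned light rain over the harbor.",
          "A committee met to discuss the annual budget."]
for n in (250, 1000, 3000):
    prompt = build_padded_prompt("Ana is older than Ben.",
                                 "Ben is older than Cal.",
                                 "Ana is older than Cal.",
                                 filler, n)
    print(n, len(prompt.split()))
```

The reasoning task is identical at every length; only the amount of surrounding filler changes, which is what lets an accuracy drop be attributed to length rather than to task difficulty.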
Three findings make this particularly concerning:
1. The degradation is insensitive to padding type and position. Regardless of whether padding text is similar or dissimilar to the reasoning content, and regardless of where the information pieces are embedded within the context, similar degradation trends appear. This suggests the failure stems less from content interference than from length itself, plausibly through attention dilution.
2. Next-word prediction performance is uncorrelated with reasoning performance. Models that maintain strong perplexity on long inputs still fail at reasoning over those inputs. This means language modeling benchmarks on long contexts are misleading indicators of actual long-context utility — a model can "understand" the text (predict tokens well) while failing to reason over it.
3. Chain-of-thought (CoT) prompting does not close the gap. CoT increases accuracy roughly uniformly across context lengths, so the length-induced drop stays about the same size. The likely reason the degradation persists under CoT is that the bottleneck lies in retrieving information from the context, not in reasoning over it once retrieved.
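The third finding is easy to check with arithmetic. The sketch below uses the two accuracies quoted in this note (0.92 and 0.68); the CoT gain of 0.05 is a hypothetical value chosen only to illustrate a uniform shift, not a measured result, and the helper name `length_gap` is an assumption of the example.

```python
def length_gap(acc_short, acc_long):
    """Absolute and relative accuracy loss between short and long inputs."""
    absolute = acc_short - acc_long
    relative = absolute / acc_short
    return absolute, relative

# Accuracies quoted in this note: before padding vs. at about 3000 tokens.
base_short, base_long = 0.92, 0.68
abs_drop, rel_drop = length_gap(base_short, base_long)
print(f"absolute drop: {abs_drop:.2f}, relative drop: {rel_drop:.1%}")

# HYPOTHETICAL uniform CoT gain (illustrative, not a measured value).
cot_gain = 0.05
cot_abs_drop, _ = length_gap(base_short + cot_gain, base_long + cot_gain)
print(f"absolute drop with uniform CoT gain: {cot_abs_drop:.2f}")
```

A uniform gain raises both endpoints by the same amount, so the absolute gap between short and long inputs is unchanged. That is the sense in which CoT helps overall without addressing the length effect.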
This complements the failure described in "Why do language models fail at temporal reasoning in complex tasks?". That failure is about task complexity; this one is about input length and noise. Together they suggest a two-dimensional reliability surface: reasoning degrades with both task complexity and input length, and the two dimensions appear to be independent.
The implication for RAG systems is direct: retrieved documents add to input length, and if that length includes irrelevant passages (as it typically does), reasoning over the retrieved content degrades even when the relevant information is present. Static retrieval already falls short for the reasons given in "Why does vanilla RAG produce shallow and redundant results?", and length degradation explains part of that failure: more retrieved documents mean more padding, which means worse reasoning.
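How quickly retrieved context turns into padding can be sketched as follows. The `Passage` type, the `relevant` flag, and the example passages are hypothetical; a real retrieval system rarely has gold relevance labels, and words again stand in for tokens.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    relevant: bool  # hypothetical gold label, assumed for illustration

def padding_stats(passages):
    """Return (total_tokens, fraction of tokens in irrelevant passages)."""
    total = sum(len(p.text.split()) for p in passages)
    irrelevant = sum(len(p.text.split()) for p in passages if not p.relevant)
    return total, (irrelevant / total if total else 0.0)

relevant = Passage("The contract was signed in March 2019.", True)
distractor = Passage("Contracts are a common topic in business law courses.", False)

# One relevant passage plus a growing number of topically similar distractors.
for k in (1, 5, 20):
    retrieved = [relevant] + [distractor] * (k - 1)
    total, frac = padding_stats(retrieved)
    print(f"top-{k}: {total} tokens, {frac:.0%} irrelevant")
```

The relevant passage is present in every case, yet as k grows nearly all of the prompt becomes filler, which matches the padded condition that FLenQA tests.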
A complementary training-time finding complicates this picture. "Longer Context, Deeper Thinking" (2025) shows that models with stronger long-context capacity (128k vs 32k) consistently achieve higher accuracy on mathematical reasoning benchmarks (MATH500 and AIME), even when test-time inputs are short. Long-context training therefore benefits reasoning as a foundation, not just the processing of long inputs. The inference-time degradation documented in this note thus coexists with a training-time benefit, and the two findings are compatible: long-context training may improve base reasoning capability, while inference-time input length introduces the noise and distraction that degrade it. Source: Arxiv/Evaluations.
Inquiring lines that use this note as a source 145
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How can we measure whether assistance preserved the user's reasoning state?
- How do readers selectively hold frame-related words in mind?
- Can input augmentation and rephrasing compensate for smaller model limitations?
- How does the knowing-doing gap widen as tasks become more complex?
- How do transformers perform multi-hop reasoning across distant training documents?
- Does sentence-level granularity capture enough structure for complex reasoning tasks?
- Does irrelevant content degrade reasoning even when it fits the context window?
- How does SONAR embedding quality affect downstream reasoning accuracy?
- What distinguishes genuine reasoning activation from memorization-assisted answer recall?
- Can meaning-level metrics like Semantic Entropy avoid length bias?
- Can latent reasoning architectures work as retrofits to existing models?
- Can context compression preserve what matters without introducing bias?
- What happens to anaphoric reference when context exceeds the window?
- What makes a background condition relevant to a specific reasoning task?
- Why does long-form generation need different retrieval than factoid questions?
- Can manipulative prompts reduce reasoning model accuracy without fine-tuning?
- Why do longer queries benefit less from clarification questions?
- What makes active reasoning through dialogue harder than passive reasoning?
- Where do humans and language models actually diverge in reasoning ability?
- Can marginal hints integrate better into reasoning than comprehensive explanations?
- How does era sensitivity in legal cases compound with context length failures?
- Why do correct reasoning traces in language models tend to be shorter?
- How much does pre-training frequency predict reasoning task performance?
- Do tokens beyond a critical threshold actually improve reasoning quality?
- Can concise reasoning traces match verbose explanation accuracy?
- Are larger models and search access substitutes for factual accuracy?
- Why does training data format shape reasoning strategy more than domain content?
- Why do language models fail at pronouns across distant segments?
- Why do language models fail at coreference across long contexts?
- Does explicit reasoning help or hurt tasks requiring continuous nuanced judgment?
- Why do large language models fail at temporal reasoning in complex legal cases?
- When should an LLM engage extended reasoning versus responding directly?
- Does thinking-token overuse actually degrade reasoning accuracy in practice?
- Can episodic and semantic memory improve long-horizon task reasoning?
- Why do models automatically adjust reasoning length to problem difficulty?
- Why does explicit reasoning degrade passage reranking performance?
- What causes snowball errors to accumulate across reasoning steps in language models?
- How does implicit meaning processing limit LLM pragmatic reasoning?
- Does more inference compute help reasoning models match specialized domain performance?
- When does long-context LLM reasoning fail where structured retrieval succeeds?
- Can hierarchical entity extraction from books enable both textual and visual reasoning?
- Can long-context readers handle compositional tasks or just semantic search?
- Does irrelevant context degrade reasoning even within model context limits?
- How should iterative research tasks limit context per reasoning turn?
- Does filtering passages before generation improve large model answer quality?
- Why do temporal reasoning patterns matter more than final answers?
- Can extended reasoning training capture individual strategic thinking styles?
- How should reasoning prompts adapt based on question complexity and type?
- Do reasoning models trade instruction following for deliberative capability?
- Does more thinking always help large language models or sometimes hurt?
- How does random walk length control reasoning complexity in question generation?
- Does model scaling improve knowledge storage faster than reasoning ability?
- Why might latent reasoning capture types of thinking that verbalized CoT cannot?
- What neuroscience evidence suggests language networks are not optimized for reasoning?
- How can entailment benchmarks separate genuine reasoning from memorization effects?
- Why do longer reasoning chains signal hesitation rather than depth?
- Does reasoning structure match explicit versus implicit task demands?
- How do neural memory modules extend context length beyond attention limits?
- What cognitive abilities distinguish metalinguistic analysis from language use?
- Why does extended reasoning fail for search and knowledge retrieval tasks?
- Can long-context models handle compositional reasoning requiring structured logic?
- How does context complexity affect LLM performance on temporal reasoning tasks?
- How does the distance between natural language and formal notation affect translation accuracy?
- Could real-time search systems avoid era sensitivity in legal reasoning?
- Can benchmark performance distinguish surface from structural linguistic knowledge?
- What prompting strategies most effectively boost long-context LLM performance on retrieval?
- What structural properties define effective long chain-of-thought reasoning?
- Can derivational traces be distinguished from stylistic mimicry of reasoning?
- Can adding more words to a passage actually interfere with meaning?
- Does high knowledge density in text reduce user motivation to read more?
- How do smaller models respond to longer reflection prompts?
- Can post-thinking compute on memory reduce query-time reasoning costs?
- Can language models reason without relying on surface level pattern matching?
- What makes deductive reasoning so brittle in language models overall?
- How does structural complexity in sentences degrade LLM reasoning systematically?
- What makes specific-facet questions outperform generic need-rephrasing requests?
- Can latent reasoning mechanisms and recursive tracking mechanisms be combined effectively?
- Why do reasoning models fail when input length increases even below context limits?
- How do logic units preserve document structure better than fixed-size chunking?
- Why does chain-of-thought prompting fail to fix length-induced reasoning degradation?
- Can models trained on longer contexts develop better fundamental reasoning abilities?
- How do longer reasoning chains create vulnerability to attacks?
- Why do format and structure matter more than actual content in reasoning?
- Can dataset design systematically expand reasoning graph diameter?
- How does scaling reasoning capability actually reduce instruction-following ability?
- Why does attention quality degrade as context length increases?
- How do retrieval heads interact with layer-level separation of knowledge and reasoning?
- How does chain-of-thought length affect attention to constraint tokens?
- Why do longer reasoning chains correlate with lower accuracy in o1-like models?
- What changes when reasoning models adopt trajectory-response output formats?
- Why do current speech benchmarks fail to measure reasoning over audio?
- Why does premise ordering shift syllogistic reasoning performance by over 30 percent?
- How much reasoning depth do we actually need for most real-world tasks?
- Do shorter reasoning chains maintain instruction adherence better than longer ones?
- Can language models perform genuine symbolic reasoning without semantic grounding?
- What is the optimal balance between search rounds and reasoning depth per round?
- Why does representation recycling of MI-peak tokens improve reasoning accuracy?
- Why do longer context windows alone fail to capture temporal dynamics in dialogue?
- How much does schema bloat actually degrade reasoning in large language models?
- Can structured prompts reduce reasoning steps while improving financial accuracy?
- Can minimal reasoning steps match verbose reasoning accuracy?
- Why does concise reasoning maintain accuracy with far fewer tokens?
- Does more thinking always improve language model accuracy?
- Can operationalizing theory into prompt structure improve reasoning more than theory itself?
- Why does scheme classification require more cognitive load than identifying premises?
- Can simple structure perturbations reliably expose memorization in reasoning models?
- Why do concise reasoning chains match verbose chain-of-thought token efficiency?
- Do distributed relational tasks consistently underperform local classification across NLP domains?
- Do base models truly possess latent reasoning capability?
- Can cognitive scaffolding replace tool-based reasoning augmentation in language models?
- Why do reasoning tasks improve more than retrieval from lookup memory?
- How does separating local and global context dependencies affect long-context performance?
- How do prior errors in reasoning context amplify future mistakes?
- Does training data format shape reasoning strategy more than domain content?
- What causes reasoning quality to degrade during long research tasks?
- How do prior errors in context history amplify future mistakes in long tasks?
- Do pretrained language models carry reusable computational scaffolding for length handling?
- Can bounded workspaces prevent overthinking better than summarization alone?
- Why might rationales that predict common text patterns fail on hard novel reasoning?
- Why do long-context language models struggle with compositional reasoning tasks?
- Does chain-of-thought accuracy degrade with longer reasoning traces?
- When is numeric computation the real bottleneck versus reasoning depth?
- What quality filters distinguish useful reasoning enrichment from shallow repetition?
- Why do thinking models execute longer tasks than standard language models?
- Does sequence length affect sparsity tolerance the same way across task types?
- How do frontier models maintain agreement scores above 90 percent across reasoning tasks?
- Can auxiliary modules preserve reasoning without catastrophic forgetting?
- What kinds of reasoning tasks reveal the ceiling of text-only training?
- Can autoformalisation from natural language preserve semantic accuracy?
- What computational structures can actually scale serial reasoning depth?
- Why do fixed-size document chunks break complex procedural question answering?
- How much does training data format influence reasoning strategy versus domain content?
- Can standard next-token prediction capture complex multi-step human reasoning directly?
- Is reasoning failure caused by task complexity or training distribution gaps?
- Are newer larger language models actually worse at faithful summarization?
- Does recurrent memory or gist compression work better for ultra-long context?
- Can recurrent state mechanisms process longer sequences than attention-based working memory approaches?
- How do adaptive memory modules compare to feedback-based working memory for long context?
- Why does document perplexity stay low while question-answering accuracy drops?
- Why does attention concentrate on the first 25% of long input sequences?
- How does externalized state affect the long-context bottleneck in language models?
- How does reducing activation precision further extend context length?
- How do recurrent memory systems handle ultra-long context differently than attention?
- How does tool-based reasoning expand what language models can do?
- How does evaluation setting affect measured reasoning capabilities in language models?
Related concepts in this collection 5
- Why do language models fail at temporal reasoning in complex tasks?
  Language models correctly answer simple temporal questions but produce logically impossible timelines in complex legal documents. This explores what task features trigger reasoning failures and whether the competence is genuinely lost or masked by surface-level patterns.
  complementary failure axis: task complexity vs input length
- Why does vanilla RAG produce shallow and redundant results?
  Standard RAG systems get stuck in a single semantic neighborhood because their initial query determines what documents are discoverable. The question asks whether fixed retrieval strategies fundamentally limit knowledge depth compared to iterative exploration.
  RAG retrieval adds length; length degrades reasoning
- Does more thinking time actually improve LLM reasoning?
  The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?
  another dimension where "more" (tokens) ≠ "better" (reasoning)
- Can long-context models resolve retriever-reader imbalance?
  Traditional RAG systems force retrievers to find precise passages because readers had small context windows. Do modern long-context LLMs change what architecture makes sense?
  challenges the long-context solution: reader burden increases with length but reasoning degrades
- Do vector embeddings actually measure task relevance?
  Vector embeddings rank semantic similarity, but RAG systems need topical relevance. When these diverge, as with king/queen versus king/ruler, does similarity-based retrieval fail in production?
  compounds the length problem: semantic retrieval returns associated-but-irrelevant documents, creating exactly the irrelevant padding that FLenQA shows degrades reasoning; imprecise retrieval directly produces the input-length degradation
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models
- Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning
- Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
- When More is Less: Understanding Chain-of-Thought Length in LLMs
- On the Reasoning Capacity of AI Models and How to Quantify It
- Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and correctness in LLMs
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs
Original note title
reasoning performance degrades with input length even far below context window limits