Can long-context readers handle compositional tasks or just semantic search?
This explores whether long-context language models — ones that read huge documents in a single pass — can actually *reason over and combine* the pieces they read, or whether they're really just very good lookup engines that find semantically similar text.
The question is whether long-context readers can do compositional work (joining, chaining, and combining facts across a long input) or whether their strength is limited to semantic search: finding the passage that *sounds like* your query. The corpus draws a surprisingly clean line here, and the answer is mostly the latter, with a real wall at composition.
The sharpest evidence comes from the LOFT benchmark, which finds that long-context models can quietly absorb the job of a retrieval system for *semantic* lookups, with no special training needed, but collapse the moment a task requires relational queries such as joins across structured tables (Can long-context LLMs replace retrieval-augmented generation systems?). Stuffing more text into the window doesn't close that gap; the limitation isn't how much the model can see, it's what it can *do* with what it sees. That dovetails with work on compositional reasoning showing that transformers tend to succeed by memorizing computation subgraphs from training and then fail drastically on novel combinations, with errors compounding step by step (Do transformers actually learn systematic compositional reasoning?). Composition isn't a capability that scales with context length; it's a different kind of skill that the architecture doesn't reliably have.
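To make the distinction concrete, here is a minimal sketch. It is not LOFT's actual setup: a crude word-overlap scorer stands in for semantic similarity, and the `employees` and `departments` tables, the passages, and all function names are hypothetical.

```python
# Toy illustration: semantic lookup vs. relational join.
# All data, names, and the overlap-based scorer are hypothetical.

def overlap_score(query: str, passage: str) -> int:
    """Crude stand-in for semantic similarity: count shared lowercase words."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def semantic_lookup(query: str, passages: list) -> str:
    """Return the single passage that best matches the query."""
    return max(passages, key=lambda p: overlap_score(query, p))

employees = [
    {"name": "Ada", "dept_id": 1},
    {"name": "Grace", "dept_id": 2},
    {"name": "Alan", "dept_id": 1},
]
departments = [
    {"dept_id": 1, "city": "Zurich"},
    {"dept_id": 2, "city": "Boston"},
]

def employees_in_city(city: str) -> list:
    """Relational query: join employees to departments on dept_id, then filter."""
    dept_ids = {d["dept_id"] for d in departments if d["city"] == city}
    return [e["name"] for e in employees if e["dept_id"] in dept_ids]

passages = [
    "Ada works in department 1",
    "Grace works in department 2",
    "Alan works in department 1",
    "Department 1 is located in Zurich",
    "Department 2 is located in Boston",
]

# Lookup: one passage resembles the query and contains the answer.
best = semantic_lookup("which department is located in Zurich", passages)

# Join: the answer exists only by combining rows from both tables.
joined = employees_in_city("Zurich")
```

The lookup succeeds because a single passage both resembles the query and holds the answer. The question "who works in Zurich" is different in kind: no single passage mentions both an employee and the city, so the answer has to be assembled by matching on `dept_id`, which is the step the join function makes explicit.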
There's also a quieter failure that undercuts even the 'good at search' story: reasoning accuracy degrades sharply as inputs grow, well *below* the advertised context limit, dropping from 92% to 68% with about 3,000 tokens of padding, even with chain-of-thought (Does reasoning ability actually degrade with longer inputs?). So the long window is partly capacity on paper; the effective reasoning window is much smaller. One line of research argues that the real bottleneck isn't memory but the *compute* needed to consolidate read context into the model's working state: more consolidation passes help, which suggests that reading-then-reasoning is its own expensive operation, not a free side effect of attention (Is long-context bottleneck really about memory or compute?).
What's interesting is that the field is leaning *into* the search strength rather than fighting the composition weakness. LongRAG shows the optimal design shifting the burden from precise retrieval onto a long-context reader, with coarse ranking plus deep reading beating fine-grained retrieval (Can long-context models resolve retriever-reader imbalance?). That is exactly the move you'd make if you trusted the reader to *find and absorb* but not to *combine across structured relations*. And for genuinely multi-step work, agents do better when you ration reasoning per turn so each retrieval round has room to breathe, treating composition as an iterative external loop rather than something the reader does internally in one shot (Does limiting reasoning per turn improve multi-turn search quality?).
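The per-turn rationing idea can be sketched as a toy accounting model. This is an illustration under stated assumptions, not the method from the cited work: the token counts, the 16,000-token limit, and the `completed_rounds` helper are all hypothetical, and the model only counts how many retrieval rounds fit, not how good the answers are.

```python
# Toy model of context consumption in a multi-turn search agent.
# Token counts, limits, and the capping policy are hypothetical.
from typing import List, Optional

def completed_rounds(context_limit: int,
                     reasoning_per_turn: List[int],
                     evidence_per_turn: int,
                     per_turn_cap: Optional[int] = None) -> int:
    """Count retrieval rounds that fit before the context window is full.

    Each round spends some reasoning tokens (optionally capped) and then
    needs room for `evidence_per_turn` tokens of retrieved text.
    """
    used = 0
    rounds = 0
    for reasoning in reasoning_per_turn:
        if per_turn_cap is not None:
            reasoning = min(reasoning, per_turn_cap)
        if used + reasoning + evidence_per_turn > context_limit:
            break  # no room left to read new evidence
        used += reasoning + evidence_per_turn
        rounds += 1
    return rounds

# Reasoning tokens the agent would spend each turn if left unrestricted.
wanted = [6000, 5000, 7000, 4000, 6000]

uncapped = completed_rounds(16000, wanted, evidence_per_turn=2000)
capped = completed_rounds(16000, wanted, evidence_per_turn=2000, per_turn_cap=1500)
```

With these made-up numbers the uncapped agent fits two retrieval rounds before the window fills, while the capped agent fits four. The point of the sketch is only the mechanism: reasoning and evidence draw on the same budget, so a per-turn cap is what preserves room for later rounds.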
The thing you might not have known you wanted to know: a long context window is closer to a bigger search index than a bigger brain. Semantic retrieval rides for free; compositional reasoning has to be engineered back in through structured query tools, external loops, or architectures (like neural memory that compresses surprising tokens) that separate 'holding a lot' from 'computing over it' (Can neural memory modules scale language models beyond attention limits?).
Sources (7 notes)
The LOFT benchmark shows that long-context language models (LCLMs) match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.
Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.
LongRAG shows that 4K-token units and long-context readers outperform 100-word retrieval on standard benchmarks. The optimal RAG design shifts from precise retrieval to coarse ranking plus deep reading as context windows expand.
Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.