INQUIRING LINE

When AI search can't rewrite its own questions, does throwing more compute at it still pay off?

What scaling behavior do partial systems show without iterative query refinement?

This explores how search-and-retrieval systems scale when you just give them more compute — more search steps, more parallel tries — rather than letting them iteratively rewrite their queries, and what that reveals about where extra effort pays off and where it hits a wall.

This reads as a question about what happens when you scale a system by brute force — more search budget, more parallel reasoning — instead of by making it smarter about reformulating what it's looking for. The corpus has a surprisingly clean answer, and it cuts both ways.

The encouraging half: search behaves like a compute dial. Agentic deep research shows that the number of search steps follows a scaling curve closely matching that of reasoning tokens: pour in more retrieval and answer quality climbs, then flattens into diminishing returns (see "Does search budget scale like reasoning tokens for answer quality?"). This reframes search not as a fixed lookup but as a knob you can trade against reasoning, a genuine inference-compute axis (see "How does test-time scaling work for individual research agents?"). And the scaling doesn't have to go deeper to pay off; it can go wider. Sampling many parallel paths through the solution space can match the benefits of longer serial chains without paying the latency cost of depth (see "Can reasoning systems scale faster by exploring parallel paths instead?"). So a 'partial' system left to grind without query refinement still improves with scale, just along a predictable, eventually flattening curve.
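To make the "climbs, then flattens" shape concrete, here is a minimal sketch. It assumes, purely for illustration and not as something the cited notes state, that each parallel attempt succeeds independently with the same probability p. The function names and the value p = 0.2 are hypothetical.

```python
def pass_at_n(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds.

    Assumption (illustrative only): attempts are independent and each
    succeeds with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


def marginal_gain(p: float, n: int) -> float:
    """Extra success probability contributed by the n-th attempt (n >= 1)."""
    return pass_at_n(p, n) - pass_at_n(p, n - 1)


if __name__ == "__main__":
    p = 0.2  # hypothetical per-attempt success rate
    for n in range(1, 9):
        print(f"attempts={n}  success={pass_at_n(p, n):.3f}  "
              f"gain from last attempt={marginal_gain(p, n):.3f}")
```

Under this assumption every added attempt helps, but each one helps less than the one before, which is the diminishing-returns pattern described above. Real search steps and reasoning samples are not independent, so this is only a shape argument, not a fit to any reported curve.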

The sobering half: scale hits ceilings that no amount of budget moves. On constrained-optimization tasks, LLMs converge to roughly 55–60% constraint satisfaction regardless of parameter count, architecture, or training regime. Reasoning models don't systematically beat standard ones, which points to a fundamental wall rather than a scaling gap (see "Do larger language models solve constrained optimization better?"). Frontier reasoning models (DeepSeek-R1 and o1-preview) reach only 20–23.6% exact match on constraint satisfaction problems that demand genuine backtracking, even though they sound fluent while doing it (see "Can reasoning models actually sustain long-chain reflection?"). The reason is architectural: autoregressive generation can't retract a token it has already emitted, while solving these problems requires discarding bad partial attempts and trying again, which is exactly the move scaling alone can't supply (see "Why does autoregressive generation fail at constraint satisfaction?").
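The difference between committing and retracting can be sketched on a small constraint satisfaction problem. The example below uses N-queens as a stand-in (my choice, not a benchmark from the cited notes) and compares a one-pass procedure that never undoes a choice with a standard backtracking search. The one-pass procedure is only an analogy for generation that cannot retract output; it is not a model of how any LLM decodes. All function names are hypothetical.

```python
from typing import List, Optional


def conflicts(cols: List[int], col: int) -> bool:
    """True if a queen placed in the next row at `col` attacks an earlier queen."""
    row = len(cols)
    return any(c == col or abs(c - col) == row - r for r, c in enumerate(cols))


def commit_only(n: int) -> Optional[List[int]]:
    """One pass: take the first legal column in each row and never undo it."""
    cols: List[int] = []
    for _ in range(n):
        choice = next((c for c in range(n) if not conflicts(cols, c)), None)
        if choice is None:
            return None  # dead end, with no way to revise earlier choices
        cols.append(choice)
    return cols


def backtracking(n: int, cols: Optional[List[int]] = None) -> Optional[List[int]]:
    """Depth-first search that discards invalid partial assignments."""
    cols = [] if cols is None else cols
    if len(cols) == n:
        return list(cols)
    for c in range(n):
        if not conflicts(cols, c):
            cols.append(c)
            result = backtracking(n, cols)
            if result is not None:
                return result
            cols.pop()  # retract the choice and try the next column
    return None


if __name__ == "__main__":
    print("commit-only, n=4:", commit_only(4))
    print("backtracking, n=4:", backtracking(4))
```

On the 4x4 board the commit-only pass places its first two queens and then has no legal square in the third row, so it fails, while the backtracking search recovers by undoing earlier placements. Running the commit-only pass more times does not help here, because it is deterministic and the missing capability is retraction, not budget.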

Put together, the surprise is that 'more compute' and 'better querying' fix different things. Throwing search budget at a task buys you the smooth scaling curve, but where retrieval fails it tends to fail structurally: wrong trigger timing, embeddings that measure association rather than relevance, and hard mathematical limits on what a vector can represent. Those are not problems you tune your way out of by scaling (see "Where do retrieval systems fail and why?"). The thing you didn't know you wanted to know: a system without iterative refinement can look like it's improving right up until it isn't, because the scaling curve and the architectural ceiling are two separate phenomena stacked on top of each other.
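One way to picture the two stacked phenomena is a toy model, offered here as an assumption rather than anything the cited work measures: suppose only a fixed fraction of tasks is reachable by repeated attempts at all, and within that fraction each attempt succeeds independently with probability p. The names and the numbers 0.5 and 0.3 are hypothetical.

```python
def accuracy_at_budget(reachable_fraction: float, p: float, n_attempts: int) -> float:
    """Toy expected accuracy after n_attempts tries per task.

    Assumptions (illustrative only):
      * a share `reachable_fraction` of tasks can be solved by sampling,
        each attempt succeeding independently with probability p;
      * the remaining tasks fail structurally, whatever the budget.
    """
    if not (0.0 <= reachable_fraction <= 1.0 and 0.0 <= p <= 1.0):
        raise ValueError("reachable_fraction and p must be in [0, 1]")
    if n_attempts < 0:
        raise ValueError("n_attempts must be non-negative")
    return reachable_fraction * (1.0 - (1.0 - p) ** n_attempts)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 64):
        print(f"attempts={n:>2}  accuracy={accuracy_at_budget(0.5, 0.3, n):.3f}")
```

In this toy, accuracy rises smoothly with budget and then stops at the reachable fraction. The early part of the curve gives no hint of where the stop is, which is why a system can look like it is improving right up until it isn't.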

Sources (7 notes)

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.

How does test-time scaling work for individual research agents?

Research shows that deep research agents exhibit test-time scaling laws where search steps scale similarly to reasoning tokens, and live search outperforms memorized retrieval on knowledge-intensive tasks. Data efficiency is extreme—78 curated demonstrations outperform 10K samples for agency.

Can reasoning systems scale faster by exploring parallel paths instead?

GRAM demonstrates that recursive reasoning models should maintain and explore multiple latent trajectories in parallel, not only deepen single paths. Width-scaling avoids the serial latency penalty of depth while sampling the solution distribution more effectively on ambiguous problems.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Can reasoning models actually sustain long-chain reflection?

DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.

Why does autoregressive generation fail at constraint satisfaction?

The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.

Where do retrieval systems fail and why?

RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning (4.23 match)
Can Large Language Models Reason and Optimize Under Constraints? (2.57 match)
Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs (2.50 match)
From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents (1.77 match)
Reasoning Models Can Be Effective Without Thinking (1.70 match)
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (1.68 match)
Generative Recursive Reasoning (0.88 match)
LR^2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems (0.87 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a systems researcher auditing scaling claims in LLM retrieval and reasoning. The question: Do partial systems (search + inference without iterative query refinement) show predictable scaling behavior, and where do they hit hard ceilings?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026. A curated library identified:
• Search budget scales like compute: retrieval steps follow test-time scaling curves similar to reasoning tokens, with diminishing returns (~2025–2026, arXiv:2506.18959).
• Parallel sampling (width) matches serial depth benefits without latency penalty; reasoning systems should scale in width by sampling latent trajectories (~2025, arXiv:2502.12018).
• Hard architectural ceiling: autoregressive systems plateau at 55–60% constraint satisfaction and ~20–23% on backtracking tasks, independent of scale (~2026, arXiv:2603.23004).
• Failure modes are structural: timing, embedding relevance mismatch, vector capacity limits — not tunable via scaling alone (~2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2506.18959 (2025-06): Agentic Deep Research; test-time scaling laws for search.
• arXiv:2603.23004 (2026-03): Constraint satisfaction ceiling across parameter counts.
• arXiv:2503.24235 (2025-03): Survey on test-time scaling; synthesis of width/depth tradeoffs.
• arXiv:2502.05171 (2025-02): Latent reasoning and recurrent depth scaling.

Your task:
(1) RE-TEST THE CEILING CLAIM. The library asserts autoregressive generation cannot retract emitted tokens, causing hard failure on backtracking. Has non-autoregressive decoding, constrained generation, or in-context token revision (e.g., editing layers, diffusion-based refinement, arXiv:2502.09992) since relaxed this? Separately: do newer evaluations (LR²Bench, arXiv:2502.17848) confirm the 20–23% plateau, or do larger models / scaffold methods exceed it? State plainly what still holds.
(2) Surface the strongest CONTRADICTING work from the last 6 months. Does any recent paper show iterative refinement-free scaling beating the 55–60% / 20–23% barriers, or argue the ceiling is instrumentation artifact, not fundamental?
(3) Propose 2 durable research questions that assume the regime may have shifted: (a) If width-scaled reasoning + modern retrieval can bypass the backtracking wall, what is the new bottleneck? (b) Under what conditions does search-budget scaling remain predictable vs. bifurcating into orthogonal failure modes?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
