INQUIRING LINE


Should AI systems follow a fixed pipeline for every query, or learn to invent a custom workflow on the fly?

How do cascaded probabilistic models compare to reinforcement learning for per-query system design?

This line compares two ways of deciding how a system should handle each incoming query: cascaded (staged) designs that split the work into fixed probabilistic stages (plan, then retrieve, then answer), and reinforcement learning that lets a meta-agent invent a custom workflow for each query. The corpus disagrees with itself about which wins.

This pits two design philosophies against each other. A cascaded approach hands a query through a fixed sequence of stages — separate query planning, then answer synthesis — where each stage does one probabilistic job. The reinforcement-learning approach instead trains a meta-agent to *generate* a bespoke architecture per query, optimizing the whole pipeline against execution feedback rather than locking the stages in advance.
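The contrast above can be made concrete with a minimal sketch. Everything here is illustrative: the stage functions, `FIXED_WORKFLOW`, and `toy_meta_policy` are hypothetical names, and the hand-written rule merely stands in for a meta-agent that would in practice be trained on execution feedback.

```python
from typing import Callable, Dict, List

Stage = Callable[[str], str]

# Toy stages: each wraps its input so the order of execution is visible.
def plan(state: str) -> str:
    return f"plan({state})"

def retrieve(state: str) -> str:
    return f"retrieve({state})"

def answer(state: str) -> str:
    return f"answer({state})"

STAGES: Dict[str, Stage] = {"plan": plan, "retrieve": retrieve, "answer": answer}

def run(workflow: List[str], query: str) -> str:
    """Feed the query through the named stages in order."""
    state = query
    for name in workflow:
        state = STAGES[name](state)
    return state

# Cascaded design: the same stage order for every query.
FIXED_WORKFLOW = ["plan", "retrieve", "answer"]

def toy_meta_policy(query: str) -> List[str]:
    """Hypothetical stand-in for a learned meta-agent.

    It chooses a workflow per query. A real meta-agent would learn this
    mapping from execution feedback rather than use a word-count rule.
    """
    if len(query.split()) <= 3:
        return ["answer"]  # short query: skip planning and retrieval
    return ["plan", "retrieve", "answer"]

print(run(FIXED_WORKFLOW, "q"))
print(run(toy_meta_policy("capital of France"), "capital of France"))
```

The only structural difference is where the workflow list comes from: a constant in the cascaded case, a function of the query in the per-query case.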

The corpus's most interesting move is that it argues both sides. On the cascaded side, separating query planning from answer synthesis measurably reduces interference and improves multi-hop performance, because the stages stop stepping on each other (note: Do hierarchical retrieval architectures outperform flat ones on complex queries?). On the RL side, that same separation is framed as the *problem*: when you split asking, recommending, and timing into isolated decisions, gradient signals can't inform one another and you never optimize the full trajectory holistically, so a single learned policy beats the separated version (note: Can unified policy learning improve conversational recommender systems?). So 'decompose into clean stages' and 'fuse everything into one learned policy' are both empirically supported, just on different axes: cascading buys interpretability and reduced interference; RL buys joint optimization.

The sharpest case for going fully per-query with RL is FlowReasoner, where a meta-agent trained on external execution feedback builds a unique multi-agent system for each user query, trading off performance, complexity, and efficiency instead of reusing one task-level template (note: Can AI systems design unique multi-agent workflows per individual query?). The same instinct shows up in graph retrieval, where MCTS plus RL replaces reading the whole graph with a learned, query-specific traversal, accepting uncertainty about the full graph in exchange for fitting the decision inside the context window (note: Can learned traversal policies beat exhaustive graph reading?). The pattern: RL shines when the right structure genuinely varies query to query and you have a usable reward signal.
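One way to picture a reward that trades performance against complexity and efficiency is a weighted sum. This is only an assumed form for illustration: the notes do not specify FlowReasoner's actual reward, and `WorkflowResult`, `workflow_reward`, and the weights below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class WorkflowResult:
    passed: float      # fraction of execution checks passed, in [0, 1]
    num_agents: int    # assumed proxy for architectural complexity
    tokens_used: int   # assumed proxy for runtime cost

def workflow_reward(result: WorkflowResult,
                    w_perf: float = 1.0,
                    w_complexity: float = 0.05,
                    w_cost: float = 1e-4) -> float:
    """Scalarized reward: pay for performance, charge for size and cost."""
    return (w_perf * result.passed
            - w_complexity * result.num_agents
            - w_cost * result.tokens_used)

small = WorkflowResult(passed=0.90, num_agents=2, tokens_used=1000)
large = WorkflowResult(passed=0.95, num_agents=6, tokens_used=4000)

# Under these weights the leaner workflow scores higher despite the
# slightly lower pass rate.
print(workflow_reward(small), workflow_reward(large))
```

The weights encode the trade-off: raising `w_perf` relative to the penalties would favor the larger workflow instead.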

That reward-signal caveat is where the two camps can merge rather than compete. You don't always need RL to discover everything from scratch: LLMs can construct reward-shaping functions by first solving a simplified *deterministic* version of the problem, then porting the plan into shaping rewards for the stochastic task (note: Can LLMs design reward functions for reinforcement learning?). That's a cascaded scaffold feeding an RL objective: the deterministic abstraction does the cheap structural reasoning, and RL handles the messy residual. RL's footprint is also smaller than it looks: across seven RL algorithms and ten LLM families it updates only 5–30% of parameters, in structured subnetworks whose updates are nearly full-rank, suggesting the 'learned per-query' machinery is editing a surprisingly compact part of the system (note: Does reinforcement learning update only a small fraction of parameters?).
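The plan-to-shaping step can be sketched with standard potential-based reward shaping, F(s, s') = γΦ(s') − Φ(s). This is not claimed to be MEDIC's exact construction: the potential used here (negative number of plan steps remaining) and the names `potential_from_plan` and `shaped_reward` are assumptions made for illustration.

```python
from typing import Dict, Hashable, List

def potential_from_plan(plan_states: List[Hashable]) -> Dict[Hashable, float]:
    """Potential = minus the number of plan steps left to the goal.

    `plan_states` is the state sequence of a plan found in the simplified
    deterministic problem (assumed non-empty, goal last).
    """
    last = len(plan_states) - 1
    return {state: float(i - last) for i, state in enumerate(plan_states)}

def shaped_reward(env_reward: float, state: Hashable, next_state: Hashable,
                  phi: Dict[Hashable, float], gamma: float = 0.99) -> float:
    """Environment reward plus the shaping term gamma*phi(s') - phi(s)."""
    # States the plan never visited get a potential one below the worst
    # on-plan state (an assumed default).
    off_plan = min(phi.values()) - 1.0
    phi_s = phi.get(state, off_plan)
    phi_next = phi.get(next_state, off_plan)
    return env_reward + gamma * phi_next - phi_s

phi = potential_from_plan(["start", "door", "key", "goal"])
print(shaped_reward(0.0, "start", "door", phi, gamma=1.0))  # step along the plan
print(shaped_reward(0.0, "door", "start", phi, gamma=1.0))  # step backwards
```

With `gamma=1.0`, moving one step forward along the plan earns a bonus of 1 and moving backwards costs 1, so the stochastic learner is nudged toward the plan without the plan being enforced.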

The thing you didn't know you wanted to know: per-query adaptation doesn't have to live in the weights at all. AgentFly treats agent learning as a memory-augmented decision process, doing credit assignment and policy improvement entirely through memory operations, with no parameter updates, and still reaches 87.88% on GAIA validation (note: Can agents learn continuously from experience without updating weights?). So the real spectrum isn't 'cascaded probabilistic stages' versus 'RL'. It's a gradient from fixed stages, to RL-shaped stages, to learned-policy generation, to memory-based adaptation, and you can mix them based on how much your query distribution actually varies and how clean your feedback signal is.
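A toy version of adaptation through memory alone might look like the following. It is a sketch under stated assumptions, not AgentFly's design: the notes only say that memory operations carry the learning, so `CaseMemory`, the Jaccard similarity, and the similarity-times-reward score are all hypothetical choices.

```python
from dataclasses import dataclass, field
from typing import List, Optional

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two queries."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

@dataclass
class Case:
    query: str
    workflow: List[str]
    reward: float

@dataclass
class CaseMemory:
    cases: List[Case] = field(default_factory=list)

    def write(self, query: str, workflow: List[str], reward: float) -> None:
        """Learning step: store the outcome. No model weights change."""
        self.cases.append(Case(query, workflow, reward))

    def retrieve(self, query: str, min_sim: float = 0.2) -> Optional[Case]:
        """Return the similar past case with the best similarity * reward.

        Cases with non-positive reward are never returned.
        """
        best, best_score = None, 0.0
        for case in self.cases:
            sim = jaccard(query, case.query)
            score = sim * case.reward
            if sim >= min_sim and score > best_score:
                best, best_score = case, score
        return best

memory = CaseMemory()
memory.write("compare two papers on RL", ["plan", "retrieve", "answer"], 1.0)
memory.write("compare two papers on RL", ["answer"], 0.1)

hit = memory.retrieve("compare papers on memory")
print(hit.workflow if hit else None)
print(memory.retrieve("weather today"))
```

Credit assignment here is just the stored reward: a workflow that worked on a similar query is preferred next time, and an unfamiliar query falls through to whatever default the system uses.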

Sources (7 notes)

Do hierarchical retrieval architectures outperform flat ones on complex queries?

Separating query planning from answer synthesis into distinct components reduces interference and improves multi-hop query performance. This architectural principle mirrors documented benefits of separating planning from execution in agent design.

Can unified policy learning improve conversational recommender systems?

Research shows that formulating attribute-asking, item-recommending, and timing decisions as a single graph-based RL policy achieves better joint optimization than isolated components. Separation prevents gradient signals from informing one another and fails to optimize conversation trajectory holistically.

Can AI systems design unique multi-agent workflows per individual query?

FlowReasoner demonstrates that meta-agents trained with reinforcement learning and external execution feedback can generate unique multi-agent architectures for each user query, optimizing across performance, complexity, and efficiency—moving beyond fixed task-level workflow templates.

Can learned traversal policies beat exhaustive graph reading?

Graph-O1 replaces whole-graph ingestion with step-by-step agentic navigation using Monte Carlo Tree Search and reinforcement learning. This approach fits within LLM context windows while learning domain-specific traversal policies, though it trades certainty about the full graph for decision-making under uncertainty.

Can LLMs design reward functions for reinforcement learning?

MEDIC shows that LLMs can generate effective reward shaping functions by first solving a deterministic, simplified version of the RL problem, then converting the resulting plan into shaping rewards for the original stochastic task. A model-based critic validates LLM outputs before deployment.


Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, a sparsity that emerges without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey (match 1.71; arXiv)
Look Before You Leap: Autonomous Exploration for LLM Agents (match 1.68; arXiv)
Efficient Reinforcement Learning via Large Language Model-based Search (match 0.92; arXiv)
Reinforcement Learning Finetunes Small Subnetworks in Large Language Models (match 0.90; arXiv)
Useful Memories Become Faulty When Continuously Updated by LLMs (match 0.90; arXiv)
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs (match 0.90; arXiv)
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory (match 0.90; arXiv)
Unified Conversational Recommendation Policy Learning via Graph-based Reinforcement Learning (match 0.89; arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a systems architect evaluating whether to lock a per-query pipeline into fixed cascaded stages (query plan → answer synthesis) or train a meta-agent via RL to build a custom architecture per query. A curated library spanning 2021–2026 made these dated claims — treat them as testable, not current truth:

**What a curated library found — and when (findings span 2021–2026; these are perishable claims):**
• Cascaded separation (query planning from answer synthesis) reduces interference and improves multi-hop performance, but isolates gradient signals; RL-trained unified policies optimize the full trajectory jointly (~2024–2025).
• Per-query RL meta-agents (FlowReasoner) build personalized multi-agent systems, trading off performance, complexity, and efficiency—outperforming one-size-fits-all templates (~2025).
• RL editing is sparse: only 5–30% of parameters update in structured, near-full-rank subnetworks, suggesting learned-per-query adaptation occupies compact model real estate (~2025).
• Memory-based adaptation (AgentFly) achieves credit assignment and policy improvement via memory operations alone, with no parameter updates, and still posts strong benchmark results (~2026).
• LLMs can bootstrap RL reward-shaping by solving a deterministic abstraction first, then porting the plan into stochastic rewards—merging cascaded scaffolding with RL objectives (~2025).

**Anchor papers (verify; mind their dates):**
• 2024-05, arXiv:2405.15194 — Efficient RL via LLM-based Search
• 2025-04, arXiv:2504.15257 — FlowReasoner: Query-Level Meta-Agents
• 2025-05, arXiv:2505.11711 — RL Finetunes Small Subnetworks in LLMs
• 2026-05, arXiv:2605.12978 — Useful Memories Become Faulty When Continuously Updated

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For cascaded-vs.-RL, separate the durable question (when does query-specific architecture beat fixed stages?) from perishable claims. Has newer reasoning-scaling (o1-style models, post-training at scale) altered the trade-off between joint optimization and stage isolation? Does reasoning-time compute now let fixed cascades match RL's holism without learning? Cite what resolved or clarified each tension.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months. Does any paper argue that memory-based adaptation (2026-05) actually *regresses* when the memory is continuously updated, suggesting RL weight updates are more durable? Does anything challenge the sparse-update finding (~5–30% of parameters)?
(3) **Propose 2 research questions** that assume the regime may have shifted: (a) Can reasoning-time orchestration (routing, backtracking, re-planning) achieve per-query adaptation *without* either learned policies or memory ops—purely through in-context LLM search? (b) When does the memory-update instability (2026-05) force you back to RL or fixed cascades, and can hybrid memory + sparse weight-updates mitigate it?

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
