INQUIRING LINE

Can domain-expert workflows always decompose into inspectable stages for AI?

This explores whether every expert's work can be broken into discrete, checkable steps an AI can supervise — and where the word 'always' breaks down.


This reads the question as being about the word 'always.' Decomposing expert work into inspectable stages is one of the most productive moves in the corpus, but it is a property of some domains and some tasks, not a universal guarantee. The most optimistic case is "Can algorithms control LLM reasoning better than LLMs alone?": wrap an LLM inside an explicit algorithm, hand each call only its step-relevant context, and complex reasoning becomes a set of modular, debuggable sub-tasks. In the same spirit, "Can agents learn reusable sub-task routines from past experience?" shows that agents can extract reusable sub-task routines and compound them hierarchically, with gains that grow as the gap between training and test tasks widens. That is evidence that stage-level structure transfers where whole-task memorization fails.
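The "explicit algorithm around the LLM" idea can be sketched in a few lines. This is a minimal illustration, not the cited work's implementation: `Step`, `Trace`, `run_program`, and the `llm` callable are all hypothetical names, and `llm` stands in for whatever model client you use.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumption: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class Step:
    name: str          # key under which this step's output is stored
    needs: List[str]   # the only state keys this step is allowed to see
    template: str      # prompt template that uses exactly those keys


@dataclass
class Trace:
    prompts: Dict[str, str] = field(default_factory=dict)
    outputs: Dict[str, str] = field(default_factory=dict)


def run_program(steps: List[Step], state: Dict[str, str], llm: LLM) -> Trace:
    """Run the steps in order; the algorithm, not the model, owns control flow."""
    trace = Trace()
    for step in steps:
        # Information hiding: build the prompt from step-relevant context only.
        visible = {key: state[key] for key in step.needs}
        prompt = step.template.format(**visible)
        output = llm(prompt)
        state[step.name] = output
        # Record every prompt and output so each stage can be inspected later.
        trace.prompts[step.name] = prompt
        trace.outputs[step.name] = output
    return trace
```

Because each step declares the state it needs, a failure can be traced to a single prompt and output pair rather than to one monolithic conversation.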

But the corpus is blunt about why decomposition isn't always available: the bottleneck is the domain, not the model. "What makes a research domain suitable for autonomous optimization?" argues that a domain yields to staged autonomous work only if it has immediate scalar metrics, modular architecture, fast iteration, and version control, and that a domain missing any one of these resists decomposition 'regardless of LLM capability.' That's the direct answer to 'always': no. An expert workflow whose quality can't be scored mid-stream, or whose steps don't separate cleanly, doesn't become inspectable just because you point a smarter model at it.
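The four-property test reads naturally as a conjunction: one missing property is enough to fail. A minimal sketch, assuming each property can be judged as a simple yes or no (the class and function names are hypothetical, and real domains will be messier than booleans):

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class DomainProfile:
    immediate_scalar_metric: bool
    modular_architecture: bool
    fast_iteration: bool
    version_control: bool


def missing_properties(profile: DomainProfile) -> List[str]:
    """Names of the structural properties the domain lacks."""
    return [f.name for f in fields(profile) if not getattr(profile, f.name)]


def supports_staged_autonomy(profile: DomainProfile) -> bool:
    """True only when all four properties hold; model capability is not an input."""
    return not missing_properties(profile)
```

Note that nothing about the model appears in the signature, which mirrors the claim that the bottleneck is environmental structure.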

The subtler trap is that decomposing into stages and making those stages actually inspectable are two different things. "Do frontier LLMs silently corrupt documents in long workflows?" found that across long delegated relays, frontier models corrupt about a quarter of document content, and the errors compound silently, with no plateau through 50 round-trips, precisely because nobody is inspecting the intermediate hand-offs. So the stages existed; the inspection didn't. "Where do reasoning agents actually fail during long traces?" is the corrective: checking intermediate states rather than final answers lifted task success from 32% to 87%, because most failures were process violations invisible to outcome scoring. Inspectable stages only pay off if you actually verify the stages.
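The difference between having stages and inspecting them can be made concrete. The sketch below is an assumption-laden illustration, not a method from either cited note: `run_relay`, `HandoffError`, and `keeps_all_lines` are hypothetical names, and the line-preservation check is only one of many possible hand-off invariants.

```python
from typing import Callable, List, Tuple

Stage = Callable[[str], str]        # transforms a document
Check = Callable[[str, str], bool]  # (before, after) -> hand-off is acceptable


class HandoffError(Exception):
    """Raised when a stage's output fails its intermediate check."""


def keeps_all_lines(before: str, after: str) -> bool:
    """Example invariant: every non-blank input line survives in the output."""
    kept = set(after.splitlines())
    return all(line in kept for line in before.splitlines() if line.strip())


def run_relay(doc: str, stages: List[Tuple[str, Stage, Check]]) -> str:
    """Pass the document through each stage, verifying every hand-off."""
    for name, stage, check in stages:
        result = stage(doc)
        if not check(doc, result):
            # Stop at the first violation instead of letting errors compound.
            raise HandoffError(f"stage {name!r} failed its hand-off check")
        doc = result
    return doc
```

A relay without the `check` call would still have stages, but a dropped line in stage two would pass silently into stage three.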

Where clean stage labels aren't handed to you, the corpus offers a backdoor: infer them from structure. "Can trajectory structure replace hand-annotated process rewards?" shows you can mine dense step-level signal from a trajectory's own shape (tree topology, expert-aligned actions, tool-call positions) without hand-annotating each stage. And "Can knowledge graphs teach models deep domain expertise?" builds expertise bottom-up from compositional primitives along knowledge-graph paths, which is another way of saying some expert reasoning can be reconstructed as composable units even when the human expert never articulated them as steps.
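To show how tree topology alone can yield step-level signal, here is a generic sketch: score each intermediate step by the mean outcome reward of the leaves beneath it. This is a common credit-assignment idea chosen for illustration; it is not claimed to be the procedure of Tree-GRPO or any other cited method, and `Node`, `leaf_rewards`, and `step_values` are hypothetical names. It assumes every leaf carries a reward and labels are unique.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    reward: Optional[float] = None  # outcome reward, set only on leaves


def leaf_rewards(node: Node) -> List[float]:
    """Collect the outcome rewards of all leaves under this node."""
    if not node.children:
        return [node.reward]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_rewards(child))
    return rewards


def step_values(node: Node, out: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Assign each step the mean outcome reward of the rollouts passing through it."""
    out = {} if out is None else out
    rewards = leaf_rewards(node)
    out[node.label] = sum(rewards) / len(rewards)
    for child in node.children:
        step_values(child, out)
    return out
```

Sparse rewards on the leaves become a value on every internal step, with no per-step annotation.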

So the honest synthesis: not always, and the reasons are worth knowing. Decomposability is a structural property of the domain ("What makes a research domain suitable for autonomous optimization?"); decomposition without active mid-process verification gives you the illusion of inspectability while errors compound underneath ("Do frontier LLMs silently corrupt documents in long workflows?", "Where do reasoning agents actually fail during long traces?"); and even where you succeed, the stages plus tools plus memory have to be engineered as a pipeline, not assumed. "Can you turn an LLM into an agent by just fine-tuning?" makes the point that the surrounding harness, not the model, decides whether each action is grounded or hallucinated.


Sources (8 notes)

Can algorithms control LLM reasoning better than LLMs alone?

LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.

Can agents learn reusable sub-task routines from past experience?

Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.

What makes a research domain suitable for autonomous optimization?

Autonomous research pipelines require immediate scalar metrics, modular architecture, fast iteration cycles, and version control. Domains lacking any property resist autoresearch regardless of LLM capability, because the bottleneck is environmental structure, not model power.

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.

Where do reasoning agents actually fail during long traces?

Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.

Can trajectory structure replace hand-annotated process rewards?

Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.

Can knowledge graphs teach models deep domain expertise?

Fine-tuning a 32B model on 24,000 reasoning tasks derived from medical knowledge graph paths produces state-of-the-art performance across 15 medical domains, demonstrating that structured knowledge composition matters more than scale.

Can you turn an LLM into an agent by just fine-tuning?

Converting LLMs to action-capable systems requires four distinct stages: curating action-environment-user datasets, training for action grounding, integrating agent infrastructure with memory and tools, and rigorous safety evaluation. The surrounding system and harness determine whether actions are grounded or hallucinated.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether domain-expert workflows always decompose into inspectable stages for AI. The question remains open; treat findings below as dated claims to verify, not current truth.

What a curated library found — and when (findings span 2022–2026):
• Decomposability is a structural property of the domain, not the model: a workflow resists staged AI work if it lacks immediate scalar metrics, modular architecture, fast iteration, or version control — regardless of LLM capability (2024).
• Frontier models silently corrupt ~25% of document content over long delegated relays; errors compound without active inspection of intermediate hand-offs (2026).
• Mid-process verification (checking intermediate states and policy compliance, not just final outputs) lifted task success from 32% to 87%; most failures are process violations invisible to outcome scoring (2024).
• Process supervision can be mined from trajectory structure (tree topology, expert-aligned actions, tool positions) without hand-annotation; knowledge-graph curricula reconstruct expertise as composable primitives (2025).
• The surrounding harness (memory, caching, orchestration, tool-call verification), not the model alone, decides whether staged actions ground or hallucinate (2025).

Anchor papers (verify; mind their dates):
• arXiv:2409.07429 — Agent Workflow Memory (2025)
• arXiv:2604.15597 — LLMs Corrupt Your Documents When You Delegate (2026)
• arXiv:2507.13966 — Bottom-up Domain-specific Superintelligence (2025)
• arXiv:2508.15760 — LiveMCP-101: Stress Testing MCP-enabled Agents (2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, assess whether newer models (reasoning-focused, post-o1), improved orchestration (multi-agent memory, persistent caching, verified tool-call chains), or stronger evaluation harnesses have since relaxed or overturned it. Separate the durable question — *which domains intrinsically resist decomposition?* — from perishable limitations like *silent corruption over relays* (now mitigated by checkpointing?). Cite what resolved it.
(2) Surface the strongest work from the last ~6 months that either contradicts the "domain constraints" framing or supersedes the "silent corruption" finding — e.g., do newer harnesses prevent relay errors, or do they merely make them visible?
(3) Propose 2 research questions that assume the regime may have shifted: e.g., *Can process supervision now be automated to the point that verification overhead vanishes?* *Do multi-turn reasoning models dissolve the modular-architecture requirement?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
