What repair strategies work best at each level of Clark's ladder?
This reads the question through Herbert Clark's ladder of communicative grounding — the idea that any exchange has stacked levels (attention, hearing the signal, understanding the meaning, taking up the intent) and that repair should target the lowest rung where things actually broke. The honest answer up front: this corpus doesn't talk about Clark or conversational repair directly, but it has a surprisingly clean parallel — a body of work on catching machine errors at the level where they originate rather than at the surface.
Clark's ladder implies that a breakdown at a low rung (mishearing) needs a different fix than a breakdown at a high rung (misunderstanding intent). The corpus has no note that names Clark, but it keeps rediscovering his core move: repair at the level where the failure lives, not at the output. The sharpest version is the finding that checking a reasoning agent's *intermediate states* during generation, rather than scoring its final answer, lifted task success from 32% to 87%, because most failures were process violations, not wrong conclusions (note: "Where do reasoning agents actually fail during long traces?"). That is a ladder argument in disguise: a wrong final answer is the top rung, but the actual rupture happened several rungs down, and you can only repair it by intervening there.
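The contrast between scoring a final answer and checking intermediate states can be sketched in a few lines. This is a minimal illustration, not the method from the cited note: the `State` shape, the `first_violation` helper, and the `budget_not_exceeded` policy are all hypothetical names invented for the example.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical representation: each intermediate state is a plain dict.
State = dict
# A check returns a violation message, or None if the state is acceptable.
Check = Callable[[State], Optional[str]]


def first_violation(
    trace: Sequence[State], checks: Sequence[Check]
) -> Optional[Tuple[int, str]]:
    """Return (step_index, message) for the earliest violation, or None."""
    for index, state in enumerate(trace):
        for check in checks:
            message = check(state)
            if message is not None:
                return index, message
    return None


def budget_not_exceeded(state: State) -> Optional[str]:
    # Hypothetical policy: spending must never exceed the budget.
    if state.get("spent", 0) > state.get("budget", 0):
        return "spent exceeds budget"
    return None


# The final state looks fine, so a final-answer check would pass,
# but the step at index 1 broke the policy along the way.
trace = [
    {"spent": 10, "budget": 50},
    {"spent": 70, "budget": 50},
    {"spent": 20, "budget": 50},
]
result = first_violation(trace, [budget_not_exceeded])
```

Here `result` identifies the step where the process broke, which is the rung a repair would need to target, even though the last state on its own looks compliant.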
The corpus also suggests that the *right rung depends on who's failing*. Weaker models break loudly, by deleting content; frontier models break quietly, by corrupting it while the surface still looks competent (note: "Do frontier models fail differently than weaker models?"). In Clark's terms, the weak-model failure is a low-rung signal problem you can catch by inspection, while the frontier failure is a high-rung meaning problem that passes every surface check, so the repair strategy has to climb as capability climbs. Tellingly, you can't fix the high-rung failure with low-rung tools: giving a model better editing tools didn't improve reliability, because the breakdown was upstream in its judgment about *what* to change, not in the mechanics of changing it (note: "Can better tools fix LLM document editing errors?"). Repairing the wrong rung does nothing.
There's even a vocabulary-level version of this. Calling LLM errors 'hallucinations' points repair at the perception rung (ground the model better, show it more truth), when the real failure is at the generation rung, where accurate and fabricated text come out of the identical statistical process. Renaming them 'fabrications' redirects the fix toward verification and calibrated uncertainty instead of grounding (note: "Does calling LLM errors hallucinations point us toward the wrong fixes?"). The name you give the breakdown decides which rung you try to repair, and the wrong name wastes the effort.
Where the corpus gets constructive about *building* repair into each level is in training. Assigning a full episode's reward back to each individual step, then normalizing across rollouts, lets the system see which specific action in a long sequence deserved credit or blame: repair localized to a rung rather than smeared across the whole chain (note: "Can full episode rewards per step enable better credit assignment?"). The curriculum result, imitation first to establish reasonable moves and then verifiable-reward refinement to sharpen them, is essentially repairing the lower rungs before the higher ones become legible at all, because outcome rewards are uninformative until the foundational behavior exists (note: "Does sequencing imitation then exploration training improve reasoning?").
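The credit-assignment idea can be made concrete with a small sketch. It assumes the common group-relative recipe of subtracting the group mean and dividing by the group standard deviation; the cited note does not spell out the exact formula, so this is an illustration rather than the MS-GRPO implementation, and `step_advantages` is a hypothetical name.

```python
from statistics import mean, pstdev


def step_advantages(rollouts, eps=1e-8):
    """Compute one advantage per step for a group of rollouts.

    rollouts: list of (steps, episode_reward) pairs sampled for the same task.
    Every step in a rollout receives that rollout's episode reward,
    normalized against the other rollouts in the group (assumed formula).
    """
    rewards = [reward for _, reward in rollouts]
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group

    advantages = []
    for steps, reward in rollouts:
        normalized = (reward - mu) / (sigma + eps)
        # Broadcast the episode-level signal to each individual step.
        advantages.append([normalized] * len(steps))
    return advantages


# Two rollouts for the same task: one succeeded, one failed.
group = [(["plan", "act"], 1.0), (["act"], 0.0)]
adv = step_advantages(group)
```

In this sketch, every step of the successful rollout gets a positive advantage and every step of the failed one a negative advantage; if all rollouts earn the same reward, the advantages collapse to zero and the group carries no learning signal.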
Across reasoning, document editing, terminology, and training, this collection keeps arriving at one rule that maps directly onto Clark: diagnose which level actually broke, because the repair that works at one rung tends to be inert at the others. What the corpus *can't* yet tell you is which concrete repair move belongs to each named rung of a human–AI conversation; that's an open seam worth its own note rather than a question this library answers.
Sources (6 notes)
Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.
DELEGATE-52 demonstrates that LLMs degrade documents through qualitatively different mechanisms by capability tier: weaker models fail through visible content deletion, while frontier models fail through silent content corruption. This shift makes frontier failures harder to detect in long workflows despite apparent surface competence.
DELEGATE-52 shows that agentic tool access fails to improve performance on long-horizon document tasks. The degradation mechanism originates upstream in the model's judgment about what to change, not in editing interface limitations.
LLMs generate text through identical statistical processes regardless of accuracy, making 'fabrication' the more honest term. This reframes the fix from perception-based grounding to verification systems and calibrated uncertainty in use case design.
MS-GRPO assigns cumulative episode reward to each step, and group-relative normalization across rollouts surfaces which action sequences succeed. A 3B model post-trained this way outperforms 72B baselines by 50%, showing the training method matters more than scale for multi-step tasks.
Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.