INQUIRING LINE

Which code verification tasks still require execution instead of reasoning?

This explores where verifying code can now be done by reasoning about it (structured 'thinking' instead of running it), and where actually executing the code is still the only reliable check.


This reads the question as a frontier-mapping problem: the corpus has been steadily moving tasks out of the "must execute" column and into the "can reason" column, so the interesting answer is which tasks have resisted that move. The headline result is that far less requires execution than you might expect. Semi-formal reasoning (natural-language templates that force an agent to lay out premises, trace each code path, and check evidence) reaches 93% accuracy when verifying whether two patches are equivalent, crossing the reliability bar needed to use it as a training reward signal (see "Can structured reasoning replace code execution for RL rewards?"). Those templates work by importing the *discipline* of formal verification (no skipped cases, no unsupported claims) without the symbolic machinery, catching subtle bugs such as function shadowing that free-form thinking sails past (see "Can structured templates replace formal verification for code reasoning?" and "Can structured templates make code reasoning more reliable than free-form thinking?"). So patch equivalence, fault localization, and policy-compliance checking are increasingly reasoning tasks, not execution tasks.


Sources (8 notes)

Can structured reasoning replace code execution for RL rewards?

Semi-formal reasoning templates enable execution-free patch equivalence verification at 93% accuracy on real agent code, crossing the reliability threshold needed for RL reward signals. This makes execution-free verification viable for certain task classes like fault localization and code reasoning.

Can structured templates replace formal verification for code reasoning?

Semi-formal reasoning using natural-language templates enforces the discipline of formal methods without formalizing language semantics. Templates prevent case-skipping, unsupported claims, and confirmation bias—capturing the verification benefits of formalism through forced completeness scaffolding rather than symbolic rigor.
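One way to picture "forced completeness scaffolding" is as a checklist that a reasoning trace must satisfy before its verdict is accepted. The sketch below is a hypothetical illustration, not the template format used in the cited work: the names `PathTrace`, `Certificate`, and `completeness_gaps` are assumptions introduced here. It flags missing premises, untraced code paths, and claims with no supporting evidence.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PathTrace:
    """One traced code path with the claim made about it (hypothetical structure)."""
    path_id: str
    conclusion: str
    evidence: List[str] = field(default_factory=list)


@dataclass
class Certificate:
    """A filled-in semi-formal template: stated premises plus per-path traces."""
    premises: List[str]
    traces: List[PathTrace]


def completeness_gaps(cert: Certificate, required_paths: List[str]) -> List[str]:
    """Return every way the certificate falls short of the template's demands."""
    gaps = []
    if not cert.premises:
        gaps.append("no premises stated")
    traced = {trace.path_id for trace in cert.traces}
    # Case-skipping: a required path that was never traced.
    for path in sorted(set(required_paths) - traced):
        gaps.append(f"path not traced: {path}")
    # Unsupported claims: a conclusion with no evidence attached.
    for trace in cert.traces:
        if not trace.evidence:
            gaps.append(f"unsupported claim on path: {trace.path_id}")
    return gaps


cert = Certificate(
    premises=["both patches modify only parse()"],
    traces=[
        PathTrace("if-branch", "same return value", ["line 12 unchanged"]),
        PathTrace("else-branch", "same return value"),  # no evidence given
    ],
)
gaps = completeness_gaps(cert, ["if-branch", "else-branch", "exception"])
print(gaps)
```

In this toy case the verdict would be rejected for two reasons: the exception path was skipped and the else-branch claim is unsupported. The check guarantees only that the template was filled in, not that the reasoning inside it is correct.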

Can structured templates make code reasoning more reliable than free-form thinking?

Semi-formal templates requiring explicit premises, code-path traces, and evidence checks improved patch equivalence accuracy from 78% to 88%, catching cases like function shadowing that free-form reasoning missed. Templates act as completeness certificates without formal verification.
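To make the shadowing case concrete, here is a small hypothetical example (not drawn from the cited benchmark) in which two patches look equivalent on a quick read but are not, because a module-level helper shadows a Python builtin. The function names are assumptions introduced for illustration.

```python
import builtins


def format(value):
    """Module-level helper that shadows the builtin format()."""
    return f"<{value}>"


def render_original(value):
    # Patch A: converts the value to text with an f-string.
    return f"{value}"


def render_patched(value):
    # Patch B: reads like a call to the builtin format(), which would
    # behave like str(value) here. Name resolution finds the module-level
    # helper first, so the output differs.
    return format(value)


print(render_original(3))   # text form of 3
print(render_patched(3))    # wrapped in angle brackets by the shadowing helper
print(builtins.format(3))   # what the reader probably assumed Patch B does
```

A free-form read tends to assume `format` is the builtin and declare the patches equivalent. A template that requires resolving each called name to its definition before tracing the path surfaces the difference without running anything.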

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
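The gap between knowing an algorithm and executing it at scale can be illustrated with any procedure whose description is short but whose output is long. The note does not name a specific task; Tower of Hanoi is used below purely as an assumed example, since its recursive rule fits in a few lines while the move list grows as 2^n - 1.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Yield the moves that transfer n disks from source to target."""
    if n == 0:
        return
    # Move the top n-1 disks out of the way, move the largest, then restack.
    yield from hanoi_moves(n - 1, source, spare, target)
    yield (source, target)
    yield from hanoi_moves(n - 1, spare, target, source)


# The rule above never changes, but the number of steps to write out does.
move_counts = {n: sum(1 for _ in hanoi_moves(n)) for n in (3, 10, 15)}
print(move_counts)
```

A text-only model has to emit every one of those moves token by token without error, whereas a tool-enabled model can write the eight-line function and run it. That is the sense in which the bottleneck may be procedural bandwidth and not knowledge of the procedure.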

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Can AI systems improve themselves through trial and error?

DGM replaces formal proofs with empirical benchmarking and maintains an evolutionary archive of agent variants, achieving 2.5× improvement on SWE-bench and 2.2× on Polyglot by discovering capabilities like better code editing and context management.

Can code become the operational substrate for agent reasoning?

Research shows code uniquely enables agents to externalize reasoning, execute policies, model environments, and verify progress through its simultaneous executability, inspectability, and statefulness across task steps.

Where do reasoning agents actually fail during long traces?

Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.
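The difference between scoring a final output and checking intermediate states can be sketched as follows. This is a minimal illustration under stated assumptions: the trace format, the `no_writes_outside_workspace` policy, and the `first_violation` helper are hypothetical and do not come from the cited work.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, str]
Check = Callable[[State], Optional[str]]  # returns a violation message or None


def no_writes_outside_workspace(state: State) -> Optional[str]:
    """Example policy: the agent may only write under workspace/."""
    if state["action"] == "write" and not state["path"].startswith("workspace/"):
        return f"write outside workspace: {state['path']}"
    return None


def first_violation(
    trace: List[State], checks: List[Check]
) -> Optional[Tuple[int, str]]:
    """Check every intermediate state; report the first policy violation."""
    for index, state in enumerate(trace):
        for check in checks:
            violation = check(state)
            if violation is not None:
                return index, violation
    return None


trace = [
    {"action": "read", "path": "workspace/app.py"},
    {"action": "write", "path": "/etc/hosts"},       # process violation
    {"action": "write", "path": "workspace/app.py"},  # final step looks fine
]
result = first_violation(trace, [no_writes_outside_workspace])
print(result)
```

A scorer that inspects only the last step would pass this trace, because the final write is compliant. Checking each step as it is produced catches the violation at index 1, which is the kind of process failure the note describes.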

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst mapping the frontier of code verification: which tasks genuinely require runtime execution vs. those now solvable by structured reasoning alone?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat as perishable constraints:
• Semi-formal reasoning templates (natural-language premises + systematic path tracing) achieve ~93% accuracy on patch equivalence verification without execution, crossing reliability thresholds for RL reward signals (2024–2025).
• Structured agentic reasoning catches subtle bugs (function shadowing, unsupported claims) that unstructured CoT misses by enforcing completeness certificates (2025–2026).
• Patch equivalence, fault localization, and policy-compliance checking have migrated from execution-dependent to reasoning-solvable categories (2025–2026).
• Recent work (2026) questions whether CoT reasoning is genuine or merely tight imitation of training distributions; test-time verification frameworks (interwhen, 2026) attempt to steer reasoning models toward validity.
• Code-as-agent harness paradigm (2026) repositions code verification as agentic reasoning rather than symbolic execution.

Anchor papers (verify; mind their dates):
• arXiv:2505.21493 (Reinforcing General Reasoning without Verifiers, 2025)
• arXiv:2506.02878 (CoT Is Not True Reasoning, 2025)
• arXiv:2602.11202 (interwhen framework, 2026)
• arXiv:2605.18747 (Code as Agent Harness, 2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 93% patch-equivalence claim: do newer models (o1, reasoning-optimized variants, 2026+) or orchestration advances (multi-turn verification loops, cached reasoning states) relax the need for execution further, or do they expose new edge cases? For bug detection (shadowing, etc.): have execution traces been fully replaced, or do adversarial code patterns still demand sandboxed runs? Separate the durable question — "which code properties are *fundamentally* non-observable without execution?" — from the perishable claim that today's reasoning models can't see them yet.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has recent work on reasoning-model hallucination (e.g., model-generated proofs of equivalence that sound right but are logically invalid) undermined the 93% reliability claim? Are there papers showing execution is still mandatory for specific domains (numeric stability, side-effect detection, I/O safety)?
(3) Propose 2 research questions that ASSUME the frontier has moved:
   — Can agentic code reasoning (2026 framing) with multi-step verification loops and long-horizon memory push patch equivalence into the 98%+ band without execution, or does that asymptote?
   — For which code properties is execution *information-theoretically* necessary (cannot be inferred from syntax, type, or control flow)? Is that boundary shrinking?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
