INQUIRING LINE

When machines can check each other's work automatically, AI can loop through thousands of attempts at runtime and discover genuinely new things.

How does machine feedback enable discovery at test time?

This line explores how automated evaluation signals (machines checking machines, rather than humans labeling answers) let AI systems search their way to genuinely new results while they're running. The corpus converges on a single mechanism: discovery becomes feasible when verification is cheaper than generation. AlphaEvolve is the clearest case: automated evaluators sustain an evolutionary loop long enough to produce real artifacts (faster algorithms, better hardware layouts, improved training methods), because cheap, objective scoring lets a system generate many candidates and keep only what verifiably works [Can machine feedback sustain discovery at test time?]. The discovery isn't in any single generation; it's in being able to run the generate-check loop thousands of times without a human in the seat.
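The generate-check loop described above can be sketched in a few lines. This is a toy illustration, not AlphaEvolve's implementation: the `evaluate` function, the hidden `TARGET`, and the single-position `mutate` operator are all assumptions made up for the example. The point is only that a cheap, objective score lets the loop run thousands of times unattended.

```python
import random

# Hypothetical hidden optimum; a real evaluator would run tests or benchmarks.
TARGET = [3, 1, 4, 1, 5, 9, 2, 6]

def evaluate(candidate: list) -> int:
    """Toy automated evaluator: number of positions that match TARGET."""
    return sum(1 for a, b in zip(candidate, TARGET) if a == b)

def mutate(candidate: list, rng: random.Random) -> list:
    """Toy generator: copy the parent and overwrite one random position."""
    child = list(candidate)
    child[rng.randrange(len(child))] = rng.randrange(10)
    return child

def generate_check_loop(iterations: int = 5000, seed: int = 0):
    rng = random.Random(seed)
    best = [0] * len(TARGET)
    best_score = evaluate(best)
    for _ in range(iterations):
        child = mutate(best, rng)
        score = evaluate(child)
        # Keep a candidate only if the evaluator says it is at least as good.
        if score >= best_score:
            best, best_score = child, score
    return best, best_score

best, score = generate_check_loop()
print(best, score)
```

Generation here is deliberately dumb; all of the direction comes from the evaluator, which is why its cost and reliability set the ceiling on what the loop can find.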

That reframes test-time compute as a tunable budget. Agentic deep-research systems show a scaling law where adding search iterations improves answers along the same monotonic-but-diminishing curve as adding reasoning tokens, so search becomes a second inference-compute axis to spend on [Does search budget scale like reasoning tokens for answer quality?]. But spending compute only pays off if something steers it. That something is the feedback signal, and the corpus's most interesting move is showing how rich signals can be manufactured for free. Tree search is the workhorse: MCTS outcomes plus critic models produce dense rewards equivalent to human labels, replacing the annotation oracle RLHF needs [Can tree search replace human feedback in LLM training?], and the *structure* of the tree does double duty. Random expansion depth yields supervision at multiple granularities automatically, with coarse strategy signals near the root and fine detail at the leaves, and no labeling effort [Does tree depth automatically produce supervision at multiple granularities?].
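One way to see how tree structure alone yields signals at several granularities is the sketch below. It is an illustrative assumption, not the Tree-GRPO or AlphaLLM algorithm: the `Node` class, the step labels, and the sibling-relative advantage rule are invented for the example. Only leaves carry an outcome reward; every branch point then compares its children by their mean leaf reward, and the depth of the branch point determines how coarse or fine the resulting signal is.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional

@dataclass
class Node:
    step: str                       # hypothetical label for the action taken
    reward: Optional[float] = None  # outcome reward, set on leaves only
    children: List["Node"] = field(default_factory=list)

def leaf_rewards(node: Node) -> List[float]:
    """Collect the outcome rewards of all leaves under this node."""
    if not node.children:
        return [node.reward]
    return [r for child in node.children for r in leaf_rewards(child)]

def branch_signals(node: Node, depth: int = 0):
    """Return (depth, step, advantage) for each child of every branch point."""
    signals = []
    if len(node.children) > 1:
        values = [mean(leaf_rewards(c)) for c in node.children]
        baseline = mean(values)
        for child, value in zip(node.children, values):
            signals.append((depth, child.step, value - baseline))
    for child in node.children:
        signals.extend(branch_signals(child, depth + 1))
    return signals

tree = Node("root", children=[
    Node("plan: search first", children=[
        Node("query: precise", reward=1.0),
        Node("query: vague", reward=0.0),
    ]),
    Node("plan: answer from memory", reward=0.0),
])

for depth, step, advantage in branch_signals(tree):
    print(depth, step, advantage)
```

The depth-0 entries compare whole strategies, while the depth-1 entries compare individual steps inside one strategy; no labels beyond the leaf outcomes were needed.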

The recurring lesson is that feedback on the *process* beats feedback on the final answer. Supervising intermediate retrieval steps substantially outperforms rewarding only the final output in agentic RAG [Does supervising retrieval steps outperform final answer rewards?], and clever systems mine process signal from places you wouldn't expect: LongTraceRL extracts reasoning quality from the hard distractors a search agent reads but doesn't cite, and applies rubric rewards only to correct answers, which structurally blocks the model from fabricating its own reward [Can search agent behavior yield reliable process rewards for reasoning?]. Even confidence is a usable signal: models that commit early and rationalize show measurably worse reasoning, and rewarding gradual confidence growth improves accuracy by 42 percentage points on Countdown with no process labels at all [Can confidence trajectories reveal when reasoning goes wrong?]. The deeper claim across these is that interaction itself is a feedback firehose: every agent action produces a next-state signal (a tool result, an error, a GUI change) that can train behavior directly, unifying learning across tasks under one loop [Can agent deployment itself generate training signals automatically?].
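To make the confidence idea concrete, here is a minimal sketch of a reward that favours gradual confidence growth over early commitment. The scoring rule is a hypothetical one chosen for illustration, not the formula from the cited work: it assumes a per-step confidence estimate is available and penalises trajectories where most of the total rise happens in a single jump.

```python
def gradual_growth_reward(confidences, correct):
    """Hypothetical reward in [0, 1].

    High when the answer is correct and confidence rose in small steps;
    low when a single jump accounts for most of the rise.
    """
    if not correct or len(confidences) < 2:
        return 0.0
    total_rise = confidences[-1] - confidences[0]
    if total_rise <= 0:
        return 0.0
    largest_jump = max(b - a for a, b in zip(confidences, confidences[1:]))
    # Clamp at zero: a non-monotone trajectory can have a jump above the net rise.
    return max(0.0, 1.0 - largest_jump / total_rise)

early_commit = [0.20, 0.90, 0.90, 0.95]  # locks in at step 1, then rationalizes
gradual = [0.20, 0.45, 0.70, 0.95]       # accumulates evidence step by step

print(gradual_growth_reward(early_commit, correct=True))
print(gradual_growth_reward(gradual, correct=True))
```

Both trajectories end at the same confidence and the same correct answer, so an outcome-only reward could not tell them apart; the trajectory shape is the extra signal.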

Push this far enough and the system starts improving its own search machinery. A bilevel autoresearch loop read its inner loop's code, found bottlenecks, and wrote new optimization mechanisms at runtime, discovering bandit and combinatorial methods that broke its old deterministic patterns and delivered a 5x gain on GPT pretraining [Can an AI system improve its own search methods automatically?]. That's discovery feeding back into the discovery process itself, which is exactly where the failure modes get sharp. Self-generated feedback can poison the well: bidirectional RAG only safely grows its own corpus because write-back is gated by entailment, attribution, and novelty checks that keep hallucinations from contaminating future retrievals [Can RAG systems safely learn from their own generated answers?]. And the evaluator quality ceiling is real: agentic evaluation with evidence collection cut judge shift from 31% for a plain LLM judge to 0.27%, but its memory module cascaded errors, a reminder that the verifier is now load-bearing infrastructure [Can agents evaluate AI outputs more reliably than language models?]. The thing you didn't know you wanted to know: machine feedback doesn't just speed up discovery, it changes what counts as discoverable. Once a result can be cheaply verified, finding it becomes a compute problem, and closed-loop automated review is already being proposed as the publication venue for research no human author wrote [Can automated review loops handle AI-generated research at scale?].
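The gated write-back described above can be sketched as three checks that must all pass before a generated answer joins the corpus. Everything here is a stand-in assumed for illustration: token overlap substitutes for a real entailment model, Jaccard similarity substitutes for a novelty detector, and the thresholds are arbitrary.

```python
def tokens(text):
    return set(text.lower().split())

def entailed(answer, sources, threshold=0.8):
    """Crude entailment stand-in: share of answer tokens found in one source."""
    a = tokens(answer)
    if not a:
        return False
    return any(len(a & tokens(s)) / len(a) >= threshold for s in sources)

def novel(answer, corpus, threshold=0.9):
    """Crude novelty stand-in: reject near-duplicates of existing documents."""
    a = tokens(answer)
    for doc in corpus:
        union = a | tokens(doc)
        if union and len(a & tokens(doc)) / len(union) >= threshold:
            return False
    return True

def write_back(answer, cited_sources, corpus):
    """Append the answer to the corpus only if every gate passes."""
    attributed = bool(cited_sources) and all(s in corpus for s in cited_sources)
    if attributed and entailed(answer, cited_sources) and novel(answer, corpus):
        corpus.append(answer)
        return True
    return False

corpus = ["the evaluator scores each candidate program automatically"]
source = corpus[0]

# Supported by its cited source and not a near-duplicate: accepted.
print(write_back("the evaluator scores each candidate automatically", [source], corpus))
# Cites a real source but is not supported by it: rejected.
print(write_back("the evaluator was trained on human labels", [source], corpus))
# No citation at all: rejected.
print(write_back("the evaluator scores each candidate", [], corpus))
```

Because every accepted answer becomes retrievable evidence for later answers, a weak gate compounds: an error admitted once can be cited as support the next time around.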

Sources 12 notes

Can machine feedback sustain discovery at test time?

AlphaEvolve demonstrates that automated evaluators can sustain evolutionary loops long enough to produce real discoveries: faster algorithms, optimized hardware designs, and improved training methods. The key is that cheap, objective verification closes the generation-verification gap, which is what makes discovery computationally feasible.

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic but diminishing-returns curves for search iterations, matching reasoning-token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.

Can tree search replace human feedback in LLM training?

AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.

Does tree depth automatically produce supervision at multiple granularities?

Tree-GRPO's random expansion strategy naturally produces supervision at varying granularities—early branches provide coarse strategy-level signals while late branches provide fine-grained detail supervision. This multi-resolution signal emerges from sampling structure alone, without annotation effort or granularity scheduling.

Does supervising retrieval steps outperform final answer rewards?

Fine-grained feedback on intermediate retrieval steps significantly boosts agentic RAG performance compared to final-answer-only rewards. DPO trained with both positive and negative step feedback outperforms PPO and single-direction training by directly contrasting good and bad retrieval chains.

Can search agent behavior yield reliable process rewards for reasoning?

LongTraceRL mines entity-level reasoning signals from what search agents read but don't cite—the hardest distractors—and applies rubric rewards only to correct answers, structurally blocking reward fabrication while capturing intermediate reasoning quality.

Can confidence trajectories reveal when reasoning goes wrong?

Models that commit to answers early and then rationalize show measurably flawed reasoning. Rewarding gradual confidence growth via RL improves accuracy significantly (by 42 percentage points on Countdown) without needing process labels or external reward models.

Can agent deployment itself generate training signals automatically?

Every agent action produces a next-state signal (user reply, tool output, error, GUI change) that can train the policy directly. This universal signal source eliminates the need for separate training datasets across conversations, terminal tasks, SWE, and tool use.

Can an AI system improve its own search methods automatically?

An outer loop successfully read inner loop code, identified bottlenecks, and generated new Python mechanisms at runtime, discovering combinatorial optimization and bandit methods that broke the inner loop's deterministic patterns and improved performance on GPT pretraining by 5x.

Can RAG systems safely learn from their own generated answers?

Systems can add generated answers to their retrieval corpus when outputs pass entailment verification, source attribution checks, and novelty detection. This prevents hallucinations from polluting future retrievals while allowing genuine knowledge accumulation.

Can agents evaluate AI outputs more reliably than language models?

Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.

Can automated review loops handle AI-generated research at scale?

aiXiv demonstrates that iterative review-refine cycles with automated retrieval-augmented evaluation and prompt-injection defenses measurably enhance proposal and paper quality, addressing the structural gap where AI-generated research lacks appropriate publication venues.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well? (2.49 match)
Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning (2.43 match)
Bilevel Autoresearch: Meta-Autoresearching Itself (1.77 match)
UR2: Unify RAG and Reasoning through Reinforcement Learning (1.69 match)
OMNI-SIMPLEMEM: Autoresearch-Guided Discovery of Lifelong Multimodal Agent Memory (1.67 match)
What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity (1.66 match)
Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs (1.66 match)
Retrieval-augmented reasoning with lean language models (1.65 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about machine feedback and test-time discovery. The question: How does automated evaluation (machines checking machines) enable AI systems to discover genuinely new results during inference?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat each as perishable.
• Discovery scales with test-time compute via search iteration, following a monotonic-but-diminishing curve similar to reasoning tokens (~2025–2026).
• Tree search + critic models can replace human annotation by generating dense process-level rewards at multiple granularities—coarse strategy at tree root, fine detail at leaves (~2026).
• Process supervision (rewarding intermediate steps, e.g., retrieval correctness) substantially outperforms outcome-only reward in agentic RAG (~2025–2026).
• Self-improving loops risk feedback poisoning; bidirectional RAG mitigates it via entailment/novelty gates, but verifier memory cascades errors—the evaluator is now load-bearing (~2026).
• Bilevel autoresearch discovered new optimization mechanisms at runtime, yielding 5× gains; self-generated feedback feeding back into search machinery itself is real but fragile (~2026).

Anchor papers (verify; mind their dates):
• arXiv:2506.13131 (AlphaEvolve, 2025-06) — automated evaluators sustaining evolutionary loops for artifact discovery.
• arXiv:2509.21240 (Tree Search for LLM Agent RL, 2025-09) — MCTS + critic rewards replacing annotation oracles.
• arXiv:2603.23420 (Bilevel Autoresearch, 2026-03) — meta-optimization via self-modifying search code.
• arXiv:2605.24396 (Premature Confidence, 2026-05) — confidence dynamics as annotation-free reasoning signal.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, check whether newer models, methods (e.g., new RLHF variants, synthetic-data pipelines), inference harnesses (batch reranking, speculative decoding, better tree implementations), or evaluation protocols have since RELAXED or OVERTURNED it. Separate the durable question (likely still open: *how much discovery is achievable without human labels?*) from perishable limitations (e.g., *verifier cascade error is unsolvable* — cite what fixed it). Be plain: does the constraint still hold?
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months (post-Sept 2026). What pushes back on the "process >> outcome" thesis, or shows discovery plateaus regardless of feedback richness?
(3) Propose 2 research questions that assume the regime may have moved: one on the *cost* frontier (verification still cheaper? at what scale?), one on *discovery novelty* (are discovered artifacts reproducible, or artifacts of overfitting to the evaluator?).

Cite arXiv IDs; flag anything you cannot ground in a real paper.
