What would whole-system AGI evaluation look like in practice?
This explores what it would actually mean to evaluate AGI as a whole system — hardware, software, environment, and the humans steering it — rather than scoring a model in isolation on a benchmark.
The corpus has a surprisingly coherent answer spread across several notes: stop measuring the model alone, because the thing you call 'intelligence' was never living in the model alone. The starting move is a critique. Most influential AGI formalisms quietly assume intelligence is a property of software you can isolate and benchmark, a stance that "Does software intelligence exist independent of hardware and environment?" calls computational dualism, the AI-era echo of Descartes splitting mind from body. If success actually depends on software, hardware, and environment working together, then any test that strips two of those layers away is measuring a fiction.
So what replaces it? The most concrete proposal is to change the unit of analysis entirely. "Should we evaluate deployed agents as whole environments instead?" argues that the correct thing to evaluate is the coupled human–agent–environment system, and backs it with a single-investigator case study of 75,671 telemetry records showing that real capability gains came from accumulated context and reusable procedures that only exist across sessions, under human direction. In other words, the interesting intelligence lived in the relationship over time, not in any single response, and episode-level scoring is structurally blind to it. That reframes evaluation from 'grade the answer' to 'instrument the whole working system as it operates.'
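To make the contrast between episode-level and cross-session measurement concrete, here is a minimal sketch. The record fields (`session_id`, `task_ok`, `reused_procedure`) and the toy data are assumptions invented for illustration; they are not the schema or the data of the case study.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TelemetryRecord:
    # All three fields are hypothetical, chosen only for this sketch.
    session_id: int         # which working session the step belongs to
    task_ok: bool           # whether the step succeeded
    reused_procedure: bool  # whether it reused a procedure from an earlier session


def episode_level_score(records):
    """Fraction of successful steps, ignoring session order entirely."""
    return sum(r.task_ok for r in records) / len(records)


def cross_session_trend(records):
    """Per-session (id, success rate, procedure-reuse rate), in session order."""
    by_session = defaultdict(list)
    for r in records:
        by_session[r.session_id].append(r)
    trend = []
    for sid in sorted(by_session):
        rs = by_session[sid]
        success = sum(r.task_ok for r in rs) / len(rs)
        reuse = sum(r.reused_procedure for r in rs) / len(rs)
        trend.append((sid, success, reuse))
    return trend


# Toy data: success and reuse both climb from session 1 to session 3.
records = [
    TelemetryRecord(1, False, False),
    TelemetryRecord(1, True, False),
    TelemetryRecord(2, True, True),
    TelemetryRecord(2, True, False),
    TelemetryRecord(3, True, True),
    TelemetryRecord(3, True, True),
]

print(episode_level_score(records))
print(cross_session_trend(records))
```

The single aggregate number hides the trajectory: only the per-session view shows success rising alongside procedure reuse, which is the kind of variable episode-level scoring cannot see.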
If you're instrumenting a live system, who does the grading? Here the corpus suggests the judge itself has to become a system. "Can agents evaluate AI outputs more reliably than language models?" describes an eight-module agentic evaluator that collects evidence dynamically and cut 'judge shift' from 31% (a plain LLM acting as judge) to 0.27%. Its memory module also cascaded errors, though, which is the whole-system lesson in miniature: evaluators inherit the same fragility as the things they evaluate, and need error isolation built in. Governance points the same direction. "Can governance rules embedded in runtime memory actually protect autonomous agents?" found that safety rules baked into the agent's runtime memory (889 governance events over 96 active days) worked because the agent actually consulted them mid-decision. That is evaluation and oversight woven into the operating environment, not bolted on afterward.
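One way to picture "error isolation built in" is a pipeline in which each evaluator module runs behind a guard, so a failing module is skipped and logged instead of taking the verdict down with it. This is a sketch under stated assumptions: the module names (`locate`, `memory`, `check`) and the evidence format are hypothetical and do not describe the eight-module evaluator from the note.

```python
from typing import Callable, Dict, List, Tuple

Evidence = Dict[str, object]
Module = Callable[[Evidence], Evidence]


def run_isolated(
    modules: List[Tuple[str, Module]], task: Evidence
) -> Tuple[Evidence, List[str]]:
    """Run modules in order; a module that raises is skipped and recorded."""
    evidence = dict(task)
    failed: List[str] = []
    for name, module in modules:
        try:
            # Pass a shallow copy so a module cannot rebind shared keys directly.
            update = module(dict(evidence))
        except Exception:
            failed.append(name)
            continue
        evidence.update(update)
    return evidence, failed


# Hypothetical modules for illustration only.
def locate(e: Evidence) -> Evidence:
    return {"files": ["main.py"]}


def memory(e: Evidence) -> Evidence:
    raise RuntimeError("stale memory entry")


def check(e: Evidence) -> Evidence:
    return {"verdict": bool(e.get("files"))}


evidence, failed = run_isolated(
    [("locate", locate), ("memory", memory), ("check", check)],
    {"requirement": "writes an output file"},
)
print(evidence["verdict"], failed)
```

Note the limit of this design: it only contains failures that raise. A module that returns confidently wrong evidence would still propagate, so a fuller version would also need to validate each module's output before merging it.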
The deepest practical shift is what you measure once outputs stop being trustworthy signals. "Can we measure reasoning quality beyond output plausibility?" offers three testable structural properties (traceability, counterfactual adaptability, and motif compositionality) for telling genuine causal reasoning apart from fluent mimicry. "Does logical validity actually drive chain-of-thought gains?" sharpens why this matters: on BIG-Bench Hard, chain-of-thought exemplars that are logically invalid perform nearly as well as valid ones, meaning a model can learn the *form* of reasoning without the substance, and an output-only benchmark would never notice. Add "Why does AI output change with every prompt and context?", which argues that outputs are inherently mutable across prompt, sampling, and audience, and the case closes: scoring a frozen output is scoring one roll of the dice.
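The "one roll of the dice" point can be illustrated by scoring a distribution of outputs instead of a single sample. In this sketch `sample_output` is a hypothetical stand-in for a stochastic model call (here just a seeded random choice), and the 3-in-4 success rate is an arbitrary assumption, not a measured figure.

```python
import random
import statistics


def sample_output(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a stochastic model call."""
    return rng.choice(["correct", "correct", "correct", "wrong"])


def score(output: str) -> float:
    """Toy grader: 1.0 for a correct output, 0.0 otherwise."""
    return 1.0 if output == "correct" else 0.0


def score_distribution(prompt: str, n: int, seed: int):
    """Mean and population standard deviation of scores over n samples."""
    rng = random.Random(seed)
    scores = [score(sample_output(prompt, rng)) for _ in range(n)]
    return statistics.mean(scores), statistics.pstdev(scores)


# A single sample yields only 0.0 or 1.0; the distribution shows the spread.
mean_score, spread = score_distribution("example prompt", n=200, seed=0)
print(f"mean={mean_score:.2f} spread={spread:.2f}")
```

Reporting the mean together with the spread makes the mutability visible. A fuller treatment would also vary prompt wording and audience, which this sketch leaves out.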
Put together, whole-system AGI evaluation in practice looks less like an exam and more like field instrumentation of an ongoing partnership: define the unit as human + agent + environment over time; measure structural reasoning properties and cross-session learning rather than one-shot correctness; make the evaluator a fault-isolated system in its own right; and embed oversight where decisions are actually made. The takeaway: today's benchmarks feel hollow not because they're too easy but because they're measuring the wrong object.
Sources (7 notes)
Influential AGI formalisms isolate intelligence in software independently of hardware and environment, but success depends on all three layers together. This mirrors Cartesian dualism—a fundamental error that makes isolated benchmarks inadequate measures of AGI.
A single-investigator case study with 75,671 telemetry records shows that capacity gains come from accumulated context and reusable procedures that only exist across sessions with human direction. Model and episode-level evaluation cannot measure these cross-session variables.
Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.
A persistent agent recorded 889 governance events across 96 active days, with safeguards encoded directly into the memory layer the agent consulted during operation. Runtime-resident governance proved more effective than external policies because the agent actually accessed it during decision-making.
Research identifies traceability, counterfactual adaptability, and motif compositionality as testable measures of human-like reasoning. These structural properties reveal whether an agent genuinely reasons causally or merely mimics coherent speech.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
AI outputs exhibit essential mutability—they vary with sampling, prompt wording, and audience interpretation. This is not a defect but a defining feature of tokens as media, making them fundamentally different from fixed commodities and resistant to traditional quality assurance.