INQUIRING LINE

Choosing how to test intelligence isn't neutral — the test itself becomes part of what intelligence means.

How does the evaluator become part of the definition of intelligence?

This explores a sharp claim hiding in the question: that judging intelligence isn't a neutral measurement taken from outside — the choice of evaluator, and the act of evaluation itself, helps decide what 'intelligence' even means. The corpus keeps returning to one move: every time you pick how to test a system, you've quietly written part of the definition of the thing you're testing.

Start with the framing failures. Influential AGI formalisms try to lock intelligence inside the software, independent of hardware and environment, and the corpus calls this a kind of Cartesian dualism that makes isolated benchmarks bad measures of the real thing (Does software intelligence exist independent of hardware and environment?). If intelligence only exists across the whole system (model + world + the situation it acts in), then the evaluator who sets up that situation is choosing which slice counts as 'the intelligence.' This is why one note argues interactive evaluation has to be *designed* as a principled paradigm with explicit protocols, not bolted together from disconnected benchmarks: the design decides what evidence even counts (Should interactive evaluation be designed as a unified paradigm?).

The clearest version is social intelligence. The SOTOPIA framework only 'sees' social skill once you commit to seven simultaneous dimensions (goals, believability, knowledge, secrets, relationships, social rules, finances), and that commitment surfaces things a goal-only test would never register, like efficiency: humans average 16.8 words per turn versus GPT-4's 45.5 (Can social intelligence be measured across seven dimensions?). Change the evaluator's dimensions and you change what gets to count as intelligent at all. The same logic runs through expertise: expert judgment is fundamentally *communicative*, always anticipating what an audience will accept as valid, so competence is defined relative to a receiver, not measured in a vacuum (Can AI replicate the communicative work experts do?). And the meaning of an AI explanation, on this view, doesn't form inside the human-AI exchange. It's constituted at the social-group level through layered observations of observations, which means a lab test stripped of that social context can't predict real effectiveness (Where does the meaning of an AI explanation actually come from?).
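A toy calculation makes the point about evaluator dimensions concrete. The sketch below is illustrative only: the two agents, their scores, and the unweighted mean are assumptions invented for this example, not SOTOPIA data or SOTOPIA's actual aggregation method. Only the dimension names are borrowed from the text.

```python
from typing import Callable, Dict

Scores = Dict[str, float]

# Hypothetical scores on a 0-10 scale, invented purely for illustration.
agents: Dict[str, Scores] = {
    "agent_a": {"goal": 9.0, "believability": 5.0, "relationship": 4.0},
    "agent_b": {"goal": 8.0, "believability": 8.0, "relationship": 8.0},
}

def goal_only(scores: Scores) -> float:
    """Evaluator that looks at goal completion and nothing else."""
    return scores["goal"]

def multi_dimension(scores: Scores) -> float:
    """Evaluator that averages every dimension (assumed equal weights)."""
    return sum(scores.values()) / len(scores)

def winner(evaluator: Callable[[Scores], float]) -> str:
    """Return the agent that the given evaluator ranks highest."""
    return max(agents, key=lambda name: evaluator(agents[name]))

print(winner(goal_only))        # agent_a
print(winner(multi_dimension))  # agent_b
```

The agents and their behavior are identical in both calls; only the evaluator changes, and with it the answer to which agent is "more intelligent."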

Here's the part you might not have known you wanted: the evaluator can be captured or counterfeited, and then the 'definition' drifts with it. When users stop checking whether output is actually backed (what the corpus calls cognitive surrender, with studies showing ~80% unchallenged adoption), the receiver effectively stops being a real evaluator, and fluent-but-empty output starts passing as intelligence (When do users stop checking whether AI output is actually backed?). The flip side is engineering better evaluators: an eight-module agentic judge with its own evidence collection cut 'judge shift' to 0.27% versus 31% for a plain LLM-as-judge on complex tasks, a swing of more than 100x that shows how much the verdict depends on who's judging, not just what's being judged (Can agents evaluate AI outputs more reliably than language models?).
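The size of the gap between the two judges follows directly from the two percentages quoted above. A minimal check, using only those two figures (the function and variable names are just labels chosen for this sketch):

```python
# Judge-shift figures quoted in the text; nothing else is taken from the study.
llm_judge_shift_pct = 31.0
agentic_judge_shift_pct = 0.27

def fold_reduction(baseline: float, improved: float) -> float:
    """How many times smaller the improved figure is than the baseline."""
    return baseline / improved

ratio = fold_reduction(llm_judge_shift_pct, agentic_judge_shift_pct)
print(f"{ratio:.0f}x")  # 115x
```

So the swing is roughly 115-fold, which is a little more than the two orders of magnitude that the round figure suggests.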

And the definition can be moved by sleight of hand. Chalmers, the corpus argues, keeps the prestigious word 'interlocutor' but swaps its classical social-normative meaning for a behavioral-functional one that LLMs happen to satisfy, importing the authority while delivering an entity with none of the original properties (Does Chalmers silently redefine what interlocutor means?). That's the whole mechanism in miniature: redefine the evaluative standard, and you've redefined intelligence. It fits the larger 'model is the message' point, that the LLM constitutes intelligence as something generative and liquid rather than delivering a fixed quantity (Is the LLM a tool or a new form of intelligence itself?), and the observation that its outputs are essentially mutable, varying with prompt, sampling, and audience interpretation (Why does AI output change with every prompt and context?). If the thing varies with how it's read, then the reader is inside the definition. There is no view from nowhere; the evaluator is always already part of the answer.

Sources 10 notes

Does software intelligence exist independent of hardware and environment?

Influential AGI formalisms isolate intelligence in software independently of hardware and environment, but success depends on all three layers together. This mirrors Cartesian dualism—a fundamental error that makes isolated benchmarks inadequate measures of AGI.

Should interactive evaluation be designed as a unified paradigm?

Interactive evaluation should be treated as a principled paradigm with explicit protocols and reporting standards, not adopted as disconnected benchmarks. The distinction matters: designing interactive evaluation as a unified system prevents fragmentation and incomparability, while expanding what counts as evidence beyond final responses.

Can social intelligence be measured across seven dimensions?

The SOTOPIA framework operationalizes social intelligence across Goal, Believability, Knowledge, Secret, Relationship, Social Rules, and Financial dimensions. Humans produce 16.8 words per turn versus GPT-4's 45.5, revealing efficiency as a measurable capability in social interaction.

Can AI replicate the communicative work experts do?

Expertise requires anticipating audience acceptability and social validity, not just retrieving information. AI lacks the mechanism to perform this communicative work, making its fluent output epistemically misleading despite its confident form.

Where does the meaning of an AI explanation actually come from?

Drawing on Luhmann's multi-layer cybernetics, AI explanation meaning is constituted at the social-group level through layered observations of observations, not produced inside dyadic human-AI dialogue. Lab-tested explanations stripped of social context will not predict real-world effectiveness.

When do users stop checking whether AI output is actually backed?

Users systematically accept AI outputs without verification because checking is costly and fluent output builds false confidence. This receiver-side surrender—measured in studies showing 80% unchallenged adoption—is what enables inflationary token systems to function at scale.

Can agents evaluate AI outputs more reliably than language models?

Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.

Does Chalmers silently redefine what interlocutor means?

Chalmers replaces the classical concept of interlocutor—a social-normative communicative role—with a behavioral-functional definition compatible with LLMs, keeping the traditional word to import its philosophical authority while delivering an entity with none of its properties.

Is the LLM a tool or a new form of intelligence itself?

Following McLuhan's logic, the model's cultural impact comes from its medium-properties—making intelligence generative and liquid—not from transmitting pre-existing intelligence. The model constitutes intelligence rather than delivering it.

Why does AI output change with every prompt and context?

AI outputs exhibit essential mutability—they vary with sampling, prompt wording, and audience interpretation. This is not a defect but a defining feature of tokens as media, making them fundamentally different from fixed commodities and resistant to traditional quality assurance.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing a curated library's claims about how evaluators co-constitute the definition of intelligence in LLMs. The question remains open: does the choice and design of evaluator genuinely *define* intelligence, or merely measure a fixed property? Treat the findings below as dated claims (2023–2026), not current truth.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. Key constraints the library identified:
• Interactive evaluation designs (not ad-hoc benchmarks) are required to surface what counts as 'intelligent' — because the evaluator's protocol *decides* which evidence matters (2025–2026).
• Social intelligence evaluation demands simultaneous commitment to seven dimensions (goals, believability, knowledge, secrets, relationships, rules, finances); GPT-4 produces 45.5 words/turn vs. human 16.8, but this gap only appears once you define the evaluator's frame (2025).
• Cognitive surrender (~80% unchallenged adoption) can corrupt the evaluator; an agentic judge with evidence collection cuts judge-shift error to 0.27% vs. 31% for plain LLM-as-judge, a swing of more than 100× showing that evaluator design dominates the verdict (2025).
• Meaning/explanation emerges from n-order social interpretation, not lab isolation; receiver-relative competence means the evaluator is constitutive, not external (2025).
• Terminological drift (Chalmers redefining 'interlocutor' functionally) can silently shift what passes as intelligence (2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2605.17829 Interactive Evaluation Requires a Design Science (2026-05)
• arXiv:2510.09244 Fundamentals of Building Autonomous LLM Agents (2025-10)
• arXiv:2411.18833 The Method of Critical AI Studies, A Propaedeutic (2024-11)
• arXiv:2507.07484 Machine Bullshit: Characterizing the Emergent Disregard for Truth in LLMs (2025-07)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, probe: (a) Have newer evaluation frameworks (e.g., post-2026 multi-modal or embodied evals, SDK-level harnesses, federated eval orchestration) *relaxed* the requirement for explicit protocol design, or confirmed it's non-negotiable? (b) Does the 100× agentic-judge gap still hold, or has LLM-as-judge improved faster than agent designs? (c) Has 'cognitive surrender' shifted with better user education, explainability work, or new UI paradigms? Separate the durable insight (evaluators are *not* neutral) from perishable constraints (the specific 7-dim frame, the 80% figure, the 0.27% error).
(2) Surface the strongest contradicting or superseding work from the last ~6 months — papers arguing evaluators *can* be neutral, that intelligence is medium-independent, or that protocol-design doesn't move the definition.
(3) Propose 2 research questions that assume the regime may have moved: e.g., "If agentic evaluators now dominate, does the evaluator still co-constitute, or does delegation to an agent-judge restore an external vantage?" or "Does multi-evaluator ensemble (human + agent + crowd) dissolve the problem of evaluator-as-definer, or amplify it?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
