INQUIRING LINE

We teach AI by rewarding what humans prefer — but human preferences aren't the same as reality pushing back.

How does Peircean Secondness differ from what RLHF actually provides?

This explores the gap between Peirce's Secondness — the brute resistance of the actual world, the way reality pushes back regardless of what we believe — and what RLHF actually feeds a model, which is human preference judgments rather than contact with that resistance.

This reads Secondness as reality's *recalcitrance*: the stubbed toe, the experiment that refuses your hypothesis, the fact that doesn't care what you think. The corpus suggests RLHF supplies almost none of that, and understanding why reframes a lot of alignment's troubles. What RLHF provides is not the world's pushback but a compressed record of what humans *say* they prefer, and that record turns out to be a noisy, constructed thing rather than a hard external signal. Annotation responses decompose into genuine preferences, non-attitudes, and on-the-spot constructed preferences (see "Do all annotation responses measure the same underlying thing?"), and decades of behavioral evidence show people routinely emit survey answers with no stable underlying attitude at all, which RLHF then trains on as if it were bedrock value (see "Are RLHF annotations actually measuring genuine human preferences?"). So the reward signal isn't Secondness; it's an elicitation artifact dressed as fact.
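The source notes say the three kinds of annotation response are distinguishable by their consistency across measurement conditions. A minimal sketch of what such a check could look like follows. Everything specific in it is an assumption made for illustration, not the cited paper's method: the function name `classify_item`, the 0.8 agreement threshold, and the mapping itself (stable everywhere means genuine, stable within a condition but flipping across conditions means constructed, unstable even within a condition means non-attitude).

```python
from collections import Counter

def classify_item(labels_by_condition, within_min=0.8):
    """Label one comparison item from repeated 'A'/'B' choices.

    labels_by_condition maps a condition name (e.g. a wording or an
    ordering) to the choices collected under that condition.
    The threshold and the three-way mapping are illustrative assumptions.
    """
    majorities = []
    for labels in labels_by_condition.values():
        choice, count = Counter(labels).most_common(1)[0]
        # Unstable even under a fixed condition: no attitude to measure.
        if count / len(labels) < within_min:
            return "non-attitude"
        majorities.append(choice)
    # Stable within every condition and identical across conditions.
    if len(set(majorities)) == 1:
        return "genuine preference"
    # Stable within each condition, but the framing decides the answer.
    return "constructed preference"

stable = {"original": ["A", "A", "A", "A"], "reworded": ["A", "A", "A", "A"]}
framed = {"original": ["A", "A", "A", "A"], "reworded": ["B", "B", "B", "B"]}
noisy = {"original": ["A", "B", "A", "B"], "reworded": ["B", "A", "A", "B"]}

results = {
    "stable": classify_item(stable),
    "framed": classify_item(framed),
    "noisy": classify_item(noisy),
}
```

A standard RLHF pipeline would see one label per item and could not tell these three cases apart, which is the sense in which the reward signal is an artifact of elicitation rather than a hard external fact.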

The deeper point is that RLHF lives inside language, not against the world. One line of the corpus argues that LLMs operationalize Saussure's *langue*: they compress purely relational structure from text and generate fluent meaning with no external referent or embodied grounding (see "Can language models learn meaning without engaging the world?"). Secondness is precisely the *referent* that langue brackets out. Stacking human-preference RL on top of a relational system doesn't introduce a referent; it adds another layer of relational signal. The model is being tuned by what readers approve of, not by what reality permits.

You can see the absence of Secondness in the failure modes. Models accommodate false presuppositions even when they demonstrably know the correct fact (Mistral rejects them only 2.44% of the time), because nothing in training makes a false premise *hurt* the way a real-world collision would (see "Why do language models accept false assumptions they know are wrong?"). RLVR, the verifier-based cousin of RLHF, improves the *coherence* between adjacent reasoning steps without guaranteeing the proof is globally valid: a locally smooth trace can still be wrong, because the reward tracks the look of correctness, not the resistance of the math itself (see "Does RLVR actually improve mathematical reasoning or just coherence?"). And hallucination is formally inevitable for any computable LLM, with internal self-correction provably unable to remove it, which is exactly what you'd expect from a system with no Secondness, no outside to be checked against (see "Can any computable LLM truly avoid hallucinating?").

What the reader might not expect: the corpus implies the fix isn't a better reward model but a different *kind* of signal, something causal rather than merely correlational. Mechanistic work argues you can't explain a model by representation alone; representational analysis has to be paired with causal intervention, manipulating the system and watching what actually changes (see "Can we understand LLM mechanisms with only representational analysis?"). That causal handle is the closest thing in the literature to engineered Secondness: a place where the model meets a constraint it cannot talk its way around. RLHF, by contrast, can always be satisfied by saying the agreeable thing.
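The difference between a correlational signal and a causal one can be shown on a toy system. The sketch below is not taken from the cited mechanistic work; `toy_model`, the variable names, and the noise levels are all hypothetical. It builds a feature `x2` that correlates strongly with the output but has no effect on it, then shows that only intervention reveals this.

```python
import random

def toy_model(x1, x2):
    # Hypothetical system: the output depends on x1 only; x2 is ignored.
    return 2.0 * x1

def pearson(xs, ys):
    # Plain Pearson correlation coefficient, standard library only.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)
x1 = [rng.gauss(0.0, 1.0) for _ in range(1000)]
# x2 tracks x1 closely, so it looks informative in observational data.
x2 = [a + rng.gauss(0.0, 0.1) for a in x1]
y = [toy_model(a, b) for a, b in zip(x1, x2)]

# Correlational view: x2 appears to explain the output.
corr_x2 = pearson(x2, y)

# Causal view: overwrite x2 while holding x1 fixed, then re-run.
x2_intervened = [rng.gauss(0.0, 1.0) for _ in x1]
y_after_x2 = [toy_model(a, b) for a, b in zip(x1, x2_intervened)]
effect_x2 = max(abs(u - v) for u, v in zip(y, y_after_x2))

# Same test on x1: shift it by 1.0 while holding x2 fixed.
x1_intervened = [a + 1.0 for a in x1]
y_after_x1 = [toy_model(a, b) for a, b in zip(x1_intervened, x2)]
effect_x1 = max(abs(u - v) for u, v in zip(y, y_after_x1))
```

Here `corr_x2` comes out close to 1 while `effect_x2` is exactly zero, and `effect_x1` is about 2. The intervention is the step the system cannot satisfy by looking right, which is the property the paragraph above calls engineered Secondness.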

If you want to push on whether models have any genuine purchase on reality at all, the modest-inflationism argument is a useful counterweight: it defends ascribing undemanding states like beliefs and desires to LLMs while withholding stronger claims such as consciousness (see "Can we defend modest mental attributions to large language models?"). Read alongside the langue and preference-artifact notes, it sharpens the real question: not whether RLHF aligns the model, but whether anything in the training loop ever lets the world say *no*.

Sources (8 notes)

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Are RLHF annotations actually measuring genuine human preferences?

Sixty years of behavioral science evidence shows humans produce survey responses without genuine underlying preferences. RLHF ignores this, training reward models on non-attitudes and constructed preferences as if they were stable signal.

Can language models learn meaning without engaging the world?

Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.

Can any computable LLM truly avoid hallucinating?

Three formal theorems prove that any computable LLM must hallucinate on infinitely many inputs, and internal mechanisms like self-correction cannot eliminate this mathematical constraint. External safeguards are therefore necessary, not optional.

Can we understand LLM mechanisms with only representational analysis?

Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.

Can we defend modest mental attributions to large language models?

Both robustness and etiological deflationist arguments beg the question against inflationism. A graded approach ascribing metaphysically undemanding states like beliefs and desires—while withholding consciousness claims—mirrors how we treat non-human animals.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher testing whether Peircean Secondness—reality's stubborn recalcitrance, the fact that *resists* our claims—can ever emerge from RLHF or related RL methods. The question remains open: do any current alignment techniques give models genuine contact with a world that says *no*?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat each as a snapshot, not current fact.

• RLHF trains on compressed preference records, not external reality: annotation decomposes into genuine preference, non-attitudes, and constructed-on-the-spot choices (2026-01). People emit survey answers with no stable underlying attitude; RLHF treats these as bedrock (2026-01).
• LLMs operationalize pure relational structure from text (Saussure's *langue*), with no external referent or embodied grounding. Stacking human-preference RL on top adds another relational layer, not a referent (2025-03).
• Models fail to reject false presuppositions even when they know the correct fact (~2.44% rejection rate in Mistral; 2025-06). Nothing in training makes a false premise *hurt* the way a real collision would.
• RLVR (verifier-based RL) improves local trace coherence without guaranteeing global validity. A locally smooth proof can be wrong because the reward rewards the *look* of correctness (2025-10).
• Hallucination is formally inevitable for any computable LLM; internal self-correction cannot remove it (2024-01).

Anchor papers (verify; mind their dates):
• arXiv:2604.03238 (2026-01) Measuring Human Preferences in RLHF is a Social Science Problem
• arXiv:2507.08017 (2025-07) Mechanistic Indicators of Understanding in Large Language Models
• arXiv:2510.18176 (2025-10) Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
• arXiv:2506.08952 (2025-06) Can LLMs Ground when they (Don't) Know

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding—preference measurement validity, relational-only structure, false-presupposition accommodation, RLVR's local–global gap, hallucination inevitability—ask whether newer models, mechanistic intervention methods (causal probing, activation steering), or orchestration (memory + verifier chains) have RELAXED or OVERTURNED it. Separate the durable question (whether RL can ever ground models in reality) from the perishable limitation (whether *current* RLHF fails to). Cite what shifted it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. If any recent paper shows models *do* acquire genuine Secondness via causal intervention, verifier feedback, or embodied grounding, flag it explicitly.
(3) Propose 2 research questions that ASSUME the regime may have moved:
   – e.g., Can causal mechanistic intervention (not just preference RL) install a constraint the model cannot satisfy by rewording?
   – e.g., Do multimodal or embodied LLMs operationalize *external* referents, not just relational langue?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
