INQUIRING LINE

The nuances AI discards as noise are often exactly what humans need to act well in a specific situation.

What fine-grained distinctions matter most for human situated action in categories?

This explores what the corpus calls 'situated action' — the idea that humans hold onto subtle, context-dependent distinctions inside a category that machines tend to flatten away, and asks which of those distinctions actually do the work.

Why do humans keep fine-grained shadings inside a category when LLMs tend to throw them out, and what do those kept distinctions buy us? The cleanest anchor in the collection is a Rate-Distortion analysis showing that LLMs and humans make opposite trades: models aggressively compress concepts to capture the broad shape of a category, while humans deliberately *under*-compress, preserving nuance that doesn't improve efficiency but does enable acting well in a specific situation (see "Do LLMs compress concepts more aggressively than humans do?"). This reframes the question of which distinctions matter most: they are precisely the ones a pure compressor would discard as noise, because their payoff is contextual rather than statistical.
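The trade described above can be made concrete with a toy calculation. The sketch below is illustrative only and is not the cited paper's method: the four 2-D points, the two partitions, and the weight `beta` are all assumptions invented for the example. It uses the entropy of the cluster assignment as a stand-in for rate and the mean squared distance to the cluster centroid as a stand-in for distortion.

```python
import math
from collections import Counter

def rate_bits(labels):
    """Entropy of the cluster assignment in bits (a simple proxy for rate)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def distortion(points, labels):
    """Mean squared distance from each point to its cluster centroid."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    total = 0.0
    for members in groups.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dim))
    return total / len(points)

# Hypothetical items: two broad categories, each with one fine internal distinction.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
partitions = {
    "coarse": ["a", "a", "b", "b"],      # keeps only the broad category
    "fine": ["a1", "a2", "b1", "b2"],    # also keeps the within-category nuance
}

def cost(labels, beta):
    """Rate plus beta-weighted distortion; beta says how much lost detail hurts."""
    return rate_bits(labels) + beta * distortion(points, labels)

def best_partition(beta):
    """Name of the partition with the lowest cost at this beta."""
    return min(partitions, key=lambda name: cost(partitions[name], beta))

print(rate_bits(partitions["coarse"]), distortion(points, partitions["coarse"]))  # 1.0 0.25
print(rate_bits(partitions["fine"]), distortion(points, partitions["fine"]))      # 2.0 0.0
print(best_partition(1.0))  # coarse
print(best_partition(8.0))  # fine
```

In this toy setup the fine partition costs twice the bits and only wins once lost detail is weighted heavily (here, `beta` above 4), which is one way to read the claim that the preserved nuance has a poor compression ratio but a high situational payoff.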

What does that nuance look like concretely? Several notes suggest the load-bearing distinction is rarely the category label itself but *how strongly something belongs and in what context*. Work on projective content finds that meaning is gradient, not binary: the very same expression projects more or less depending on what question is currently under discussion, not on any fixed property of the expression (see "Does projection strength vary by context or by word type?"). That is situated action in miniature: the distinction that matters shifts with the conversational moment. The same lesson shows up in how human judgments decompose. Annotation responses aren't one signal but three (genuine preferences, non-attitudes, and preferences constructed on the spot), and treating them as interchangeable contaminates reward-model training and everything downstream of it (see "Do all annotation responses measure the same underlying thing?"). The fine-grained distinction there is *why* a person gave an answer, which a coarse category collapses entirely.

There's a structural reason machines lose this. Embedding spaces organize themselves coarse-to-fine: the leading eigenvectors of the embedding Gram matrix split the big taxonomic branches first and only resolve fine sub-branches later, tracking the WordNet hypernym tree level by level (see "Do embedding eigenvectors organize taxonomy from coarse to fine?"). And the semantic dimensions inside those embeddings are entangled: nudge one feature and aligned features move with it, so a model can't adjust one fine distinction without off-target drift (see "Do LLM semantic features organize along human evaluation dimensions?"). The architecture is built to get the gist cheaply, which is exactly why the contextual fine print falls out.
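The coarse-to-fine ordering can be checked by hand on a toy hierarchy. The sketch below is an illustration under stated assumptions, not the analysis from the cited note: the four items, the seven hand-built feature dimensions, and the 0.5 leaf weight are all hypothetical. Each item's embedding is a shared root component plus a branch component plus a smaller leaf component. Instead of calling an eigensolver, the code verifies that four hand-picked contrast directions are eigenvectors of the Gram matrix and then ranks them by eigenvalue.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(G, v):
    return [dot(row, v) for row in G]

def rayleigh(G, v):
    """Rayleigh quotient; equals the eigenvalue when v is an eigenvector of G."""
    return dot(matvec(G, v), v) / dot(v, v)

def is_eigenvector(G, v, tol=1e-9):
    lam = rayleigh(G, v)
    return all(abs(gv - lam * x) <= tol for gv, x in zip(matvec(G, v), v))

# Hypothetical embeddings over dimensions
# [root, animal, vehicle, cat, dog, car, bus]; leaf features get a smaller weight.
embeddings = {
    "cat": [1, 1, 0, 0.5, 0, 0, 0],
    "dog": [1, 1, 0, 0, 0.5, 0, 0],
    "car": [1, 0, 1, 0, 0, 0.5, 0],
    "bus": [1, 0, 1, 0, 0, 0, 0.5],
}
items = list(embeddings)

# Gram matrix of pairwise inner products between item embeddings.
gram = [[dot(embeddings[a], embeddings[b]) for b in items] for a in items]

# Candidate contrasts over (cat, dog, car, bus); together they form an orthogonal basis.
directions = {
    "all items (shared root)": [1, 1, 1, 1],
    "animal vs vehicle": [1, 1, -1, -1],
    "cat vs dog": [1, -1, 0, 0],
    "car vs bus": [0, 0, 1, -1],
}
assert all(is_eigenvector(gram, v) for v in directions.values())

# Rank the contrasts by eigenvalue, largest first.
spectrum = sorted(
    ((rayleigh(gram, v), name) for name, v in directions.items()),
    reverse=True,
)
for eigenvalue, name in spectrum:
    print(f"{eigenvalue:5.2f}  {name}")
```

With these made-up numbers the eigenvalues come out as 6.25 for the shared root, 2.25 for the animal/vehicle split, and 0.25 for each within-branch split. The broad branch contrast therefore carries nine times the spectral weight of the fine ones, so anything that keeps only the top few directions loses the leaf-level distinctions first.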

The interesting twist is that this gap may be closable with *structure* rather than scale. When visual reasoning on social tasks is broken into staged cognitive scaffolding (embodied perception, then reading the situation, then norm-grounded interpretation), the note reports a +8% improvement over flat chain-of-thought on social benchmarks, suggesting the missing ingredient was situated structure, not more raw reasoning (see "Can breaking down visual reasoning into three stages improve model performance?"). Conversely, piling on verbose reasoning actively *degrades* fine-grained perception, because the real bottleneck is visual attention allocation rather than verbalization (see "Does verbose chain-of-thought actually help multimodal perception tasks?"). And teaching quality judgments fails when models only see labeled examples: they learn surface patterns instead of the principled criteria that let a distinction generalize to new cases (see "Can models learn argument quality from labeled examples alone?").

The thing you might not have known you wanted: the distinctions that matter most for situated action are the ones with the *worst* compression ratio. They are costly to store and have low statistical payoff, but they are decisive in the moment. Even the deepest human/LLM divide turns out to be perspective-dependent: from the outside the two systems look categorically different, but inside a shared discourse they draw on the same symbolic substrate (see "Do humans and LLMs differ fundamentally or just superficially?"). That makes the gap in fine-grained, situated judgment less a hard wall than a difference in what each system is willing to pay to keep.

Sources 9 notes

Do LLMs compress concepts more aggressively than humans do?

Using Rate-Distortion Theory on cognitive datasets, LLMs capture broad category structure but lose fine-grained distinctions humans preserve. LLMs maximize compression efficiency; humans trade compression for contextual meaning that enables situated action.

Does projection strength vary by context or by word type?

Across 19 English expressions, projectivity varies continuously based on whether content addresses the Question Under Discussion. The same presupposition trigger projects more or less depending on context, not on fixed lexical properties.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Do embedding eigenvectors organize taxonomy from coarse to fine?

Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.

Do LLM semantic features organize along human evaluation dimensions?

Twenty-eight semantic axes in LLM embeddings reduce to three principal components matching human EPA structure. Intervening on one feature predictably shifts aligned features proportionally, creating unavoidable off-target effects that reflect how meaning is fundamentally organized.

Can breaking down visual reasoning into three stages improve model performance?

CoCoT structures VLM reasoning through embodied perception, embedded situation analysis, and norm-grounded interpretation, achieving +8% improvement over flat CoT on social benchmarks. The gains suggest cognitive structure matters more than reasoning volume for social tasks.

Does verbose chain-of-thought actually help multimodal perception tasks?

Long rationales and text-token RL help reasoning but hurt fine-grained perception tasks because the actual bottleneck is visual attention allocation, not verbalization. Standard CoT optimization trains the wrong policy target.

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.

Do humans and LLMs differ fundamentally or just superficially?

Applied Habermas's observer/participant distinction to AI: from outside, humans and LLMs are utterly different; from within shared discourse, both draw on the same symbolic substrate, making the difference structural rather than absolute.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: *What fine-grained distinctions matter most for human situated action in categories, and can LLMs be taught to preserve them?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026 and include:
• LLMs aggressively compress concepts to broad statistical patterns while humans deliberately under-compress, preserving contextual nuance with poor compression ratio but high situational payoff (2025-05).
• Meaning projects gradient-wise depending on conversational context, not as fixed category membership; the same word's semantic weight shifts with the question being discussed (2025-08).
• Embedding spaces organize coarse-to-fine mirroring hypernym trees, and semantic features are entangled in low-dimensional structure, making fine distinctions hard to isolate without off-target drift (2025-08, 2026-05).
• Structured cognitive scaffolding (embodied perception → situation reading → norm interpretation) improves fine-grained social reasoning; verbose reasoning degrades it by optimizing the wrong bottleneck (2025-07).
• Annotation responses decompose into three signal types (genuine preferences, non-attitudes, constructed on-the-spot); treating them as one collapses the contextual distinction that matters (2026-01).

Anchor papers (verify; mind their dates):
• 2025-05 arXiv:2505.17117 — *From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning*
• 2025-08 arXiv:2508.12863 — *Word Meanings in Transformer Language Models*
• 2025-07 arXiv:2507.20409 — *Cognitive Chain-of-Thought: Structured Multimodal Reasoning about Social Situations*
• 2026-01 arXiv:2604.03238 — *Measuring Human Preferences in RLHF is a Social Science Problem*

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above—compression trade-offs, gradient projection, entanglement, scaffolding gains, annotation decomposition—judge whether newer training methods (e.g., mixture-of-experts, retrieval augmentation, situated fine-tuning), multimodal integration, or scaffolding orchestration (memory+caching across dialogue) have since relaxed or overturned each limitation. Separate the durable question (likely: how to weight contextual relevance in embeddings) from perishable constraints (possibly: entanglement is unavoidable). Cite what resolved it.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Look especially for papers showing LLMs *do* preserve fine-grained distinctions under specific training regimes, or arguing that compression and nuance are not actually opposed.
(3) **Propose 2 research questions that ASSUME the regime may have moved:** e.g., "Can situated fine-tuning on domain-specific annotation decompositions teach models to weight context-dependent projections?" or "Do retrieval-augmented embeddings escape entanglement enough to isolate nuanced distinctions in live dialogue?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
