INQUIRING LINE


Does an AI assigning low odds to its next word mirror the pause humans take when a question is hard?

Do token probability distributions in LLMs track human reaction time patterns?

This explores whether the probabilities an LLM assigns to its next token line up with how long humans take to respond — the idea that low-probability outputs are 'harder' for a model the way slow reactions mark difficulty for people.

No note in the collection directly measures reaction-time latency against next-token probability, so the honest answer is that the corpus addresses the surrounding territory rather than the literal claim. That territory is unexpectedly rich, and it points both ways. The strongest thread in favor is the 'embers of autoregression' work, which treats an LLM as a probability machine and predicts where it will fail: tasks whose correct answer is a low-probability sequence (reciting the alphabet backwards, counting letters) get systematically harder even when they are logically trivial (see: Can we predict where language models will fail?). That has the same shape as the human reaction-time story, in which rare, effortful responses cost more, except that the model pays in accuracy rather than milliseconds. In that sense the probability distribution does encode a graded notion of 'effort,' which is the premise the question rests on.
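To make the 'low-probability sequence' idea concrete, here is a minimal sketch of how the improbability of a target response could be scored. It uses surprisal (negative log probability), a standard information-theoretic quantity that the notes themselves do not name. The function name and the per-token probabilities are hypothetical placeholders, not values from any model or paper.

```python
import math

def sequence_surprisal(token_probs):
    """Total surprisal in bits: the sum of -log2(p) over each token's probability.

    token_probs holds the probability the model assigned to each token that
    actually appears in the target sequence (hypothetical input format).
    """
    return sum(-math.log2(p) for p in token_probs)

# Placeholder numbers for illustration only: a 'familiar' sequence whose tokens
# are each likely, and a 'rare' sequence whose tokens are each unlikely.
familiar = [0.5, 0.5, 0.5]
rare = [0.25, 0.125, 0.25]

print(sequence_surprisal(familiar))  # 3.0 bits
print(sequence_surprisal(rare))      # 7.0 bits
```

Under this framing, a higher total marks a target the model is less likely to produce, which is the sense in which the paragraph above calls it 'harder'.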

The collection also shows LLMs capturing human behavior directly. Models fine-tuned on psychology-experiment data out-predict purpose-built cognitive models at forecasting human decisions, and they fold individual differences into their embeddings (see: Can language models learn to model human decision making?). So the link between LLM internals and human behavioral data is not hypothetical; it has been demonstrated for choices. Reaction time would be the natural next variable to test, and the success of decision prediction is why someone might expect token probabilities to track latency too.

The sharpest counterpoint comes from a note arguing that the analogy breaks at the level of time itself. LLM generation is sequential but atemporal: tokens are selected by probability with no intervening pause, revision, or 'duration in reflection' (see: Does AI text generation unfold through temporal reflection?). Human reaction time is precisely a measure of that reflective duration, the thinking time that changes what comes next. Taken seriously, this means a high-probability and a low-probability token are emitted in the same computational beat; any correlation with human RT would be a correlation of *difficulty rankings*, not of *process*. The model is not slower on hard tokens; it is just less certain.
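A correlation of difficulty rankings, as opposed to process, is what a rank correlation measures. The sketch below computes Spearman's rank correlation in plain Python. No note in the collection runs this analysis; the function names are hypothetical and the surprisal and reaction-time values are made-up placeholders that only show the shape of the test.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Placeholder data, NOT measurements: per-item model surprisal (bits) and
# per-item human reaction time (milliseconds).
surprisal_bits = [1.2, 3.4, 0.7, 5.1]
human_rt_ms = [410, 520, 450, 690]
rho = spearman(surprisal_bits, human_rt_ms)
```

A value of rho near 1 would say the two orderings agree. It would say nothing about whether the model spends more time on the harder items, which is the distinction the paragraph above draws.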

A useful refinement sits in the work on high-entropy 'forking' tokens: only about 20% of tokens carry real decision-weight, the pivotal branch points where the distribution is genuinely uncertain (see: Do high-entropy tokens drive reasoning model improvements?). That is a more precise candidate for what might map to human hesitation than raw probability across all tokens, since most tokens are near-deterministic filler that no human would deliberate over either. The theory-of-mind note is the cautionary footnote: where LLMs look like they are doing human cognition, they often default to surface pattern-matching rather than the underlying mental process (see: Do large language models genuinely simulate mental states?). The takeaway: token probability plausibly tracks *which* outputs are hard, and the corpus shows that fine-tuned LLMs can predict human choices in decision tasks. But 'reaction time' specifically asks about temporal effort, and the most direct note in the collection says that is exactly the dimension LLMs lack.
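The high-entropy 'forking' tokens mentioned above can be illustrated with a short sketch that scores each position by the Shannon entropy of its next-token distribution and keeps the most uncertain fraction. Selecting a fixed top fraction is an assumption made here for illustration, not necessarily how the cited work picks its tokens, and the function names and toy distributions are hypothetical.

```python
import math

def token_entropy(dist):
    """Shannon entropy in bits of one next-token distribution (probabilities sum to 1)."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def forking_positions(dists, top_fraction=0.2):
    """Indices of the highest-entropy positions, keeping about top_fraction of them."""
    entropies = [token_entropy(d) for d in dists]
    k = max(1, round(top_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])

# Toy distributions, one per generated position (illustrative values only).
dists = [
    [1.0],                     # fully determined
    [0.9, 0.1],                # nearly determined
    [0.5, 0.5],                # a two-way fork
    [0.99, 0.01],              # nearly determined
    [0.25, 0.25, 0.25, 0.25],  # a four-way fork
]
print(forking_positions(dists))       # [4]
print(forking_positions(dists, 0.4))  # [2, 4]
```

Positions flagged this way, rather than every low-probability token, would be the candidates to compare against human hesitation.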

Sources 5 notes

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Can language models learn to model human decision making?

LLMs finetuned on psychology experiment data predict human behavior more accurately than theory-driven models in decision tasks, capture individual differences in their embeddings, and transfer learning across tasks without task-specific design.

Does AI text generation unfold through temporal reflection?

Token ordering in LLMs follows probabilistic selection without intervening reflection or revision. Human discourse gains meaning from temporal structure—time spent thinking changes what comes next—but AI text production lacks this duration-in-reflection despite appearing sequentially composed.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Do large language models genuinely simulate mental states?

ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher testing whether token probability distributions in LLMs encode human reaction-time difficulty signals. This remains an open empirical question.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026, clustered around five threads:
• LLMs fail systematically on tasks whose correct answer is a low-probability token sequence (alphabet backward, letter counting), even when the task is logically simple; this has the same shape as reaction-time difficulty, with probability encoding a graded 'effort' signal (~2024–2025).
• LLMs fine-tuned on psychology-experiment data out-predict cognitive models on human decisions and fold individual differences into embeddings, suggesting LLM internals capture human behavioral constraints (~2024).
• LLM token generation is atemporal: high- and low-probability tokens are emitted in the same computational step, with none of the intervening 'reflection duration' that human reaction time measures (~2024–2025).
• Only ~20% of tokens carry real decision-weight (high-entropy 'forking' points); the rest is near-deterministic filler that humans would not deliberate over either (~2025).
• LLMs default to surface pattern-matching in theory-of-mind tasks rather than the underlying mental process (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2402.17385 (2024-02) — LLM-assisted decision-making determinants
• arXiv:2502.07266 (2025-02) — chain-of-thought length and effort
• arXiv:2506.01939 (2025-06) — high-entropy tokens and RL efficiency
• arXiv:2502.08796 (2025-02) — theory of mind evaluation review

Your task:
(1) RE-TEST each constraint. Has inference-time optimization (KV cache, speculative decoding, adaptive compute) eliminated the atemporal assumption by introducing measurable latency variation? Have recent models with introspection or scratchpad mechanisms developed temporal structure? Separate: *difficulty ranking* (likely durable) from *temporal process* (likely still absent); cite what structure now exists.
(2) Surface the strongest work from the last 6 months that contradicts the 'atemporal' thesis or shows reaction-time correlation in a new modality (vision, multimodal, reasoning traces).
(3) Propose 2 questions assuming the regime shifted: (a) Do steering methods that amplify token uncertainty also amplify observable compute or latency? (b) Can high-entropy-token identification predict human hesitation better than raw probability?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
