SYNTHESIS NOTE
Reasoning, Retrieval, and Evaluation · Training, RL, and Test-Time Scaling · Model Architecture and Internals

Can we measure how deeply a model actually reasons?

What if reasoning quality isn't about length or confidence, but about how much a model's predictions shift across its internal layers? Can tracking these shifts reveal genuine thinking versus pattern-matching?

Synthesis note · 2026-04-20 · sourced from Cognitive Models Latent

Token count is an unreliable proxy for reasoning quality. Longer reasoning does not consistently correlate with accuracy and may signal overthinking that degrades performance. Confidence-based metrics fare no better. The question is: how do you measure whether a model is actually thinking rather than merely generating?

Deep-thinking ratio (DTR) operationalizes this by looking inside the model. At each token position, intermediate-layer hidden states are projected into the vocabulary space and compared to the final-layer prediction distribution. Tokens whose predictions stabilize early — where shallow layers already predict the same thing as deep layers — reflect low computational effort. Tokens whose predictions undergo sustained revision through deeper layers before converging are "deep-thinking tokens" — the model is genuinely computing something at that position.
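The per-token comparison described above can be sketched roughly as follows. This is a minimal illustration, not the published procedure: the Jensen-Shannon divergence, the `tol` threshold, and the names `settling_layer` and `unembed` are assumptions introduced here, and the final layer normalization that a real model applies before its unembedding matrix is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon divergence (natural log) along the last axis.
    # The choice of divergence is an assumption of this sketch.
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * (np.log(p + eps) - np.log(m + eps)), axis=-1)
    kl_qm = np.sum(q * (np.log(q + eps) - np.log(m + eps)), axis=-1)
    return 0.5 * (kl_pm + kl_qm)

def settling_layer(hidden, unembed, tol=0.1):
    """Return the first layer from which the prediction stays close to the final one.

    hidden:  (num_layers, d_model) hidden states for ONE token position.
    unembed: (d_model, vocab_size) projection into vocabulary space.
    tol:     hypothetical divergence threshold for "already stabilized".
    """
    probs = softmax(hidden @ unembed)        # (num_layers, vocab_size)
    div = js_divergence(probs, probs[-1])    # distance of each layer to the final layer
    unsettled = np.nonzero(div > tol)[0]     # layers still far from the final prediction
    return 0 if unsettled.size == 0 else int(unsettled[-1]) + 1
```

A token whose settling layer is near 0 stabilized early; a token whose settling layer is near the top of the stack was still being revised in deep layers.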

DTR is the proportion of deep-thinking tokens in a generated sequence. Across AIME 24/25, HMMT 25, and GPQA-diamond with GPT-OSS, DeepSeek-R1, and Qwen3, DTR exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines.
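As a rough illustration of that proportion, the sketch below assumes each token has already been assigned a settling layer (the layer at which its prediction stopped changing) and counts a token as deep-thinking when that layer falls in the top part of the stack. The function name and the `depth_fraction` cutoff are hypothetical choices made for this example, not values taken from the source.

```python
from typing import Sequence

def deep_thinking_ratio(settle_layers: Sequence[int],
                        num_layers: int,
                        depth_fraction: float = 0.85) -> float:
    """Fraction of tokens whose prediction settles only in the deepest layers.

    settle_layers:  per-token settling layer index, in [0, num_layers - 1].
    depth_fraction: hypothetical cutoff; a token counts as deep-thinking if it
                    settles at or beyond this fraction of the layer stack.
    """
    if not settle_layers:
        return 0.0
    cutoff = depth_fraction * (num_layers - 1)
    deep = sum(1 for s in settle_layers if s >= cutoff)
    return deep / len(settle_layers)
```

With 32 layers and the default cutoff, tokens settling at layers 0, 2, 30, and 31 would give a ratio of 0.5, since only the last two settle beyond layer 26.35.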

The practical application is Think@n: a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. Rather than standard self-consistency (generate n samples, majority vote), Think@n selects samples where the model was genuinely reasoning rather than pattern-matching. Think@n matches or exceeds self-consistency performance while significantly reducing inference costs by enabling early rejection of unpromising generations based on short prefixes.
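The selection step might look something like the sketch below. It assumes each sample carries a DTR score computed on a short prefix and that the top-scoring fraction is kept before voting; the `Sample` type, the `keep_fraction` parameter, and the top-k rule are illustrative assumptions rather than the published procedure. In a real pipeline the rejected samples would be stopped after the prefix, which is where the cost saving comes from; here every sample already has an answer only to keep the example small.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

@dataclass
class Sample:
    answer: str        # final answer extracted from the generation
    prefix_dtr: float  # DTR measured on a short prefix (hypothetical field)

def think_at_n(samples: List[Sample], keep_fraction: float = 0.5) -> str:
    """Majority vote restricted to the samples with the highest prefix DTR."""
    if not samples:
        raise ValueError("need at least one sample")
    k = max(1, int(len(samples) * keep_fraction))
    # Rank by prefix DTR, highest first, and keep the top k.
    kept = sorted(samples, key=lambda s: s.prefix_dtr, reverse=True)[:k]
    votes = Counter(s.answer for s in kept)
    return votes.most_common(1)[0][0]
```

Setting `keep_fraction=1.0` reduces this to plain self-consistency, which makes the two strategies easy to compare on the same set of samples.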

DTR complements several existing measurement approaches. In relation to the note "Do reflection tokens carry more information about correct answers?", MI peaks identify which tokens matter, while DTR identifies how deeply the model computes at each token; the two are orthogonal measurements of the same underlying phenomenon. In relation to the note "Does chain-of-thought reasoning reflect genuine thinking or performance?", DTR provides a token-level mechanistic explanation for the sequence-level observation: performative reasoning should show low DTR (early layer stabilization), while genuine reasoning should show high DTR (deep revision).

The deeper implication aligns with the note "Where does LLM reasoning actually happen during generation?": DTR measures what is happening in latent-state dynamics (H1), not in the surface trace (H2). Two sequences with identical token counts and identical surface text could have radically different DTR, one genuinely reasoning and the other pattern-matching. This is exactly the kind of metric the H1 framework calls for.

The shift from "how long they think" to "how deeply they think" reframes efficiency: the goal is not shorter chains but denser computation per token.

Original note title

deep-thinking ratio measures genuine reasoning effort by tracking layer-wise prediction stabilization — outperforming length and confidence as accuracy predictors