SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Reasoning, Retrieval, and Evaluation · Model Architecture and Internals

What reasoning features does each difficulty level reinforce?

When models train on problems of different difficulty, do they build the same internal reasoning machinery or different kinds? This matters because accuracy gains alone hide what's actually being learned.

Synthesis note · 2026-05-28 · sourced from RLVR
What does reward learning actually do to model reasoning?

Reward curves and advantage magnitudes tell you whether training is improving accuracy, but they are silent about what kind of reasoning is being reinforced. Reading reinforcement learning with verifiable rewards (RLVR) through a Temporal Sparse Autoencoder (T-SAE), which extracts sparse reasoning features from activations along the reasoning trajectory, exposes a structured story that the scalar signals hide. Difficulty does not just change how much the model learns; it changes which internal features get strengthened and which get suppressed.
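To make "sparse features along the reasoning trajectory" concrete, here is a minimal sketch of the encoding step. It uses a generic top-k sparse autoencoder encoder as a stand-in; it is not the T-SAE architecture itself, which the note does not specify. The function name `encode_sparse`, the random weights, and all shapes are assumptions for illustration only.

```python
import numpy as np

def encode_sparse(activations, W_enc, b_enc, k):
    """Map per-step activations (T, d_model) to sparse features (T, n_features).

    Generic top-k SAE encoder (an assumed stand-in, not the actual T-SAE):
    ReLU pre-activations, then keep only the k largest features per step.
    """
    pre = np.maximum(activations @ W_enc + b_enc, 0.0)
    # Column indices of the k largest pre-activations in each row.
    top = np.argpartition(pre, -k, axis=1)[:, -k:]
    sparse = np.zeros_like(pre)
    rows = np.arange(pre.shape[0])[:, None]
    sparse[rows, top] = pre[rows, top]
    return sparse

# Toy inputs: random numbers standing in for real model activations
# and for trained encoder weights.
rng = np.random.default_rng(0)
T, d_model, n_features, k = 6, 16, 64, 4
acts = rng.normal(size=(T, d_model))
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
b_enc = np.zeros(n_features)

feats = encode_sparse(acts, W_enc, b_enc, k)
print(feats.shape)               # one sparse feature vector per trajectory step
print((feats > 0).sum(axis=1))   # at most k active features per step
```

With a trained encoder in place of the random weights, each row of `feats` would be the feature readout for one step of a reasoning trace, which is the object the rest of this note reasons about.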

The breakdown: easy problems mainly reinforce direct-answer and basic-computation features while actively suppressing deliberative-reasoning features — the model learns to shortcut because shortcutting works. Hard problems activate reasoning-related features, but those features become useful only on the rare successful trajectory, so most hard-sample updates do not consolidate them. Medium-difficulty problems provide a balanced signal, strengthening both computation and multi-step reasoning features at once. The same accuracy gain can therefore correspond to opposite internal changes depending on the difficulty of the data producing it.
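The per-difficulty comparison above can be sketched as a small audit: bucket problems by difficulty, then average the change in each feature's activation within each bucket. Everything here is an assumption for illustration: the pass-rate thresholds, the helper names `bucket_by_pass_rate` and `feature_shift_by_bucket`, and the toy arrays, which are made-up inputs to exercise the code and not measurements from any run.

```python
import numpy as np

def bucket_by_pass_rate(pass_rate, easy_min=0.8, hard_max=0.2):
    """Label each problem 'easy', 'medium', or 'hard' by its pass rate.

    The thresholds are hypothetical; the note does not define the buckets.
    """
    return np.where(pass_rate >= easy_min, "easy",
                    np.where(pass_rate <= hard_max, "hard", "medium"))

def feature_shift_by_bucket(feats_before, feats_after, buckets):
    """Mean per-feature activation change within each difficulty bucket.

    feats_*: (n_problems, n_features) trajectory-averaged feature activations
    measured before and after training.
    """
    delta = feats_after - feats_before
    return {str(b): delta[buckets == b].mean(axis=0) for b in np.unique(buckets)}

# Toy inputs (fabricated for illustration only): six problems, two features.
pass_rate = np.array([0.9, 1.0, 0.5, 0.4, 0.1, 0.0])
before = np.zeros((6, 2))
after = np.array([[0.2, -0.1], [0.4, -0.3],
                  [0.1, 0.1], [0.3, 0.1],
                  [0.0, 0.0], [0.0, 0.2]])

buckets = bucket_by_pass_rate(pass_rate)
shifts = feature_shift_by_bucket(before, after, buckets)
for name, shift in shifts.items():
    print(name, shift)
```

A positive entry means the feature is more active after training on that bucket's problems and a negative entry means it is less active. This per-bucket sign pattern is the kind of information a single accuracy number cannot show.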

Why it matters: the feature-level picture warns that benchmark improvement is an ambiguous summary statistic. Two RLVR runs can post similar accuracy gains while one has built multi-step reasoning machinery and the other has sharpened answer-shortcutting and let deliberation atrophy. The feature-level view is what distinguishes them, and it is the basis for difficulty-adaptive interventions that target feature consolidation directly (e.g., feature-guided training signals). The connection to interpretability work is direct: this is the same SAE-feature lens that lets you steer or read reasoning, now used to audit what a training regime is silently rewarding. The limitation is that T-SAE features are themselves a learned, imperfect decomposition: the "reasoning feature" labels are interpretive, not ground truth.

Original note title

different difficulty levels selectively reinforce or suppress distinct reasoning features invisible from advantage signals alone