How do thought actions represent policy improvement steps in practice?
This explores the reinforcement-learning idea that a model's thoughts behave like actions in a policy, each reasoning step nudging the model toward better behavior, and what the corpus actually shows about that mechanism. The cleanest version of the idea comes from work that formalizes thinking inside a 'thought MDP,' where a thought isn't new reasoning so much as a move that selects among sub-policies the model already contains (see 'Does thinking emerge when agents choose between learned sub-policies?'). In that frame, a policy improvement step isn't the model inventing a new skill mid-thought; it's the model using a thought to pick a better path through capabilities it already had, which is why rich initialization plus RL pressure matters more than raw thinking length.
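To make the 'thought as selection' framing concrete, here is a minimal toy sketch in Python. It is not the formalism from the cited work: the one-dimensional world, the two sub-policies, the `GOAL` position, and the `("think", k)` action encoding are all hypothetical choices made for illustration. The only property it is meant to show is that a thought action changes which existing sub-policy is active without touching the environment or earning reward by itself.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical sub-policies the agent already "contains" before any thinking.
SubPolicy = Callable[[int], int]


def go_left(pos: int) -> int:
    return -1


def go_right(pos: int) -> int:
    return +1


SUB_POLICIES: List[SubPolicy] = [go_left, go_right]
GOAL = 3  # assumed goal position on a 1-D line


@dataclass(frozen=True)
class ThoughtState:
    pos: int = 0      # environment state
    active: int = 0   # index of the currently selected sub-policy


def step(state: ThoughtState, action: Tuple) -> Tuple[ThoughtState, float]:
    """Apply one action: ("think", k) selects sub-policy k, ("act",) moves."""
    if action[0] == "think":
        # A thought changes only the internal selection: no movement, no reward.
        return ThoughtState(state.pos, action[1]), 0.0
    move = SUB_POLICIES[state.active](state.pos)
    new_pos = state.pos + move
    reward = 1.0 if new_pos == GOAL else 0.0
    return ThoughtState(new_pos, state.active), reward


def rollout(actions: List[Tuple]) -> Tuple[ThoughtState, float]:
    """Run a fixed action sequence from the start state and sum the reward."""
    state, total = ThoughtState(), 0.0
    for action in actions:
        state, reward = step(state, action)
        total += reward
    return state, total


# Without a thought the agent stays on its default sub-policy and walks away
# from the goal; one thought up front re-selects a sub-policy it already had.
_, return_without_thought = rollout([("act",)] * 3)
_, return_with_thought = rollout([("think", 1)] + [("act",)] * 3)
```

In this toy, the improvement comes entirely from which sub-policy gets selected: the thought adds no new capability, which is the sense in which initialization matters more than thinking length.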
That 'thought-as-action' picture gets sharper when you look at what post-training does to a model's self-relation. After post-training, models start treating their own outputs as actions that shape their next inputs, closing an action-perception loop that plain next-token prediction never had. The visible signature is 3-4x lower output entropy when the model runs on its own trajectory (see 'Do models recognize their own outputs as actions shaping future inputs?'). A thought becomes a genuine policy step only once the model is, in effect, acting on a world it co-authors. Two training-time approaches lean directly into this: treating chain-of-thought as an exploratory action rewarded by how much it improves the next prediction (see 'Can chain-of-thought reasoning be learned during pretraining itself?'), and baking 'future information' into training tokens so the model learns to plan toward a goal without any architectural change (see 'Can embedding future information in training data improve planning?'). The first makes the reward for a thought explicit: the thought earns its keep by improving what comes after. The second builds the same forward-looking pressure into the training data itself.
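The 'rewarded by how much it improves the next prediction' idea can be sketched as a log-likelihood difference. This is an illustrative reading, not the cited method's exact objective: the function name `thought_reward`, the choice of a no-thought baseline, and the probabilities below are all assumptions made up for the example.

```python
import math


def thought_reward(p_with_thought: float, p_without_thought: float) -> float:
    """Log-likelihood improvement on the observed next token.

    p_with_thought:    probability the model gives the true next token
                       after conditioning on the sampled thought.
    p_without_thought: probability it gives the same token with no thought
                       (assumed baseline for this sketch).
    """
    for p in (p_with_thought, p_without_thought):
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must be in (0, 1]")
    return math.log(p_with_thought) - math.log(p_without_thought)


# Made-up numbers: a thought that raises the true token's probability from
# 0.2 to 0.6 earns positive reward; one that lowers it to 0.1 is penalized.
helpful = thought_reward(0.6, 0.2)
harmful = thought_reward(0.1, 0.2)
```

Because the signal is computed from the model's own predictive probabilities, no external verifier is needed, which matches the 'verifier-free' description in the source note. A thought that leaves the prediction unchanged scores exactly zero.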
The most interesting practical wrinkle: the same thinking mechanism can be a good policy step or a bad one, and training decides which. Vanilla models often use extended thinking to talk themselves into self-doubt, degrading answers; RL redirects that exact mechanism into productive gap analysis (see 'Does extended thinking help or hurt model reasoning?'). So a 'policy improvement step' is not guaranteed by thinking more. There is a critical token threshold past which accuracy falls as models overthink easy problems (see 'Does more thinking time always improve reasoning accuracy?').
Here the corpus pushes back hard, and that's the part worth knowing. If thoughts were real policy improvement steps, you'd expect later steps to correct earlier ones, but reflection in reasoning models is mostly confirmatory theater, rarely changing the first answer (see 'Is reflection in reasoning models actually fixing mistakes?'). Worse for the strong interpretation: logically invalid reasoning chains perform nearly as well as valid ones, suggesting the model learns the form of a thought rather than doing inference through it (see 'Does logical validity actually drive chain-of-thought gains?'). Decomposition studies find CoT performance is a blend of output probability, memorization, and genuine but noisy reasoning that accumulates error at each step (see 'What three separate factors drive chain-of-thought performance?'), and in multi-LLM pipelines the chains explain failures only in retrospect rather than driving the decision (see 'Does chain of thought reasoning actually explain model decisions?').
The synthesis, then: 'thought as policy improvement step' is real as a training-time selection mechanism — thoughts steer the model among policies it already holds, and RL tunes which steering helps — but it is largely not real as step-by-step logical self-correction at inference. The improvement is happening in the policy the thoughts select, not inside the visible text of any single thought.
Sources (10 notes)
Research formalizes thinking as selecting between sub-policies already contained in a policy function through a thought MDP framework. The key finding: thinking doesn't require new reasoning capabilities but rather rich policy initialization combined with RL-driven selection pressure.
Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.
RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
TRELAWNEY augments training data with special tokens encapsulating future information, allowing models to learn goal-conditioned generation using standard infrastructure. Results show improved planning, algorithmic reasoning, and story generation without modifying architecture or training procedures.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Analysis of 8 reasoning models shows reflections rarely change answers and primarily serve as post-hoc confirmation. Training on longer reflection chains improves first-answer quality, not self-correction capability.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties, not logical validity, drive the gains. The model learns the form of reasoning, not genuine inference.
A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.
Reviewer scores for reasoning chains are weakly correlated with response quality in multi-LLM pipelines. Plausible-looking reasoning often precedes incorrect outputs, and chains reflect failures only in retrospect, making them poor explanations despite appearing coherent.