SYNTHESIS NOTE

Can models learn to evaluate their own work during training?

Explores whether language models can internalize reward function computation as part of training, transforming external feedback into internal self-assessment capability without slowing inference.

Synthesis note · 2026-02-23 · sourced from Novel Architectures

Current training paradigms stop learning at the end-of-sequence token, leaving the sequence space after the model's completed output unused. Post-Completion Learning (PCL) systematically exploits this neglected space. A temporary termination marker (<-- post-completion -->) creates a "post-thinking" space where the model continues generating self-assessments and reward predictions during training. Inference stops at the marker, so deployment incurs no additional cost.
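The marker mechanism can be sketched in a few lines. This is a minimal illustration, not the PCL implementation: the marker string is quoted from the note, while the function names, the text layout of the post-completion segment, and the "predicted_reward" field are assumptions made for the example.

```python
# Marker string as quoted in the note; everything else here is hypothetical.
POST_COMPLETION_MARKER = "<-- post-completion -->"


def build_training_text(answer: str, self_assessment: str, predicted_reward: float) -> str:
    """Join the answer and the post-completion segment into one training sequence."""
    post = f"{self_assessment}\npredicted_reward: {predicted_reward:.2f}"
    return f"{answer}\n{POST_COMPLETION_MARKER}\n{post}"


def truncate_at_marker(generated: str) -> str:
    """Inference-side view: keep only the text that precedes the marker."""
    head, _, _ = generated.partition(POST_COMPLETION_MARKER)
    return head.rstrip()


train_text = build_training_text(
    "The answer is 42.",
    "Arithmetic checked; format is correct.",
    1.0,
)
print(truncate_at_marker(train_text))
```

In a real decoding loop the marker would more plausibly act as a stop sequence, so the post-completion tokens are never generated at all; the truncation above only illustrates which part of the sequence the user would see.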

The core innovation is white-box reinforcement learning: the model explicitly learns to understand and compute reward functions, internalizing the reward model as its own evaluation capability. This transforms the model from "passive reward acceptance" (external reward signal tells it what's good) to "active self-evaluation" (it learns to compute quality assessments itself).
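One way to picture a white-box reward is a reward function whose components are all inspectable, paired with a score for how closely the model's own predicted reward matches the computed one. The sketch below is an assumption-laden toy: the two components (exact-match accuracy and a single-line format check), their equal weights, and the names `reward_fn` and `self_eval_error` are hypothetical and not taken from the source.

```python
def reward_fn(answer: str, reference: str) -> dict:
    """Hypothetical transparent reward: every component is visible to the learner."""
    cleaned = answer.strip()
    accuracy = 1.0 if cleaned == reference.strip() else 0.0
    # Toy format rule (assumption): a non-empty, single-line answer.
    fmt = 1.0 if cleaned and "\n" not in cleaned else 0.0
    return {"accuracy": accuracy, "format": fmt, "total": 0.5 * accuracy + 0.5 * fmt}


def self_eval_error(predicted_total: float, actual_total: float) -> float:
    """Gap between the model's self-predicted reward and the computed reward."""
    return abs(predicted_total - actual_total)


actual = reward_fn("41", "42")
# Suppose the model predicted a perfect score for a wrong answer.
print(actual, self_eval_error(1.0, actual["total"]))
```

Under this framing, training the model to drive `self_eval_error` toward zero is what "learning to compute the reward function" would mean in practice.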

Implementation uses dual-track SFT: one track optimizes reasoning, the other optimizes evaluation capability. These are mixed with RL training for multi-objective hybrid optimization. The model learns both to solve problems and to assess its own solutions — but critically, only the problem-solving capability is active during inference. The self-evaluation is internalized during training, shaping the model's generation without requiring explicit self-assessment at inference time.
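The dual-track idea can be illustrated as two loss masks over one sequence, split at the marker. This is a sketch under stated assumptions rather than the source's formulation: the per-token losses are given as plain numbers, the marker token is assigned to neither track, `eval_weight` is a hypothetical mixing coefficient, and the RL term mentioned above is omitted entirely.

```python
from typing import List, Tuple


def track_masks(tokens: List[str], marker: str) -> Tuple[List[int], List[int]]:
    """Return (reasoning_mask, evaluation_mask), split at the first marker token."""
    split = tokens.index(marker)
    reasoning = [1 if i < split else 0 for i in range(len(tokens))]
    evaluation = [1 if i > split else 0 for i in range(len(tokens))]
    return reasoning, evaluation


def masked_mean(losses: List[float], mask: List[int]) -> float:
    """Average the per-token losses selected by the mask."""
    count = sum(mask)
    return sum(l * m for l, m in zip(losses, mask)) / count if count else 0.0


def dual_track_loss(losses: List[float], reasoning_mask: List[int],
                    evaluation_mask: List[int], eval_weight: float = 0.5) -> float:
    """Combine the two tracks; eval_weight is an assumed hyperparameter."""
    return (masked_mean(losses, reasoning_mask)
            + eval_weight * masked_mean(losses, evaluation_mask))


tokens = ["A", "B", "<M>", "C", "D"]
per_token_loss = [1.0, 3.0, 0.0, 2.0, 4.0]
r_mask, e_mask = track_masks(tokens, "<M>")
print(dual_track_loss(per_token_loss, r_mask, e_mask))
```

Keeping the tracks as separate masks makes it straightforward to reweight or disable the evaluation track without touching the reasoning objective.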

This addresses three limitations simultaneously:

SFT's passive learning — models learn to mimic demonstrations without developing self-assessment ability
RL's external dependency — reward models are opaque external components; PCL internalizes the evaluation
Self-correction's inference cost — methods like Self-Refine require additional generation passes; PCL's self-evaluation is absorbed into training

The parallel with human cognition is direct: "Humans, after completing a task, often engage in self-reflection and quality assessment — this post-thinking process is crucial for improving future performance." PCL operationalizes this for LLMs.

This connects to "What limits how much models can improve themselves?": PCL attempts to close the gap between generation and verification by training the verifier and generator as the same model, with the verification capability internalized rather than external. It also complements "Does reflection in reasoning models actually correct errors?": PCL's self-evaluation is trained against ground-truth reward functions, not against the model's own prior outputs, which may avoid the confirmatory pattern in which reflection merely confirms what the model already decided.

Inquiring lines that read this note 132

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

What are the consequences of models training on synthetic data?

Why do reward structures fail to shape long-term agent learning?

How do training priors constrain what context information can override?

How do aggregate reward models systematically exclude minority user preferences?

Does learning community preferences as training rewards operationalize prediction without participation?

How do evaluation biases undermine LLM quality assessment systems?

Does AI fluency substitute for verifiable accuracy in human judgment?

Why can't AI models internalize audiences the way human experts do?

How does AI adoption affect human skill development and labor equality?

Can alternative training methods improve on supervised fine-tuning for language models?

Can next-token prediction alone produce genuine language understanding?

Why do language models reinforce false assumptions instead of correcting them?

How do we evaluate AI systems when user perception misleads actual performance?

How do self-generated feedback mechanisms enable effective model learning?

Why does self-revision increase model confidence while degrading accuracy?

What pretraining choices and baseline capability constrain reinforcement learning gains?

Does self-reflection enable models to reliably correct their errors?

Can model confidence signals reliably improve reasoning quality and calibration?

What properties determine whether reward signals teach genuine reasoning?

Can single-axis benchmarks accurately predict agent deployment success?

How does benchmark performance measure translate to general self-modification ability?

Does conversational format create illusions of genuine AI communication?

Can AI learn to perform attention-seeking surface forms with genuine internal appeal?

How does objective evolution guide discovery better than fixed planning?

How do language models inherit human biases from training data?

Do external perspectives fix the self-evaluation bias in language models?

How do transformer attention mechanisms implement memory and algorithmic functions?

Does bidirectional attention improve language models as universal encoders?

Do language models develop causal world models or rely on statistical patterns?

Which computational strategies best support reasoning in language models?

Can textual gradients generalize natural language feedback across computation graphs?

Can self-supervised signals enable process supervision without human annotation?

How should inference compute be adaptively allocated based on prompt difficulty?

How much inference efficiency do we gain by eliminating self-correction passes?

How should conversational agents balance goal-driven initiative with user control?

How can agents learn to estimate user satisfaction in real-time during conversation?

How should iterative research systems allocate reasoning per search step?

Can generator feedback backpropagate through the entire retrieval pipeline?

Does reinforcement learning teach reasoning or just when to reason?

What makes weaker teacher models effective for stronger student training?

Can self-training drift be prevented by applying student compatibility filtering?

Do language models learn genuine linguistic structure or just surface patterns?

Can ensemble evaluation methods reduce bias more than single judges?

Can evaluation trajectories and interaction histories replace single-answer scoring?

Why do agents confidently report success despite actually failing tasks?

What training objectives could reduce completion bias in autonomous agents?

How can AI agents autonomously learn and transfer skills across tasks?

How can process reward models supervise complex reasoning traces?

Is model self-awareness based on genuine introspection or pattern matching?

What capability tradeoffs emerge when scaling model reasoning abilities?

Can models learn to optimize their own chain-of-thought generation?

What constrains reinforcement learning's ability to expand model reasoning?

How do pairwise self-judgment and internal belief-shift replace verification differently?

How does example difficulty affect learning efficiency in language models?

How can language models extract more value from fewer demonstrations?

Does externalizing cognitive work and state improve agent reliability?

What memory architectures best support persistent reasoning across extended interactions?

How does externalized state affect the long-context bottleneck in language models?

How can identical external performance mask different internal representations?

Why do internal representations differ when external performance matches?

Can language model RL training avoid reward hacking and misalignment?

What makes current learned reward models fail across different domains?

Related concepts in this collection 4

What limits how much models can improve themselves? Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.
PCL addresses this by co-training generation and verification in the same model
Does reflection in reasoning models actually correct errors? When reasoning models reflect on their answers, do they genuinely fix mistakes, or merely confirm what they already decided? Understanding this matters for designing better training and inference strategies.
PCL's evaluation is trained against external reward functions, potentially avoiding confirmatory bias
Can model confidence work as a reward signal for reasoning? Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages.
related: both use the model's own assessment capability as a training signal
Do reward models actually consider what the prompt asks? Exploring whether standard reward models evaluate responses based on prompt context or just response quality alone. This matters because if models ignore prompts, they'll fail to align with what users actually want.
PCL internalizes reward computation, potentially avoiding the prompt-insensitivity problem
