Can models learn to evaluate their own work during training?
Explores whether language models can internalize reward function computation as part of training, transforming external feedback into internal self-assessment capability without slowing inference.
Current training paradigms stop learning at the end-of-sequence token, leaving the sequence space after the model's output unused. Post-Completion Learning (PCL) systematically exploits this neglected space. A temporary termination marker (<-- post-completion -->) opens a "post-thinking" space in which the model continues generating self-assessments and reward predictions during training. At inference, generation stops at the marker, so deployment incurs no additional cost.
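The marker mechanics can be pictured with a minimal sketch. It assumes a plain-string view of sequences; the `EOS` placeholder, the function names, and the `predicted_reward=` text format are hypothetical and chosen only for illustration, not taken from the PCL implementation.

```python
# Marker string as written in this note; EOS is a hypothetical placeholder.
POST_COMPLETION = "<-- post-completion -->"
EOS = "<eos>"


def build_training_text(prompt, answer, self_assessment, reward_prediction):
    """Training sequence: the answer, then the marker, then post-completion content.

    The layout of the post-completion part is an assumption for illustration.
    """
    post = f"{self_assessment} predicted_reward={reward_prediction:.2f}"
    return f"{prompt}{answer}{POST_COMPLETION}{post}{EOS}"


def truncate_at_marker(generated):
    """Inference: treat the marker as the stop point and drop anything after it."""
    return generated.split(POST_COMPLETION, 1)[0]


training_text = build_training_text(
    "Q: 2+2? ", "A: 4", "The answer matches the arithmetic.", 1.0
)
served_text = truncate_at_marker(training_text)
```

In this sketch the training text carries the self-assessment, while the served text ends where the marker would have been. That is the sense in which the post-completion space costs nothing at deployment.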
The core innovation is white-box reinforcement learning: the model explicitly learns to understand and compute reward functions, internalizing the reward model as its own evaluation capability. This transforms the model from "passive reward acceptance" (external reward signal tells it what's good) to "active self-evaluation" (it learns to compute quality assessments itself).
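A rough sketch of what "learning to compute the reward" could look like: a transparent reward function scores the answer, and the model's own reward prediction is scored by how closely it matches. Both functions below are hypothetical illustrations (exact-match reward, absolute-error agreement score), not the reward design used by PCL.

```python
def reward_function(answer, reference):
    """Hypothetical transparent reward: 1.0 for an exact match, otherwise 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def self_evaluation_score(predicted_reward, true_reward):
    """Score the model's reward prediction against the computed reward.

    Assumed form: 1 minus the absolute error, floored at 0.
    """
    return max(0.0, 1.0 - abs(predicted_reward - true_reward))


true_reward = reward_function("42 ", "42")
agreement = self_evaluation_score(0.75, true_reward)
```

Because the reward function is visible rather than an opaque external model, the agreement score gives the model a direct training target for its own evaluation.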
Implementation uses dual-track SFT: one track optimizes reasoning, the other optimizes evaluation capability. Both tracks are mixed with RL training for multi-objective hybrid optimization. The model learns both to solve problems and to assess its own solutions, but only the problem-solving capability is active during inference. The self-evaluation is internalized during training, shaping the model's generation without requiring explicit self-assessment at inference time.
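One way to read the dual-track setup is as two loss masks over the same response: tokens up to and including the marker feed the reasoning track, and tokens after it feed the evaluation track. The sketch below assumes per-token losses are already available and covers response tokens only. The function names, the `<pc>` marker token, and the weights `w_reason` and `w_eval` are hypothetical, and the RL component is omitted.

```python
def track_masks(tokens, marker):
    """Split response positions into a reasoning mask and an evaluation mask.

    Reasoning covers everything up to and including the marker;
    evaluation covers everything after it.
    """
    split = tokens.index(marker) + 1
    reasoning = [1.0 if i < split else 0.0 for i in range(len(tokens))]
    evaluation = [1.0 - m for m in reasoning]
    return reasoning, evaluation


def mixed_loss(token_losses, reasoning, evaluation, w_reason=1.0, w_eval=1.0):
    """Weighted sum of the mean per-token loss on each track."""

    def masked_mean(mask):
        total = sum(mask)
        if total == 0:
            return 0.0
        return sum(l * m for l, m in zip(token_losses, mask)) / total

    return w_reason * masked_mean(reasoning) + w_eval * masked_mean(evaluation)


tokens = ["A:", "4", "<pc>", "correct", "1.0"]
reasoning_mask, evaluation_mask = track_masks(tokens, "<pc>")
loss = mixed_loss([1.0, 1.0, 1.0, 2.0, 4.0], reasoning_mask, evaluation_mask, w_eval=0.5)
```

Keeping the two tracks as separate masked means lets their relative weight be tuned without the longer segment dominating the objective.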
PCL addresses three limitations simultaneously:
- SFT's passive learning — models learn to mimic demonstrations without developing self-assessment ability
- RL's external dependency — reward models are opaque external components; PCL internalizes the evaluation
- Self-correction's inference cost — methods like Self-Refine require additional generation passes; PCL's self-evaluation is absorbed into training
The parallel with human cognition is direct: "Humans, after completing a task, often engage in self-reflection and quality assessment — this post-thinking process is crucial for improving future performance." PCL operationalizes this for LLMs.
This connects to "What limits how much models can improve themselves?": PCL attempts to close the gap between generation and verification by training the verifier and generator as the same model, with the verification capability internalized rather than external. It also complements "Does reflection in reasoning models actually correct errors?": PCL's self-evaluation is trained against ground-truth reward functions, not against the model's own prior outputs, potentially avoiding the confirmatory pattern.
Inquiring lines that use this note as a source (122)
- What happens when models train on AI-generated content recursively?
- What cognitive capabilities do agents need to internalize social feedback?
- Why do Generation-Then-Comprehension and AI Delegation produce opposite learning outcomes?
- Does learning community preferences as training rewards operationalize prediction without participation?
- Can LLMs evaluate their own observations without external feedback?
- Why can't AI models internalize audiences the way human experts do?
- Why does AI-improved task performance fail to transfer to independent work?
- How do self-generated preference pairs from a strong teacher compare to human feedback?
- How does the silent token approach compare to modeling intrinsic motivation for speaking?
- Do language models share the same cooperative truth-seeking rules as humans?
- Can systems recognize and abstain on judgments rather than hallucinating preferences?
- What training signals would models need to learn reciprocal common-ground construction?
- Why does self-critiquing actually reduce plan quality in language models?
- Why does online RL succeed where supervised training fails for self-correction?
- What role does natural language play in breaking reinforcement learning performance plateaus?
- Can evaluation criteria be reliably encoded in labeled data without ground truth standards?
- Does self-revision actually improve reasoning in large language models?
- Why do error avalanches accelerate in self-training loops without verification?
- Do models actually self-assess their confidence or just confirm answers?
- Why do reward models learn surface-level shortcuts instead of genuine quality assessment?
- Can synthetic self-play data teach models when to disagree?
- Does training on critiques of noisy responses produce deeper understanding than imitating correct ones?
- Why does natural language feedback break performance plateaus that numerical rewards alone cannot?
- How does benchmark performance measure translate to general self-modification ability?
- Can reward model training be automated without changing feedback mechanisms?
- Can models learn better from critiquing errors than imitating correct responses?
- Why does self-generated training data outperform externally curated domain examples?
- How do implicit world models and self-reflection operationalize consequence-based learning?
- Does AI-assisted performance transfer to independent task completion?
- Can the serving loop itself become the primary training data source?
- Can models learn to generate their own training examples effectively?
- How do graduated phase rewards emerge complex dialogue behavior from simple objectives?
- Can AI learn to perform attention-seeking surface forms with genuine internal appeal?
- Why does self-correction during generation produce reliable labels without exemplars?
- Can subjective tasks be delegated without human feedback loops?
- Can co-evolved critics truly circumvent static evaluator limitations in self-improvement?
- Can LLMs learn to signal evaluative commitment through metadiscursive language?
- Can unsupervised confidence-based training scale to domains beyond human evaluation reach?
- Do external perspectives fix the self-evaluation bias in language models?
- Does bidirectional attention improve language models as universal encoders?
- Can language models accurately evaluate the quality of their own ideas?
- How do internal representations compare to human cognitive structures?
- When does self-reflection actually help reasoning models improve?
- Does the prediction unit shape what language models actually learn?
- Could reward signals incentivize active intent discovery over passive response generation?
- How do semantic reward shaping approaches compare to full critique models?
- Can textual gradients generalize natural language feedback across computation graphs?
- How do evaluative versus directive signals differ in next-state training?
- Can self-supervised methods replace human annotations for process reward models?
- Why do generative reward models produce more interpretable evaluations than scalar scores?
- Does reverse-curriculum learning approximate process supervision using only outcome signals?
- Can model confidence signals replace explicit external reward functions?
- How much inference efficiency do we gain by eliminating self-correction passes?
- Why do reward models fail when they ignore the prompt context?
- How can agents learn to estimate user satisfaction in real-time during conversation?
- How can reward structures teach models when to speak and when to stay silent?
- Can reward-guided decoding replace weight fine-tuning for personalized alignment?
- Does reflection training actually teach models to self-correct their mistakes?
- Can self-supervised process models replace human annotations at scale?
- How should training incorporate external critique versus encouraging self-correction?
- How does credit assignment work across many sequential decision steps in language models?
- How do reward models benefit from extended thinking during evaluation scoring?
- Can structured natural language feedback outperform scalar rewards in RL?
- Can generator feedback backpropagate through the entire retrieval pipeline?
- Does self-reflection help models notice their own constraint violations?
- How do instruction backtranslation and MAGPIE demonstrate self-generation principles?
- Can models learn both what and how to study through reinforcement learning?
- What causes length bias in language model reward models?
- Why does external critique improve revision accuracy more than self-assessment?
- Can language models accurately evaluate the quality of their own reasoning?
- Can AI evaluation match human judgment quality in structured domain tasks?
- Does meta-judging improve evaluator quality better than temporal decoupling alone?
- Does self-supervised process supervision work for domains with ambiguous correctness?
- How does self-referential processing transfer to other reasoning tasks?
- Can self-training drift be prevented by applying student compatibility filtering?
- Can language models generate plausible latent thoughts without human annotation?
- Why does external critique improve revision while internal self-assessment fails?
- Does internal self-revision actually degrade reasoning accuracy in models?
- Can AI learn intrinsic motivation to assess its own relevance?
- How does confirmatory reflection differ from corrective self-evaluation in models?
- Can preference learning fix the rigid output format problem better than supervised training?
- What distinguishes intrinsic metacognition from extrinsic human-designed loops?
- What emerges in large language models that makes explicit value modeling necessary?
- Can emotion-grounded rewards replace coarse bonus signals in hierarchical dialogue RL?
- How does training distribution shape what language models understand best?
- How does reinforcement learning on outcomes reinforce template-matching rather than computation?
- Can environmental rewards directly refine natural language descriptions of actions?
- Can evaluation trajectories and interaction histories replace single-answer scoring?
- What training objectives could reduce completion bias in autonomous agents?
- What other internal model decisions beyond attention could be optimized directly?
- Can an agent's internal probabilities serve as value signals across domains?
- Can binary judge feedback replace external reward signals for skill learning?
- Does self-play feedback improve skills created from the agent's own experience?
- Why does self-judgment of success or failure work without ground truth labels?
- How much data do generative process reward models actually need?
- Why does self-segmentation into chunks-of-thought matter for reward models?
- Do self-supervised process reward models scale better than human annotation?
- How do language models infer their own mental states like humans do?
- Can external retrieval signals outperform internal self-assessment during revision?
- Do models spontaneously develop self-reflection from minimal training signals?
- Can models detect when their own trajectory is on-policy versus off-policy?
- Does recognizing your outputs as actions enable awareness of being evaluated?
- How does metacognitive self-correction enable models to revise failed strategies?
- Can AI systems improve themselves without external feedback?
- Can language models function as implicit process reward models through retrospection?
- How does in-context feedback integration differ from learned reward signals?
- Can trained models encode programs more complex than their data-generating process?
- What emergent behaviors do models develop when trained on underspecified pedagogical tasks?
- What makes step-wise rewards denser than final-answer correctness signals?
- Does external critique guide revision better than internal self-assessment during model training?
- Does RL training redirect self-doubt into productive gap analysis?
- What are the actual limits of sibling comparison versus trained process reward models?
- Can models learn to optimize their own chain-of-thought generation?
- Does pairwise self-judgment avoid reward model scaling problems?
- How do internal model mechanisms escape token-level reinforcement signals?
- How do pairwise self-judgment and internal belief-shift replace verification differently?
- Why does self-critique fail without external verification signals?
- How can language models extract more value from fewer demonstrations?
- Why does externalizing bookkeeping raise effective feedback compute?
- What is the comprehension-generation asymmetry in language models?
- Does the generation-verification gap define where self-rewarding actually works?
- How does externalized state affect the long-context bottleneck in language models?
Related concepts in this collection (4)
- What limits how much models can improve themselves?
  Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.
  PCL addresses this by co-training generation and verification in the same model
- Does reflection in reasoning models actually correct errors?
  When reasoning models reflect on their answers, do they genuinely fix mistakes, or merely confirm what they already decided? Understanding this matters for designing better training and inference strategies.
  PCL's evaluation is trained against external reward functions, potentially avoiding confirmatory bias
- Can model confidence work as a reward signal for reasoning?
  Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages.
  related: both use the model's own assessment capability as a training signal
- Do reward models actually consider what the prompt asks?
  Exploring whether standard reward models evaluate responses based on prompt context or just response quality alone. This matters because if models ignore prompts, they'll fail to align with what users actually want.
  PCL internalizes reward computation, potentially avoiding the prompt-insensitivity problem
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Post-Completion Learning for Language Models
- Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future
- Post-Training Large Language Models via Reinforcement Learning from Self-Feedback
- Learning to Reason without External Rewards
- PretrainZero: Reinforcement Active Pretraining
- Self-Rewarding Language Models
- Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
- Training Language Models to Self-Correct via Reinforcement Learning
Original note title
post-completion learning uses the ignored post-eos space to internalize self-evaluation during training with zero inference cost