SYNTHESIS NOTE


Do fine-tuned language models actually learn optimization procedures?

Can RL fine-tuning teach LLMs to solve constraint-optimization problems through genuine reasoning, or does it merely sharpen pattern-matching? Testing on out-of-distribution variants reveals the mechanism.

Synthesis note · 2026-05-18 · sourced from Reasoning Architectures

The constraint-optimization study uses a clean diagnostic to separate procedure from pattern: an N-case test set (in-distribution power-grid topologies) and an N-1 test set (the same problems with one element removed, which puts them out of distribution while keeping the structure recognizable). A model that executes the actual optimization procedure should perform comparably on both. A model that pattern-matches against topologies it has seen should perform worse on N-1.

Even after training with GRPO and constraint-satisfaction rewards, models degrade markedly on N-1. The conclusion is that RL on outcome-based rewards does not install the missing procedure; it sharpens the template-matching strategy along the in-distribution axis. The model gets better at recognizing patterns it has seen and, relative to that gain, worse at adapting to perturbed structure.

This is methodologically important because it provides a probe that other reasoning evaluations lack. Most benchmarks cannot distinguish "the model solved this" from "the model recognized this." The N / N-1 comparison forces the distinction by holding the problem class fixed while perturbing the instance. The drop in performance from N to N-1 is the memorization signature.
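The comparison described above can be sketched in a few lines. This is a minimal illustration, not the study's evaluation harness: the `Instance` type, the `solve` callable, and exact-match scoring are all assumptions introduced here, and a real constraint-optimization evaluation would likely score feasibility and cost rather than string equality.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class Instance:
    """Hypothetical container: a problem statement and its reference answer."""
    problem: str
    answer: str


def accuracy(solve: Callable[[str], str], instances: Sequence[Instance]) -> float:
    """Fraction of instances where the model output exactly matches the reference."""
    if not instances:
        raise ValueError("instances must be non-empty")
    correct = sum(1 for inst in instances if solve(inst.problem) == inst.answer)
    return correct / len(instances)


def structure_shift_gap(
    solve: Callable[[str], str],
    n_set: Sequence[Instance],
    n_minus_1_set: Sequence[Instance],
) -> Dict[str, float]:
    """Score the same solver on the N set and the N-1 set and report the drop.

    A gap near zero is what a solver running the procedure would show;
    a large positive gap is the signature the note attributes to recall.
    """
    acc_n = accuracy(solve, n_set)
    acc_n1 = accuracy(solve, n_minus_1_set)
    return {"acc_n": acc_n, "acc_n_minus_1": acc_n1, "gap": acc_n - acc_n1}
```

A pure lookup-table solver makes the extreme case concrete: it scores perfectly on the instances it has stored and fails on every perturbed one, so the reported gap is the full accuracy range.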

For practitioners, the diagnostic generalizes. Wherever a deployment cares whether a model is computing or recalling (clinical reasoning, legal-statute reasoning, scientific problem-solving), building an "N-1" counterpart of the canonical test set is a cheap way to surface memorization. The structure-shift probe is more informative than headline accuracy on the canonical set alone.
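One way to build such a counterpart set, sketched for the graph-shaped case, is to enumerate every single-element removal of a canonical instance. Everything here is an assumption made for illustration: the note does not say how the study generated its N-1 cases, the network is reduced to a bare node set and edge set, and the optional connectivity filter is just one plausible way to keep the perturbed problem well-posed. Reference answers for each variant would still have to come from a trusted solver.

```python
from typing import FrozenSet, Iterable, Iterator, Set, Tuple

Edge = Tuple[str, str]


def is_connected(nodes: Set[str], edges: Iterable[Edge]) -> bool:
    """Return True if every node is reachable from every other (undirected)."""
    if not nodes:
        return True
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for neighbour in adjacency[current] - seen:
            seen.add(neighbour)
            stack.append(neighbour)
    return seen == nodes


def n_minus_1_variants(
    nodes: Set[str],
    edges: Set[Edge],
    keep_connected: bool = True,
) -> Iterator[Tuple[Edge, FrozenSet[Edge]]]:
    """Yield (removed_edge, remaining_edges) for each single-edge removal.

    With keep_connected=True, variants that split the network are skipped
    (an assumption of this sketch, not a rule taken from the note).
    """
    for removed in sorted(edges):
        remaining = frozenset(e for e in edges if e != removed)
        if keep_connected and not is_connected(nodes, remaining):
            continue
        yield removed, remaining
```

For a three-node ring every removal leaves the network connected, so all three variants are produced; for a three-node chain every removal isolates a node, so the filtered generator yields nothing and the filter would need to be relaxed.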




Original note title

N-1 out-of-distribution tests reveal that RL fine-tuned LLMs still rely on memorization for optimization problems