Do fine-tuned language models actually learn optimization procedures?
Can RL fine-tuning teach LLMs to solve constraint-optimization problems through genuine reasoning, or does it merely sharpen pattern-matching? Testing on out-of-distribution variants reveals the mechanism.
The constraint-optimization study uses a clean diagnostic to separate procedure from pattern: an N-case test set (in-distribution power-grid topologies) and an N-1 test set (the same problems with one element removed, which puts them out of distribution while keeping the structure recognizable). A model that executes the actual procedure should perform comparably on both. A model that pattern-matches should perform worse on N-1.
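The comparison described above can be sketched as a small evaluation harness. This is a minimal illustration, not the study's code: the `Instance` type, the `solve` callable standing in for the model, and the per-instance `is_correct` checker are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Instance:
    """One problem plus a checker for candidate answers (hypothetical type)."""
    prompt: str
    is_correct: Callable[[str], bool]


def accuracy(solve: Callable[[str], str], instances: Sequence[Instance]) -> float:
    """Fraction of instances whose answer passes its own checker."""
    if not instances:
        raise ValueError("need at least one instance")
    hits = sum(inst.is_correct(solve(inst.prompt)) for inst in instances)
    return hits / len(instances)


def structure_shift_gap(
    solve: Callable[[str], str],
    n_set: Sequence[Instance],
    n_minus_1_set: Sequence[Instance],
) -> dict:
    """Score the same solver on the N set and the N-1 set and report the gap."""
    acc_n = accuracy(solve, n_set)
    acc_n1 = accuracy(solve, n_minus_1_set)
    # A gap near zero is what a solver running the procedure would show;
    # a large positive gap is what a template-matcher would show.
    return {"acc_n": acc_n, "acc_n_minus_1": acc_n1, "gap": acc_n - acc_n1}
```

With a lookup-table solver that only knows the N prompts, `gap` comes out large; with a solver that actually computes the answer, it stays near zero. How large a gap counts as "marked" is a judgment the sketch leaves to the evaluator.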
Even after training with GRPO and a constraint-satisfaction reward, models degrade markedly on N-1. The conclusion is that RL on outcome-based rewards does not install the missing procedure; it sharpens the template-matching strategy along the in-distribution axis. The model gets better at recognizing patterns it has seen and, relative to that gain, worse at adapting to perturbed structure.
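To make "outcome-based" concrete, here is a rough sketch of the kind of signal such training optimizes. The reward shape (fraction of constraints satisfied) is an assumption for illustration; the note does not specify the study's reward. The group-relative normalization follows the general GRPO idea of scoring each sampled answer against the other samples for the same prompt. All function names are hypothetical.

```python
import statistics
from typing import Callable, Sequence

# A constraint is any predicate over a candidate solution vector (assumed form).
Constraint = Callable[[Sequence[float]], bool]


def constraint_reward(solution: Sequence[float], constraints: Sequence[Constraint]) -> float:
    """Assumed reward: fraction of constraints the final answer satisfies."""
    if not constraints:
        raise ValueError("need at least one constraint")
    return sum(c(solution) for c in constraints) / len(constraints)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its sampling group (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every sample earns the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

Nothing in either function inspects how the answer was produced. A recalled template and an executed procedure earn the same reward whenever the final answer satisfies the same constraints, which is consistent with the note's reading that this signal can sharpen recall without installing the procedure.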
This is methodologically important because it provides a probe that other reasoning evaluations lack. Most benchmarks cannot distinguish "the model solved this" from "the model recognized this." The N / N-1 comparison forces the distinction by holding the problem class fixed while perturbing the instance. The accuracy drop from N to N-1 is the memorization signature.
For practitioners, the diagnostic generalizes. Wherever a deployment cares whether a model is computing or recalling — clinical reasoning, legal-statute reasoning, scientific problem-solving — building an "N-1" counterpart of the canonical test set is a cheap way to surface memorization. The structure-shift probe is more informative than headline accuracy on the canonical set.
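For a graph-shaped problem such as a grid topology, an N-1 counterpart can be generated mechanically. The sketch below assumes a problem is an undirected graph given as a node set and an edge list, that "one element" means one edge, and that variants which disconnect the graph should be discarded to keep the problem well-posed. These are illustrative assumptions, not the study's construction.

```python
from typing import Iterator

Edge = tuple[int, int]


def is_connected(nodes: set[int], edges: list[Edge]) -> bool:
    """Depth-first search check that every node is reachable from any start node."""
    if not nodes:
        return True
    adj: dict[int, set[int]] = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        cur = stack.pop()
        for nxt in adj[cur] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen == nodes


def n_minus_1_variants(nodes: set[int], edges: list[Edge]) -> Iterator[list[Edge]]:
    """Yield each edge list obtained by removing exactly one edge, if still connected."""
    for i in range(len(edges)):
        reduced = edges[:i] + edges[i + 1:]
        if is_connected(nodes, reduced):
            yield reduced
```

In other domains the same pattern applies with a different notion of "element": one clause of a statute, one finding in a case vignette, one term in a physical setup. The reference answer for each variant still has to come from a trusted solver or expert, since removing an element generally changes the correct answer.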
Inquiring lines that use this note as a source 92
This note is a source for these synthesized inquiries.
- Can communication problems and optimization problems be addressed with the same alignment approaches?
- How do unstated constraints become invisible to training data distributions?
- Can benchmarks designed for shortcut learning detect heuristic override failures?
- How do different LLM integration paradigms affect inheritance of pretraining biases?
- How should benchmarks test whether models fit algorithms or patterns?
- Can universal function approximators be expensive to learn in practice?
- Can instruction tuning succeed without explicit task understanding?
- Does correct model behavior guarantee internal alignment of learned objectives?
- How can safety-aligned parameters be protected during user-specific fine-tuning?
- What hidden costs emerge when you fine-tune models for a single domain?
- Can prompt optimization alone inject knowledge models don't already have?
- How much alignment data does a language model actually need to specialize well?
- What breaks when you apply reinforcement learning after supervised fine-tuning?
- How does fitness-proportional selection guide LLM recombination in unstructured solution spaces?
- Can fine-tuning ever teach semantic inference instead of amplifying training shortcuts?
- How do LLMs compress specific expert knowledge into median abstraction?
- Why does NLI fine-tuning amplify frequency bias instead of teaching inference?
- How does preference-based training compare to supervised fine-tuning for function calling?
- Does supervised fine-tuning improve accuracy while damaging the quality of reasoning?
- How does self-distillation differ from standard fine-tuning approaches?
- Does reasoning fine-tuning actually reduce a model's ability to abstain?
- Does fine-tuning on NLI tasks amplify or reduce frequency bias in language models?
- Can instance seeds work for tasks beyond language understanding benchmarks?
- Can prompt optimization inject genuinely new knowledge into a model?
- Why does fine-tuning change how models process retrieved context?
- How does behavioral fine-tuning differ from factual knowledge encoding in models?
- Do LLMs rely on surface heuristics instead of learning recursive grammar rules?
- Why do rare complex structures in training data harm LLM generalization?
- Why does fine-tuning improve some capabilities while degrading others?
- Can LLMs improve at simple deduction through different training approaches?
- Why does fine-tuning fail to remove temporal contamination from pretraining?
- Do instruction-tuned models learn tasks or just output format distributions?
- Can prompt optimization or fine-tuning inject knowledge models do not already contain?
- Why does genetic programming outperform direct LLM generation by 86 percent?
- What separates pattern matching from genuine language understanding?
- Why do production systems optimize for three model classes instead of foundation models?
- Why do fine-tuned models fail outside their specialized domains?
- What knowledge can prompt optimization actually activate in trained models?
- Does fine-tuning actually change model capabilities or only output distribution?
- How do language agents become optimizable computational graphs automatically?
- Why do large language models outperform fine-tuned models once repeated items are removed?
- Can smaller models achieve domain expertise through focused RL training?
- Why does supervised fine-tuning degrade reasoning quality despite raising accuracy?
- Which recipe choices determine the asymptotic ceiling in RL training?
- Can RL format selection explain performance gains attributed to algorithmic improvements?
- How does task-oriented fine-tuning compare to preference tuning methods?
- How does data quality mismatch create reasoning degradation in supervised fine-tuning?
- What filtering criteria best identify student-compatible refinements from teacher models?
- Does reasoning fine-tuning actually harm a model's ability to abstain?
- Can reasoning fine-tuning improve both capability and instruction compliance together?
- Why does reasoning fine-tuning reduce a model's ability to abstain?
- Can you control LLM reasoning strategy without fine-tuning the model?
- What mechanism causes LLMs to plateau on numerical optimization tasks?
- Why do reasoning models fail to improve constrained optimization performance?
- What concrete problems do LLMs solve at the computational level?
- How do recommender metrics drive LLM query refinement in closed-loop training?
- Does supervised fine-tuning improve reasoning or just response formatting?
- How do out-of-distribution tests reveal that optimization learning is memorization?
- How does LLM simulation of APIs avoid instability without sacrificing training signal?
- What limits the effectiveness of formal language pretraining on transformer architectures?
- Does RL amplify existing reasoning or create genuinely new computational strategies?
- What happens to base model capabilities when you apply finetuning?
- How do retrieval and fine-tuning trade off flexibility against training cost?
- How should skill libraries coordinate with gradient-based weight optimization?
- Does fine-tuning push models toward reasoning shortcuts that bypass the chain entirely?
- Where does skill extraction fail compared to genuine model adaptation?
- Can memory-based adaptation and gradient fine-tuning operate on complementary timescales?
- Can models adapt and combine search strategies beyond their training algorithm?
- How much does pretraining quality affect the modularity of fine-tuned models?
- Does preference tuning help or hurt the exploration of solution spaces in code?
- Why does step-level expert alignment work when outcome-only RL fails?
- What makes supervised fine-tuning worsen RL exploration later?
- Does the pretrained model prior limit RL search capability more than the optimization algorithm itself?
- Does fine-tuning a small model match fine-tuning a large one?
- Can partial solution traces convert unproductive hard samples into learnable training data?
- How do verifier-free RL patterns differ from traditional RLHF approaches?
- Can categorical correctness signals stop dense optimizers from finding loopholes?
- Can trained models encode programs more complex than their data-generating process?
- Why does prompt optimization alone fail to inject genuinely new knowledge?
- How does the pretraining distribution shape what LLMs find hard?
- What constraint satisfaction rate do LLMs achieve at scale?
- When does RL discover genuinely novel reasoning strategies versus timing optimization?
- How do sparse parameter updates enable when-not-how training to work?
- Can single-problem fine-tuning match full RL pipeline reasoning gains?
- Can approximate or noisy reference answers work for RL-based reasoning training?
- Can LLMs simultaneously reason and optimize their own modules?
- Which finetuning method works best across different task and data regimes?
- How do finetuning and pretraining improvements differ in their effects on model capabilities?
- How do normalization and input injection control emergence of fixed points?
- Can instruction prompts reliably steer an LLM judge toward specific alignment targets?
- What empirical evidence supports the Learning Law on real language models?
- How does preference learning differ from supervised finetuning for reasoning?
Related concepts in this collection 3
- Do large language models actually perform iterative optimization?
  Explores whether LLMs execute genuine numerical procedures like Newton-Raphson or instead pattern-match to memorized solution templates when solving constrained optimization problems.
  Relation: same paper, the mechanism this diagnostic exposes.
- Do larger language models solve constrained optimization better?
  Explores whether scaling LLMs (through more parameters, better training, or reasoning extensions) improves their ability to satisfy constraints in real optimization problems like power grids and portfolios.
  Relation: same paper, the parent finding.
- Does supervised fine-tuning actually improve reasoning on optimization problems?
  When SFT boosts benchmark scores on constraint-optimization tasks, does it genuinely improve the model's ability to find feasible solutions, or just its ability to format answers convincingly?
  Relation: same paper, complementary memorization signature.
Original note title
N-1 out-of-distribution tests reveal that RL fine-tuned LLMs still rely on memorization for optimization problems