INQUIRING LINE


Can an AI taught one search strategy learn to pick, blend, or invent others when the problem changes?

Can models adapt and combine search strategies beyond their training algorithm?

This explores whether a model can do more than execute the fixed search procedure it was trained on — whether it can pick, blend, adapt, or even invent search strategies. The corpus is surprisingly optimistic, but with a sharp caveat at the end.

The strongest case for adaptation comes from work that treats search itself as something a model learns rather than something hard-coded around it. Training on full, messy search traces, including the wrong turns and backtracking, produces models that build an internal world model of searching and improvise adaptive strategies, beating models trained only on clean optimal answers (see "Does training on messy search processes improve reasoning?"). Push that further and you can train on linearized traces of actual algorithms like MCTS and A*, and the model internalizes the algorithm rather than the answer, which means it can then optimize over search strategies themselves, potentially reaching novel ones (see "Can models learn to internalize search algorithms through training?"). So 'beyond the training algorithm' isn't a contradiction: the point of training on the process is to free the model from any single fixed procedure.
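To make the difference between process-level and answer-only training data concrete, here is a minimal sketch that serializes a depth-first search, dead ends included, into a single string. The toy graph, the `explore`/`backtrack`/`goal` vocabulary, and the function name are illustrative assumptions, not the format used in the cited work.

```python
from typing import Dict, List, Optional, Tuple


def dfs_with_trace(
    graph: Dict[str, List[str]], start: str, goal: str
) -> Tuple[Optional[List[str]], List[str]]:
    """Depth-first search that logs every step, including dead ends."""
    trace: List[str] = []
    visited = set()

    def visit(node: str, path: List[str]) -> Optional[List[str]]:
        visited.add(node)
        trace.append(f"explore {node}")
        if node == goal:
            trace.append(f"goal {node}")
            return path
        for child in graph.get(node, []):
            if child not in visited:
                found = visit(child, path + [child])
                if found is not None:
                    return found
        # No child led to the goal: record the wrong turn explicitly.
        trace.append(f"backtrack {node}")
        return None

    return visit(start, [start]), trace


# Hypothetical toy problem: the first branch (A -> B -> D) is a dead end.
GRAPH = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
path, trace = dfs_with_trace(GRAPH, "A", "E")

# Process-level training string: keeps the exploration and the backtracking.
full_trace_example = " ; ".join(trace)
# Answer-only training string: just the clean final path.
optimal_only_example = " -> ".join(path)

print(full_trace_example)
print(optimal_only_example)
```

The first string carries the failed branch and the recovery from it; the second carries only the result. The claim summarized above is that training on the first kind of string is what lets a model learn the searching itself.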

The boldest answer is a system that rewrites its own search code. A bilevel 'autoresearch' loop reads its inner search mechanism, spots bottlenecks, and writes new Python at runtime, discovering combinatorial-optimization and bandit methods that broke its original deterministic patterns and delivered a 5x gain on GPT pretraining (see "Can an AI system improve its own search methods automatically?"). That's combining and inventing strategies in the most literal sense. Quieter versions of the same idea appear at inference time: evolutionary search uses the model to generate its own mutations and crossovers, sustaining diversity to avoid the dead-ends that simple resampling falls into (see "Can evolutionary search beat sampling and revision at inference time?"), and swarms of model 'particles' move through weight space to compose experts that answer questions none of the starting models could, with no gradient training at all (see "Can language models discover new expertise through collaborative weight search?"). Adaptation here lives in how models are combined, not in any one model's weights.
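The inference-time evolutionary idea can be sketched as an island-model genetic algorithm. In the sketch below, `mutate` and `crossover` are simple bit-string operators standing in for the LLM-generated edits described above, and the fitness function, population sizes, and ring-migration schedule are all illustrative assumptions rather than the cited system's settings.

```python
import random
from typing import List, Tuple

Candidate = List[int]


def fitness(c: Candidate) -> int:
    """Toy objective (count of ones); a real system would score a plan."""
    return sum(c)


def mutate(c: Candidate, rng: random.Random) -> Candidate:
    """Stand-in for an LLM-proposed revision: flip one position."""
    child = c[:]
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child


def crossover(a: Candidate, b: Candidate, rng: random.Random) -> Candidate:
    """Stand-in for an LLM-proposed merge of two parents."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def evolve_islands(
    n_islands: int = 3, pop_size: int = 6, length: int = 12,
    generations: int = 30, migrate_every: int = 5, seed: int = 0,
) -> Tuple[Candidate, List[int]]:
    rng = random.Random(seed)
    islands = [
        [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
        for _ in range(n_islands)
    ]
    history: List[int] = []
    for gen in range(1, generations + 1):
        for k, pop in enumerate(islands):
            pop.sort(key=fitness, reverse=True)
            elite = pop[: pop_size // 2]  # survivors kept unchanged
            children = [
                mutate(crossover(rng.choice(elite), rng.choice(elite), rng), rng)
                for _ in range(pop_size - len(elite))
            ]
            islands[k] = elite + children
        if gen % migrate_every == 0:
            # Ring migration: each island's best replaces the next island's worst.
            bests = [max(pop, key=fitness) for pop in islands]
            for k, pop in enumerate(islands):
                worst = min(range(len(pop)), key=lambda i: fitness(pop[i]))
                pop[worst] = bests[(k - 1) % n_islands][:]
        history.append(max(fitness(c) for pop in islands for c in pop))
    best = max((c for pop in islands for c in pop), key=fitness)
    return best, history


best, history = evolve_islands()
print(fitness(best), history[:5])
```

Islands evolve mostly in isolation and exchange candidates only occasionally, which is one common way to keep a population from collapsing onto a single early solution.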

There's a parallel thread on adapting which skills to deploy rather than which search to run. Models can compose task-specific expert vectors at inference, dynamically mixing them per problem without retraining (see "Can models dynamically activate expert skills at inference time?"), and self-play setups generate their own curriculum of problems and verify their own answers, improving without any external data or fixed target (see "Can language models improve themselves without any external training data?"). Even cheap, weightless adaptation works: agents that store written reflections on their failures in episodic memory get better across attempts without a single parameter update (see "Can agents learn from failure without updating their weights?"), and tree search can manufacture its own quality signal in place of human feedback (see "Can tree search replace human feedback in LLM training?").
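The weightless adaptation described above, where written reflections accumulate in episodic memory while the policy stays fixed, can be sketched as follows. The `frozen_policy` function, the strategy names, and the reflection wording are hypothetical stand-ins; a real agent would call a language model for both the action and the reflection.

```python
from typing import List, Tuple


def frozen_policy(options: List[str], memory: List[str]) -> str:
    """Stand-in for a frozen model: pick the first option memory has not ruled out."""
    notes = " ".join(memory)
    for option in options:
        if f"'{option}' failed" not in notes:
            return option
    return options[-1]


def environment(action: str) -> bool:
    """Unambiguous binary feedback: only one strategy succeeds in this toy task."""
    return action == "binary_search"


def run_episodes(options: List[str], max_attempts: int = 5) -> Tuple[List[str], List[str]]:
    memory: List[str] = []    # episodic memory of written reflections
    attempts: List[str] = []
    for _ in range(max_attempts):
        action = frozen_policy(options, memory)
        attempts.append(action)
        if environment(action):
            break
        # The reflection is stored as plain text; no parameter is updated.
        memory.append(f"'{action}' failed; try a different strategy next time.")
    return attempts, memory


OPTIONS = ["linear_scan", "random_guess", "binary_search"]
attempts, memory = run_episodes(OPTIONS)
print(attempts)
print(memory)
```

The policy function never changes between attempts. Only the text it is conditioned on grows, which is the sense in which the improvement happens without any weight update.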

Here's the doorway you might not expect: a lot of apparent 'adaptation' is an illusion of memorization. RL fine-tuning often sharpens template-matching rather than installing a real procedure: models that look like they learned to optimize collapse on out-of-distribution variants of the same task (see "Do fine-tuned language models actually learn optimization procedures?"). And on genuine constrained-optimization problems, models plateau around 55–60% constraint satisfaction regardless of scale, architecture, or training regime, which reads as a ceiling rather than a gap waiting for more compute (see "Do larger language models solve constrained optimization better?"). So the corpus splits cleanly: when search is made explicit (trained on as a process, evolved at inference, or rewritten by an outer loop), models genuinely combine and extend strategies. When you just fine-tune and hope the strategy generalizes, you often get a memorized template wearing the costume of a search algorithm.

Sources (11 notes)

Does training on messy search processes improve reasoning?

Stream of Search pretraining, which represents exploration and backtracking as serialized strings, achieves 25% higher accuracy than optimal-trajectory-only training. Models learn internal world models for search and adaptive strategies rather than fixed external methods.

Can models learn to internalize search algorithms through training?

Meta-CoT demonstrates that instruction-tuning on linearized MCTS and A* traces teaches models to implement search strategies internally. This enables optimization over algorithms themselves rather than specific outputs, potentially unlocking novel reasoning strategies.

Can an AI system improve its own search methods automatically?

An outer loop successfully read inner loop code, identified bottlenecks, and generated new Python mechanisms at runtime, discovering combinatorial optimization and bandit methods that broke the inner loop's deterministic patterns and improved performance on GPT pretraining by 5x.

Can evolutionary search beat sampling and revision at inference time?

Mind Evolution uses genetic algorithms with LLM-generated mutations and crossovers to significantly outperform Best-of-N and Sequential Revision on planning benchmarks. An island model sustains population diversity, preventing the premature convergence that single-trajectory refinement exhibits.

Can language models discover new expertise through collaborative weight search?

PSO-inspired swarms of LLM particles moving through weight space discover composed experts with new capabilities—including answering questions all initial experts failed on—using only 200 validation examples and no gradient-based training.
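A gradient-free search over weight vectors can be sketched with the textbook particle swarm update. This is generic PSO under stated assumptions, not the cited paper's exact update rule: the three-dimensional "weights", the `utility` function (a stand-in for a score on a small validation set), and every hyperparameter are hypothetical.

```python
import random
from typing import List, Tuple

Vector = List[float]


def utility(weights: Vector) -> float:
    """Stand-in for a validation score; higher is better."""
    target = [0.5, -1.0, 2.0]
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


def swarm_search(
    n_particles: int = 8, dim: int = 3, steps: int = 50,
    inertia: float = 0.6, c_personal: float = 1.2, c_global: float = 1.2,
    seed: int = 0,
) -> Tuple[Vector, float, float]:
    rng = random.Random(seed)
    pos = [[rng.uniform(-3, 3) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]              # each particle's best position so far
    gbest = max(pbest, key=utility)[:]       # best position found by the swarm
    initial_best = utility(gbest)
    for _ in range(steps):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity mixes momentum, pull to personal best, pull to global best.
                vel[i][d] = (
                    inertia * vel[i][d]
                    + c_personal * r1 * (pbest[i][d] - pos[i][d])
                    + c_global * r2 * (gbest[d] - pos[i][d])
                )
                pos[i][d] += vel[i][d]
            if utility(pos[i]) > utility(pbest[i]):
                pbest[i] = pos[i][:]
            if utility(pbest[i]) > utility(gbest):
                gbest = pbest[i][:]
    return gbest, initial_best, utility(gbest)


gbest, initial_best, final_best = swarm_search()
print(initial_best, final_best)
```

Only utility evaluations drive the search, so nothing here requires differentiating through the model.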


Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Can language models improve themselves without any external training data?

SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.
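Majority-vote verification can be sketched in a few lines: sample several answers to the same problem, treat the most common one as a pseudo-label, and reward the samples that agree with it. The function names and the 0/1 reward scheme are illustrative assumptions, not the cited framework's exact reward design.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(samples: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the fraction of samples that agree."""
    counts = Counter(samples)
    # Ties resolve to the answer that appeared first among the tied ones.
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


def solver_rewards(samples: List[str]) -> List[float]:
    """Reward each sample by agreement with the pseudo-label (no ground truth used)."""
    pseudo_label, _ = majority_vote(samples)
    return [1.0 if s == pseudo_label else 0.0 for s in samples]


# Hypothetical answers sampled from a solver for one generated problem.
samples = ["12", "12", "15", "12"]
pseudo_label, agreement = majority_vote(samples)
rewards = solver_rewards(samples)
print(pseudo_label, agreement, rewards)
```

The agreement fraction is also a rough difficulty signal: values near 1.0 suggest the problem is easy for the solver, while values near chance suggest it is too hard or ill-posed, which is the kind of information a proposer could use to calibrate its curriculum.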

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Can tree search replace human feedback in LLM training?

AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence (2.55 match, arXiv)
A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well? (2.51 match, arXiv)
Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains (2.47 match, arXiv)
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing (1.76 match, arXiv)
Stream of Search (SoS): Learning to Search in Language (1.73 match, arXiv)
Chain-of-thought Reasoning Is A Policy Improvement Operator (1.71 match, arXiv)
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought (1.70 match, arXiv)
Are Emergent Abilities in Large Language Models just In-Context Learning? (1.70 match, arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: **Can models adapt and combine search strategies beyond their training algorithm?** — remains open. Treat this as still-contested ground, not settled.

**What a curated library found — and when (dated claims, not current truth):**
Findings span 2023–2026; treat all as perishable pending re-test.
- Training on full search traces *including mistakes* produces adaptive strategies that beat clean-answer training; models internalize algorithms (MCTS, A*) rather than answers (~2024).
- Bilevel 'autoresearch' loops enable models to rewrite their own search code at runtime, discovering novel combinatorial and bandit methods with 5× performance gains (~2026).
- Evolutionary search at inference sustains diversity; swarm intelligence composes expert weights in weight space without gradient retraining (~2024–2025).
- Self-adaptive models compose task-specific expert vectors at inference; episodic memory + verbal reflection improve models across attempts with zero parameter updates (~2024–2025).
- **Sharp caveat**: RL fine-tuning often installs template-matching, not genuine procedures; models collapse on out-of-distribution variants. Models plateau at 55–60% on genuine constrained-optimization regardless of scale (~2025–2026).

**Anchor papers (verify; mind their dates):**
- arXiv:2404.03683 *Stream of Search* (2024-04): training on search as a process.
- arXiv:2410.11163 *Model Swarms* (2024-10): swarm intelligence in weight space.
- arXiv:2501.06252 *Transformer²: Self-adaptive LLMs* (2025-01): inference-time expert composition.
- arXiv:2603.23420 *Bilevel Autoresearch* (2026-03): meta-optimization and runtime code rewriting.

**Your task:**
(1) **RE-TEST THE PLATEAU AND MEMORIZATION CLAIMS.** Does the 55–60% ceiling on constrained-optimization still hold under newer scaling laws, constitutional AI, or test-time scaling (e.g., o1-style inference compute)? For each 'genuine adaptation' claim above, check whether newer models, inference harnesses (long-context caching, multi-step orchestration), or evaluation methodologies have since relaxed the out-of-distribution collapse. Separate durable question (can models build *generalizable* search procedures?) from perishable limitation (maybe modern RL no longer confuses memorization with procedure).
(2) **Surface the strongest contradicting work from the last ~6 months.** If a recent paper shows models *cannot* combine strategies, or that swarm/evolutionary methods regress on held-out tasks, flag it prominently.
(3) **Propose 2 research questions that assume the regime has moved.** E.g., (a) At what scale does genuine procedure-learning outpace memorization? (b) Can models learn *why* one search strategy beats another on a problem class, or only learn *that* it does?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
