Can you control LLM reasoning strategy without fine-tuning the model?
This explores whether you can shape *how* a model reasons — its strategy, its sequence of steps — using external scaffolding (prompts, algorithms, tools) instead of changing the model's weights through training.
The corpus answers an emphatic yes: reasoning strategy can be steered from the outside, through prompts, control flow, and tool calls, rather than baked in by fine-tuning. More interestingly, it suggests external control often works *better* than training. The cleanest demonstration is cognitive tools: four reasoning operations implemented as sandboxed model calls lifted GPT-4.1 on AIME 2024, a hard math benchmark, from 26.7% to 43.3% with no RL training at all (see "Can modular cognitive tools unlock reasoning without training?"). The key insight there is that the capability already existed inside the model; what was missing was *structure*. Plain prompting can't guarantee that the model actually isolates one operation at a time, but wrapping operations as discrete tool calls enforces it.
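A minimal sketch of that isolation idea, under stated assumptions: `ModelFn` is a hypothetical stand-in for a real model client, and the four tool names and their instructions are placeholders invented here, not the operations used in the cited work. The fixed `plan` is also a simplification. The point is only that each call gets a freshly built prompt containing one tool's instructions and nothing from the others.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for a real model client: prompt in, text out.
ModelFn = Callable[[str], str]


@dataclass(frozen=True)
class CognitiveTool:
    name: str
    instructions: str

    def run(self, model: ModelFn, payload: str) -> str:
        # A fresh prompt per call: this tool's instructions plus the payload.
        # No other tool's instructions and no conversation history are included.
        prompt = f"TOOL: {self.name}\n{self.instructions}\n\nINPUT:\n{payload}"
        return model(prompt)


# Placeholder tool names and wording, invented for this sketch.
TOOLS: Dict[str, CognitiveTool] = {
    tool.name: tool
    for tool in (
        CognitiveTool("restate", "Restate the problem and list what is given."),
        CognitiveTool("recall", "Recall one similar solved problem."),
        CognitiveTool("check", "Check the proposed answer for errors."),
        CognitiveTool("backtrack", "Name the faulty step and propose an alternative."),
    )
}


def run_plan(model: ModelFn, question: str, plan: List[str]) -> List[str]:
    """Run tools in order; each call sees the question and the previous output only."""
    outputs: List[str] = []
    for tool_name in plan:
        if outputs:
            payload = f"{question}\n\nPREVIOUS OUTPUT:\n{outputs[-1]}"
        else:
            payload = question
        outputs.append(TOOLS[tool_name].run(model, payload))
    return outputs


if __name__ == "__main__":
    def echo_model(prompt: str) -> str:
        # Fake model so the sketch runs offline: reports which tool was invoked.
        return "ran " + prompt.splitlines()[0]

    for line in run_plan(echo_model, "What is 17 * 23?", ["restate", "check"]):
        print(line)
```

Swapping `echo_model` for a real client is the only change needed to try this against an actual model; the orchestration code itself never changes the weights.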
The same logic drives LLM Programs, which embed the model inside an explicit algorithm that manages control flow and state and feeds each call only the context relevant to its current step (see "Can algorithms control LLM reasoning better than LLMs alone?"). Notice the shared mechanism across both: control comes from *information hiding and modular isolation*, not from teaching the model anything new. You're not changing what the model knows; you're changing what it sees and when. This reframes "reasoning strategy" as an orchestration problem rather than a weights problem.
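To make the information-hiding point concrete, here is a small sketch in the spirit of an LLM Program, not a reproduction of the cited method. It assumes a hypothetical `ModelFn` client and an invented two-step task (filter paragraphs, then answer). The program holds the state in ordinary Python variables, and each model call is shown only the slice of context its step needs.

```python
from typing import Callable, List

# Hypothetical stand-in for a real model client: prompt in, text out.
ModelFn = Callable[[str], str]


def select_relevant(model: ModelFn, question: str, paragraphs: List[str]) -> List[str]:
    """Step 1: judge each paragraph in its own call; the model never sees the others."""
    kept: List[str] = []
    for paragraph in paragraphs:
        prompt = (
            f"Question: {question}\n"
            f"Paragraph: {paragraph}\n"
            "Is this paragraph relevant to the question? Answer yes or no."
        )
        if model(prompt).strip().lower().startswith("yes"):
            kept.append(paragraph)
    return kept


def answer_question(model: ModelFn, question: str, paragraphs: List[str]) -> str:
    """The program, not the model, owns control flow and state."""
    evidence = select_relevant(model, question, paragraphs)  # state lives in Python
    if not evidence:
        return "no relevant evidence found"
    # Step 2: the answering call sees only the filtered evidence.
    prompt = "Evidence:\n" + "\n".join(evidence) + f"\n\nQuestion: {question}\nAnswer:"
    return model(prompt).strip()
```

Because each step is an ordinary function, a failure can be traced to a single prompt and a single response, which is what makes this style debuggable.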
What makes the no-fine-tuning case stronger is the corpus's skepticism about whether fine-tuning installs reasoning in the first place. RL fine-tuning tends to *sharpen memorization* rather than install genuine procedures: GRPO-trained models that ace in-distribution problems drop sharply on out-of-distribution variants, which points to template-matching dressed up as reasoning (see "Do fine-tuned language models actually learn optimization procedures?"). And turning a model into an action-taking agent isn't just a fine-tuning job either: it requires transforming the whole pipeline (datasets, action grounding, memory, tools, safety evaluation), with the surrounding harness, not the weights, deciding whether actions are grounded or hallucinated (see "Can you turn an LLM into an agent by just fine-tuning?"). The center of gravity keeps shifting from the model to the system around it.
But external control has real ceilings, and this is the part you might not expect. Models arrive with their *own* innate strategic styles: in an analysis of 22 models, GPT-o1 defaults to minimax, DeepSeek-R1 to trust-based reasoning, and o3-mini to belief-anticipation, and these profiles track the model, not just the prompt (see "Do large language models use one reasoning style or many?"). So scaffolding steers a model that already has a temperament. Worse, even when you orchestrate the steps, reasoning models tend to *wander*, exploring unsystematically rather than searching, which makes success drop exponentially as problems get deeper (see "Why do reasoning LLMs fail at deeper problem solving?"). External structure can elicit and channel latent ability, but it can't conjure a search discipline the model lacks.
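A toy model, offered as an illustration rather than the cited analysis, shows why the drop is exponential. Assume a solution needs $d$ sequential steps and each step succeeds independently with probability $p$; both symbols and the independence assumption are introduced here for the sketch.

```latex
\[
P_{\text{success}}(d) = p^{d}, \qquad 0 < p < 1
\]
\[
p = 0.9:\quad
P_{\text{success}}(5) \approx 0.59,\quad
P_{\text{success}}(10) \approx 0.35,\quad
P_{\text{success}}(50) \approx 0.005
\]
```

Even with a 90% per-step success rate, a 50-step problem is solved about once in 200 attempts under this model, which matches the qualitative picture of medium problems being solvable and deep ones being far harder.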
The deeper limit is the kind of reasoning that is in there to be controlled. When semantic content is stripped from a task, model performance collapses even with the correct rules sitting in context; these models behave as semantic associators, not symbolic logicians (see "Do large language models reason symbolically or semantically?"). So the honest framing is this: you can substantially control reasoning *strategy* without fine-tuning (sequencing, isolation, tool use, which capability gets elicited), but you're conducting an orchestra whose instruments are fixed. Scaffolding redirects existing capability; it doesn't manufacture new kinds of it.
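For a sense of what "stripping semantic content" can look like, here is a hypothetical sketch, not the procedure from the cited work. The `symbolize` helper and the example statements are invented for illustration. It swaps each content word for an arbitrary symbol, so the logical form survives while the familiar meaning is gone.

```python
import re
from typing import Dict, List


def symbolize(statements: List[str], terms: List[str]) -> List[str]:
    """Replace each content term with an arbitrary symbol, consistently."""
    if not terms:
        return list(statements)
    mapping: Dict[str, str] = {term: f"X{i}" for i, term in enumerate(terms, start=1)}
    # Longest terms first, so a term that contains another is replaced whole.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b")
    return [pattern.sub(lambda m: mapping[m.group(1)], s) for s in statements]


semantic = [
    "Every sparrow is a bird.",
    "Every bird is an animal.",
    "Therefore every sparrow is an animal.",
]
symbolic = symbolize(semantic, ["sparrow", "bird", "animal"])

for line in symbolic:
    print(line)
```

The two versions pose the same deduction. A purely symbolic reasoner would treat them identically, so a gap in accuracy between them is evidence of reliance on word meaning.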
Sources (7 notes)
Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME 2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.
LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
Converting LLMs to action-capable systems requires four distinct stages: curating action-environment-user datasets, training for action grounding, integrating agent infrastructure with memory and tools, and rigorous safety evaluation. The surrounding system and harness determine whether actions are grounded or hallucinated.
Analysis of 22 LLMs across behavioral game theory reveals three dominant profiles: GPT-o1 uses minimax reasoning, DeepSeek-R1 uses trust-based reasoning, and GPT-o3-mini uses belief-anticipation. Performance correlates with game structure, not raw reasoning depth.
Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.