INQUIRING LINE

You can reshape how an AI reasons using only scaffolding and tools — no retraining required — and often get better results.

Can you control LLM reasoning strategy without fine-tuning the model?

This explores whether you can shape *how* a model reasons — its strategy, its sequence of steps — using external scaffolding (prompts, algorithms, tools) instead of changing the model's weights through training.

The corpus answers with an emphatic yes: reasoning strategy can be steered from the outside, through prompts, control flow, and tool calls, rather than baked in by fine-tuning. More interestingly, it suggests external control often works *better* than training. The cleanest demonstration is cognitive tools: four reasoning operations implemented as sandboxed model calls lifted GPT-4.1 on AIME 2024, a hard math benchmark, from 26.7% to 43.3% (a gain of 16.6 percentage points) with no RL training at all (see "Can modular cognitive tools unlock reasoning without training?"). The key insight is that the capability already existed inside the model; what was missing was *structure*. Plain prompting can't guarantee that the model actually isolates one operation at a time, but wrapping each operation as a discrete tool call enforces it.
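To make "operations as discrete tool calls" concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the tool names and prompt wording are hypothetical rather than taken from the cited work, and `echo_model` is a stub standing in for a real model client. The point it shows is only the isolation property: each call receives a fresh prompt containing one operation and one payload, and nothing else.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a real model client: takes a prompt, returns text.
ModelFn = Callable[[str], str]

# Illustrative tool prompts. The names and wording are assumptions for this
# sketch, not the prompts used in the cited work.
TOOL_PROMPTS: Dict[str, str] = {
    "understand": "Restate the problem and list the known quantities.\n\n{payload}",
    "recall": "Recall one similar solved problem and its method.\n\n{payload}",
    "examine": "Check this candidate answer for errors.\n\n{payload}",
    "backtrack": "Identify the flawed step and propose an alternative.\n\n{payload}",
}


def call_tool(model: ModelFn, tool: str, payload: str) -> str:
    """Run one operation in a fresh context.

    The model call sees only this tool's instruction plus the payload,
    never the rest of the conversation.
    """
    if tool not in TOOL_PROMPTS:
        raise KeyError(f"unknown tool: {tool}")
    return model(TOOL_PROMPTS[tool].format(payload=payload))


def echo_model(prompt: str) -> str:
    """Stub model so the sketch runs offline; it echoes the prompt's first line."""
    return "[stub] " + prompt.splitlines()[0]


if __name__ == "__main__":
    print(call_tool(echo_model, "understand", "Find x if 2x + 3 = 11."))
```

Swapping `echo_model` for a real client is the only change needed to try this against an actual model; the enforcement lives in `call_tool`, not in the prompt text.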

The same logic drives LLM Programs, which embed the model inside an explicit algorithm that manages control flow and state and feeds each call only the context relevant to its current step (see "Can algorithms control LLM reasoning better than LLMs alone?"). The mechanism is shared across both: control comes from *information hiding and modular isolation*, not from teaching the model anything new. You're not changing what the model knows; you're changing what it sees and when. This reframes "reasoning strategy" as an orchestration problem rather than a weights problem.
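A minimal sketch of the LLM Program idea follows, under stated assumptions: the task (extract facts from passages, then answer), the prompt wording, `ProgramState`, `run_program`, and `stub_model` are all hypothetical and chosen only to show the pattern. Ordinary code owns the loop and the state, and each model call is shown one passage at a time rather than the whole document set.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical stand-in for a real model client.
ModelFn = Callable[[str], str]


@dataclass
class ProgramState:
    """State is held by the program, not by the model's context window."""
    question: str
    notes: List[str] = field(default_factory=list)
    prompts_seen: List[str] = field(default_factory=list)  # kept for inspection


def run_program(model: ModelFn, question: str,
                documents: List[str]) -> Tuple[str, ProgramState]:
    state = ProgramState(question)
    # Step 1: one call per passage; each call sees only that passage.
    for doc in documents:
        prompt = (f"Question: {question}\nPassage: {doc}\n"
                  "Extract one relevant fact or say NONE.")
        state.prompts_seen.append(prompt)
        fact = model(prompt)
        if fact.strip() != "NONE":
            state.notes.append(fact)
    # Step 2: the final call sees only the extracted facts, not the passages.
    final = f"Question: {question}\nFacts:\n" + "\n".join(state.notes) + "\nAnswer:"
    state.prompts_seen.append(final)
    return model(final), state


def stub_model(prompt: str) -> str:
    """Offline stub: keeps passages mentioning 'capital', then counts facts."""
    if "Passage:" in prompt:
        passage = prompt.split("Passage: ", 1)[1].split("\n", 1)[0]
        return passage if "capital" in passage else "NONE"
    facts = prompt.split("Facts:\n", 1)[1].rsplit("\nAnswer:", 1)[0]
    return f"Based on {len(facts.splitlines())} fact(s)."


if __name__ == "__main__":
    answer, state = run_program(
        stub_model,
        "What is the capital of France?",
        ["Paris is the capital of France.", "Bananas are yellow."],
    )
    print(answer)
    print(state.notes)
```

Because every prompt is recorded in `prompts_seen`, each step can be inspected and debugged on its own, which is the practical payoff of treating reasoning as modular sub-tasks.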

What makes the no-fine-tuning case stronger is the corpus's skepticism about whether fine-tuning installs reasoning in the first place. RL fine-tuning tends to *sharpen memorization* rather than install genuine procedures: GRPO-trained models show sharp performance drops on out-of-distribution variants of problems they ace in-distribution, which points to template-matching dressed up as reasoning (see "Do fine-tuned language models actually learn optimization procedures?"). And turning a model into an action-taking agent isn't just a retraining job either: it requires transforming the whole pipeline (datasets, action grounding, memory, tools, safety evaluation), with the surrounding harness, not the weights, deciding whether actions are grounded or hallucinated (see "Can you turn an LLM into an agent by just fine-tuning?"). The center of gravity keeps shifting from the model to the system around it.

But external control has real ceilings, and this is the part you might not expect. Models arrive with their *own* innate strategic styles: in a study of 22 models, GPT-o1 defaults to minimax, DeepSeek-R1 to trust-based reasoning, and o3-mini to belief-anticipation, and these profiles track the model, not just the prompt (see "Do large language models use one reasoning style or many?"). So scaffolding steers a model that already has a temperament. Worse, even when you orchestrate the steps, reasoning models tend to *wander*, exploring unsystematically rather than searching, which makes success probability drop exponentially as problems get deeper (see "Why do reasoning LLMs fail at deeper problem solving?"). External structure can elicit and channel latent ability, but it can't conjure a search discipline the model lacks.
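One way to see why the drop is exponential is a toy model, which is an assumption of this sketch and not the cited work's formal analysis: suppose every reasoning step succeeds independently with the same probability `p_step`, so a problem of depth `d` is solved with probability `p_step ** d`. The function name and the example value 0.9 are illustrative.

```python
def success_probability(p_step: float, depth: int) -> float:
    """Probability that all `depth` steps succeed, assuming independent steps.

    Toy model only: real reasoning steps are neither independent nor
    equally difficult.
    """
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be in [0, 1]")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return p_step ** depth


if __name__ == "__main__":
    # Even a 90% per-step success rate decays quickly with depth.
    for depth in (5, 10, 20, 40):
        print(depth, round(success_probability(0.9, depth), 4))
```

Under this assumption, doubling the depth squares the success probability, which is why medium-depth problems can remain tractable while deep ones become nearly unsolvable.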

The deeper limit is what kind of reasoning lives in there to be controlled. When semantic content is stripped from a task, model performance collapses even with the correct rules sitting in context: these models behave as semantic associators, not symbolic logicians (see "Do large language models reason symbolically or semantically?"). So the honest framing is this: you can substantially control reasoning *strategy* without fine-tuning (sequencing, isolation, tool use, which capability gets elicited), but you're conducting an orchestra whose instruments are fixed. Scaffolding redirects existing capability; it doesn't manufacture new kinds of it.
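For readers who want to probe this themselves, the sketch below shows one simple way to strip semantic content from a rule-based task: replace meaningful names with arbitrary symbols while leaving the logical structure untouched. This is an illustrative assumption about how such a test could be built, not the cited work's exact procedure; `decouple` and the `e1`, `e2`, ... naming scheme are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def decouple(text: str, entities: List[str]) -> Tuple[str, Dict[str, str]]:
    """Replace each listed name with an abstract symbol (e1, e2, ...).

    The logical form of the text is preserved; only the meaningful
    vocabulary is removed. Matching is whole-word and case-sensitive.
    """
    if not entities:
        return text, {}
    mapping = {name: f"e{i}" for i, name in enumerate(entities, 1)}
    # Try longer names first so a short name never shadows a longer one.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping


if __name__ == "__main__":
    rule = "If parent(Alice, Bob) then ancestor(Alice, Bob)."
    abstract, mapping = decouple(rule, ["parent", "ancestor", "Alice", "Bob"])
    print(abstract)
    print(mapping)
```

Giving a model both versions of the same rule set and comparing accuracy is one rough way to check how much of its performance depends on familiar semantics rather than on the rules themselves.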

Sources (7 notes)

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME 2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.

Can algorithms control LLM reasoning better than LLMs alone?

LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Can you turn an LLM into an agent by just fine-tuning?

Converting LLMs to action-capable systems requires four distinct stages: curating action-environment-user datasets, training for action grounding, integrating agent infrastructure with memory and tools, and rigorous safety evaluation. The surrounding system and harness determine whether actions are grounded or hallucinated.

Do large language models use one reasoning style or many?

Analysis of 22 LLMs across behavioral game theory reveals three dominant profiles: GPT-o1 uses minimax reasoning, DeepSeek-R1 uses trust-based reasoning, and GPT-o3-mini uses belief-anticipation. Performance correlates with game structure, not raw reasoning depth.

Why do reasoning LLMs fail at deeper problem solving?

Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens (1.76 match)
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey (1.75 match)
- Reasoning Strategies in Large Language Models: Can They Follow, Prefer, and Optimize? (1.73 match)
- Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners (1.72 match)
- Reasoning with Large Language Models, a Survey (1.69 match)
- Are Emergent Abilities in Large Language Models just In-Context Learning? (1.69 match)
- Efficient Tool Use with Chain-of-Abstraction Reasoning (1.68 match)
- LLM Strategic Reasoning: Agentic Study through Behavioral Game Theory (0.91 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM reasoning researcher. The question remains open: **Can you control LLM reasoning strategy without fine-tuning the model?** Assume the findings below are dated claims (2023–2026), not current truth.

**What a curated library found — and when (claims spanning 2023–2026, verify all):**
- Cognitive tools (sandboxed tool calls for discrete reasoning ops) lifted GPT-4.1 from 26.7% → 43.3% on hard math with zero RL fine-tuning; capability was latent, structure was missing (~2025).
- LLM Programs (explicit algorithms managing state, isolating context per step) achieve reasoning control via information hiding and modular orchestration, not weight updates (~2024–2025).
- RL fine-tuning sharpens *memorization*, not reasoning: GRPO-trained models collapse on out-of-distribution variants, revealing template-matching (~2025).
- Innate strategic profiles (GPT-o1 defaults minimax, DeepSeek-R1 trust-based, o3-mini belief-anticipation across 22 models) are model-intrinsic, not prompt-controlled (~2025).
- Reasoning LLMs *wander* rather than search systematically; success drops exponentially with problem depth (~2025).
- When semantics are stripped, performance collapses even with correct rules in context: LLMs are semantic associators, not symbolic reasoners (~2023).

**Anchor papers (verify; mind their dates):**
- arXiv:2305.14825 (2023) — semantic vs. symbolic reasoning foundations.
- arXiv:2506.12115 (2025) — cognitive tools and modular tool-call orchestration.
- arXiv:2502.20432 (2025) — strategic reasoning profiles across models.
- arXiv:2505.20296 (2025) — wandering vs. systematic search in reasoning LLMs.

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, judge whether newer models (o4-class, updated DeepSeek, etc.), better orchestration harnesses (memory/caching/multi-agent), or novel evaluation designs have *relaxed* or *overturned* the ceiling. Separate the durable question (Can you steer strategy without retraining?) from perishable limitations (current models wander; current fine-tuning memorizes). Cite the work that resolved each constraint, and state plainly where constraints still hold.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months that either revives fine-tuning's role in strategy control or demonstrates external scaffolding has hit a hard wall.
(3) **Propose 2 research questions that assume the regime may have moved**: e.g., "Do multi-token reasoning chains + constraint-enforcing beam search overcome wandering?" or "Can retrieval-augmented prompting inject symbolic discipline into semantic reasoners?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
