SYNTHESIS NOTE

Can modular cognitive tools unlock reasoning without training?

Can reasoning capabilities be elicited by structuring LLM calls as isolated cognitive operations—understanding, recalling, examining, and backtracking—rather than through reinforcement learning?

Synthesis note · 2026-02-22 · sourced from Reasoning Architectures

Cognitive architectures in psychology posit that reasoning arises from the orchestrated, sequential execution of modular, predetermined cognitive operations. The Cognitive Tools paper instantiates this in a modern tool-calling framework: four cognitive tools are implemented as discrete functions, each executed by the same LLM in a sandboxed context.

The four cognitive tools:

Understand question: Breaks down the problem by identifying main concepts, extracting relevant information, highlighting properties/theorems/techniques that might help
Recall related: Retrieves related knowledge of similar questions the model knows how to answer — guides reasoning through analogous examples
Examine answer: Self-evaluation of a generated answer
Backtracking: Returns to a prior reasoning state when a path appears unproductive

Unlike standard agentic tools (external APIs, calculators), cognitive tools encapsulate reasoning operations within the LLM itself. Each tool's schema includes a prompt template that isolates a specific cognitive operation; the LLM executes it in sandboxed context and feeds the structured result back into the main reasoning loop.
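The loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `LLM` callable, the `CognitiveTool` class, the prompt wording, the `CALL`/`ANSWER` reply protocol, and the `scripted_llm` stand-in are all assumptions made for the example. The point it shows is that each tool is a separate LLM call with its own prompt, and only the tool's result re-enters the main loop.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]


@dataclass(frozen=True)
class CognitiveTool:
    name: str
    prompt_template: str  # expects {question} and {context} placeholders

    def run(self, llm: LLM, question: str, context: str) -> str:
        # Isolated call: the model sees only this tool's prompt,
        # not the orchestration instructions of the main loop.
        return llm(self.prompt_template.format(question=question, context=context))


# Illustrative templates; the wording is an assumption, not the paper's prompts.
TOOLS: Dict[str, CognitiveTool] = {
    tool.name: tool
    for tool in (
        CognitiveTool("understand_question",
                      "Identify the main concepts and useful techniques.\n"
                      "Question: {question}\nContext: {context}"),
        CognitiveTool("recall_related",
                      "Recall similar problems you know how to solve.\n"
                      "Question: {question}\nContext: {context}"),
        CognitiveTool("examine_answer",
                      "Check the current reasoning and answer for errors.\n"
                      "Question: {question}\nContext: {context}"),
        CognitiveTool("backtracking",
                      "Choose an earlier step to return to and suggest another path.\n"
                      "Question: {question}\nContext: {context}"),
    )
}


def reasoning_loop(llm: LLM, question: str, max_steps: int = 8) -> str:
    trace: List[str] = []  # tool results fed back into the main loop
    for _ in range(max_steps):
        decision = llm(
            f"Question: {question}\n"
            f"Tools: {', '.join(TOOLS)}\n"
            "Trace:\n" + "\n".join(trace) + "\n"
            "Reply with 'CALL <tool>' or 'ANSWER <text>'."
        ).strip()
        if decision.startswith("ANSWER "):
            return decision[len("ANSWER "):]
        name = decision[len("CALL "):].strip() if decision.startswith("CALL ") else ""
        tool = TOOLS.get(name)
        if tool is None:
            trace.append(f"[error] unrecognized reply: {decision}")
            continue
        context = "\n".join(trace)
        trace.append(f"[{name}] {tool.run(llm, question, context)}")
    return "no answer within step budget"


def scripted_llm(prompt: str) -> str:
    # Stand-in for a real model so the sketch runs offline.
    if "Reply with" in prompt:  # main-loop prompt
        if "[understand_question]" in prompt:
            return "ANSWER 4"
        return "CALL understand_question"
    return "The question asks for a sum of two integers."  # tool prompt


print(reasoning_loop(scripted_llm, "What is 2+2?"))  # prints: 4
```

Swapping `scripted_llm` for a real model client would keep the structure intact. How much of the trace each tool should see is a design choice; this sketch passes the whole trace as context.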

Results: with cognitive tools, GPT-4.1 improves from 26.7% to 43.3% pass@1 on AIME2024, approaching o1-preview performance without any RL training. Similar gains appear across closed and open-weight models.
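To make the size of that jump concrete, here is a small arithmetic sketch. AIME 2024 comprises 30 problems (AIME I and II, 15 each). The solved counts of 8 and 13 are an assumption, inferred only because they reproduce the reported percentages; the actual figures may be averages over several runs.

```python
AIME_2024_PROBLEMS = 30  # AIME I + AIME II, 15 problems each


def pass_at_1(solved: int, total: int = AIME_2024_PROBLEMS) -> float:
    """Percentage of problems solved on the first attempt."""
    return 100.0 * solved / total


# Assumed counts (not stated in the note): 8/30 and 13/30 match 26.7% and 43.3%.
baseline = pass_at_1(8)
with_tools = pass_at_1(13)

absolute_gain = with_tools - baseline            # in percentage points
relative_gain = 100.0 * absolute_gain / baseline  # in percent of the baseline

print(f"{baseline:.1f}% -> {with_tools:.1f}%")
print(f"+{absolute_gain:.1f} points, +{relative_gain:.1f}% relative")
```

Under that assumption the gain is about 16.7 percentage points, or roughly five additional problems out of 30.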

The key insight: modularity reduces interference between operations. Cognitive prompting (monolithic structured prompts) improves reasoning but lacks the isolation that makes modular cognitive architectures powerful. A tool-calling implementation enforces the sandboxed execution that pure prompting cannot guarantee.

This provides direct evidence for the note "Do base models already contain hidden reasoning ability?": cognitive tools elicit pre-existing latent capability through structured invocation, not through training. The tool-calling framework is the elicitation mechanism.

There is also a connection to the note "Can structured argument prompts make LLM reasoning more rigorous?": both use structured decomposition of reasoning requirements to improve performance. Cognitive tools generalize this from argumentation-specific structure to domain-general cognitive operations.

Self-Discover as predecessor: Self-Discover (Zhou et al., 2024) is the clearest precursor to cognitive tools. It first composes a reasoning structure in three actions: (1) SELECT relevant atomic reasoning modules from a predefined set (critical thinking, step-by-step thinking, decomposition, etc.), (2) ADAPT the selected modules to the specific task, and (3) IMPLEMENT them as a structured reasoning plan. The model then follows that plan to solve the task. The key difference from cognitive tools: Self-Discover composes a task-specific plan at inference time with only 3 extra inference steps, which is cheaper than the tool-calling loop but less modular. It avoids the overhead of sandboxed execution, while cognitive tools provide stronger isolation between operations.
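The SELECT, ADAPT, IMPLEMENT sequence can be sketched as three chained LLM calls followed by one solving call. This is a rough illustration only: the `LLM` callable, the function names, the module list, and the prompt wording are assumptions, not Self-Discover's actual prompts.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]

# Illustrative subset of atomic reasoning modules (wording is an assumption).
REASONING_MODULES: Sequence[str] = (
    "Use critical thinking to question assumptions.",
    "Think step by step.",
    "Break the problem into smaller sub-problems.",
)


def discover_plan(llm: LLM, task_examples: List[str],
                  modules: Sequence[str] = REASONING_MODULES) -> str:
    """Compose a task-specific reasoning plan in three LLM calls."""
    tasks = "\n".join(task_examples)
    # (1) SELECT: choose the modules relevant to this kind of task.
    selected = llm("SELECT the useful reasoning modules.\nModules:\n"
                   + "\n".join(modules) + "\nTasks:\n" + tasks)
    # (2) ADAPT: rephrase the selected modules for the specific task.
    adapted = llm("ADAPT these modules to the tasks.\nSelected:\n"
                  + selected + "\nTasks:\n" + tasks)
    # (3) IMPLEMENT: turn the adapted modules into a structured plan.
    return llm("IMPLEMENT a structured, step-by-step reasoning plan.\nAdapted:\n"
               + adapted + "\nTasks:\n" + tasks)


def solve(llm: LLM, plan: str, task: str) -> str:
    # The plan is reused as-is; solving adds one call per task instance.
    return llm("Follow this plan to solve the task.\nPlan:\n"
               + plan + "\nTask: " + task)
```

Unlike the tool-calling loop, every call here shares one flat prompt chain, with no per-operation isolation. That is the efficiency-versus-modularity trade described above.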

Related concepts in this collection 5

Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
cognitive tools elicit pre-existing capability without training
Can structured argument prompts make LLM reasoning more rigorous? Does requiring language models to explicitly check warrants, backing, and rebuttals—rather than reasoning freely—improve reasoning quality and catch failures that standard step-by-step prompting misses?
same principle: structured reasoning decomposition improves performance
Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
cognitive tools are an alternative to RL as the elicitation mechanism
Can reasoning and tool execution be truly decoupled? Can LLM reasoning be separated from tool observations to eliminate redundant re-prompting and enable parallel execution? Two recent architectures suggest yes, but what are the tradeoffs?
both use a tool-calling architecture for reasoning; cognitive tools target internal operations, CoA/ReWOO target external calls
Can we automatically optimize both prompts and agent coordination? This explores whether language agents can be represented as computational graphs whose structure and content adapt automatically. Why it matters: current agent systems require hand-engineered orchestration; automatic optimization could unlock more capable multi-agent systems.
cognitive tools are node-level operations within the computational graph framework: understand, recall, examine, and backtrack are function nodes whose composition forms an agent-level reasoning graph; the graph framework suggests these cognitive operations could be automatically optimized and recombined
