Can agents learn to use scaffolding structure the way they learn token weights?
This explores whether the scaffolding around an agent — its memory, skills, and the wiring that connects its steps — can be learned and improved the way a model's weights are, rather than staying a fixed hand-built harness.
This question reads the scaffolding (memory stores, skill libraries, the graph of steps an agent runs) as a second substrate that might be tunable the same way weights are tuned during training. The corpus suggests the answer is yes, and that for much agent work, learning the structure is cheaper and more durable than touching the weights at all. The clearest demonstration is that whole classes of improvement happen with the parameters frozen. AgentFly treats learning as memory operations inside a memory-augmented decision process and reaches strong GAIA scores without changing a single weight (see: Can agents learn continuously from experience without updating weights?). Reflexion shows an agent writing verbal post-mortems into episodic memory and improving across attempts on nothing more than a success/failure signal (see: Can agents learn from failure without updating their weights?). In both, the thing being updated is the scaffold, not the network.
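The frozen-parameter pattern can be made concrete with a toy loop. This is a minimal sketch in the spirit of Reflexion, not its actual implementation: `frozen_policy`, `environment`, and `reflect` are hypothetical stand-ins, and the "post-mortem" is reduced to a short tag. The point it illustrates is that the policy function never changes between attempts; only the episodic memory it reads does.

```python
from typing import List, Tuple

STRATEGIES = ["guess", "search", "decompose"]  # hypothetical action set


def frozen_policy(task: str, reflections: List[str]) -> str:
    """Stand-in for a frozen LLM: picks the first strategy memory has not ruled out."""
    for strategy in STRATEGIES:
        if f"avoid:{strategy}" not in reflections:
            return strategy
    return STRATEGIES[-1]


def environment(action: str) -> bool:
    """Binary success/failure signal; only 'decompose' succeeds in this toy task."""
    return action == "decompose"


def reflect(action: str) -> str:
    """Verbal post-mortem, reduced here to a tag the policy can read back."""
    return f"avoid:{action}"


def run_episodes(task: str, max_attempts: int = 5) -> Tuple[List[Tuple[str, bool]], List[str]]:
    memory: List[str] = []                 # episodic memory: the only thing that is updated
    history: List[Tuple[str, bool]] = []
    for _ in range(max_attempts):
        action = frozen_policy(task, memory)
        succeeded = environment(action)
        history.append((action, succeeded))
        if succeeded:
            break
        memory.append(reflect(action))     # learning step: write to memory, not to weights
    return history, memory


history, memory = run_episodes("toy task")
print(history)
print(memory)
```

The behavior improves across attempts even though `frozen_policy` is the same function on every call, which is the sense in which the scaffold, rather than the network, is what learns.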
The most direct answer to your literal question, whether structure can be learned *the way* weights are, comes from treating the scaffold as an optimizable object with its own gradient-like objective. Language agents can be written as computational graphs where nodes are operations and edges are information flow, which reveals that techniques like chain-of-thought, tree-of-thought, and Reflexion are formally the same shape. Once you have that representation, you can automatically optimize both the prompts inside nodes and the connections between them (see: Can we automatically optimize both prompts and agent coordination?). That is scaffolding-as-learnable-parameters in the most literal sense. The same move shows up in retrieval: instead of reading a whole graph, an agent can learn a traversal *policy* over it with MCTS and reinforcement learning, so the navigation structure itself becomes the trained artifact (see: Can learned traversal policies beat exhaustive graph reading?).
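A small sketch may help show what "optimizing prompts and connections" means once an agent is a graph. Everything here is an assumption made for illustration: `call_model` is a hypothetical stand-in for an LLM call, the three-node graph and the `utility` function are invented, and the optimizer is a plain exhaustive search rather than whatever method the cited work uses. It only demonstrates that node prompts and edge sets can be treated as one search space.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Tuple

Edge = Tuple[str, str]  # (source node, destination node)


@dataclass
class Node:
    prompt: str
    op: Callable[[str, List[str]], str]  # (prompt, upstream outputs) -> output


def call_model(prompt: str, upstream: List[str]) -> str:
    """Hypothetical stand-in for an LLM call: echoes the prompt and its inputs."""
    return f"{prompt}[{'|'.join(upstream)}]"


ORDER = ["draft", "critique", "answer"]  # fixed topological order
FIXED_EDGES: List[Edge] = [("draft", "answer")]
OPTIONAL_EDGES: List[Edge] = [("draft", "critique"), ("critique", "answer")]
DRAFT_PROMPTS = ["answer directly", "think step by step"]


def run_graph(nodes: Dict[str, Node], edges: List[Edge]) -> Dict[str, str]:
    """Execute nodes in order, feeding each one the outputs of its upstream nodes."""
    outputs: Dict[str, str] = {}
    for name in ORDER:
        upstream = [outputs[src] for src, dst in edges if dst == name]
        outputs[name] = nodes[name].op(nodes[name].prompt, upstream)
    return outputs


def utility(answer: str) -> int:
    """Toy objective: reward step-by-step reasoning and a critique that reaches the answer."""
    return answer.count("step by step") + int("critique" in answer)


def optimize() -> Tuple[int, str, List[Edge]]:
    """Search jointly over one node's prompt and the optional edges."""
    best_score, best_prompt, best_edges = -1, "", []
    masks = product([False, True], repeat=len(OPTIONAL_EDGES))
    for prompt, mask in product(DRAFT_PROMPTS, masks):
        nodes = {
            "draft": Node(prompt, call_model),
            "critique": Node("critique", call_model),
            "answer": Node("finalize", call_model),
        }
        edges = FIXED_EDGES + [e for e, keep in zip(OPTIONAL_EDGES, mask) if keep]
        score = utility(run_graph(nodes, edges)["answer"])
        if score > best_score:
            best_score, best_prompt, best_edges = score, prompt, edges
    return best_score, best_prompt, best_edges


print(optimize())
```

In this toy the search keeps the step-by-step prompt and switches on both optional edges, so the wiring is selected by the same objective that selects the prompt. A real system would replace the exhaustive loop with a scalable optimizer and the utility with task performance.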
Where it gets interesting is that scaffolding and weights turn out to be partly interchangeable, not just parallel. LatentSkill compiles textual skills into LoRA adapters via a hypernetwork: the same competence that lives as plaintext in context can be moved into weight-space, cutting prefill tokens by 64–72% and, crucially, becoming composable through parameter arithmetic (see: Can skills work better as weights than as prompts?). So the boundary between 'learned as structure' and 'learned as weights' is a design dial, not a hard line. VOYAGER sits on the structural side of that dial: it stores executable skills in an embedding-indexed library and builds complex skills from simple ones, getting lifelong learning precisely *because* it avoids weight updates and the catastrophic forgetting they cause (see: Can agents learn new skills without forgetting old ones?).
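The structural side of that dial can be sketched as a tiny skill library. This is an illustrative toy, not VOYAGER's code: `SkillLibrary`, the bag-of-words `embed` function, and the three skills are all hypothetical, and a real system would use a learned text encoder and generated programs. It shows the two properties the paragraph relies on: skills are retrieved by embedding similarity, and new skills are built by calling stored ones, so adding a skill never overwrites an old one.

```python
import math
from collections import Counter
from typing import Callable, Dict, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; an assumption standing in for a learned encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SkillLibrary:
    """Executable skills indexed by the embedding of their description."""

    def __init__(self) -> None:
        self._skills: Dict[str, Tuple[Counter, Callable]] = {}

    def add(self, name: str, description: str, fn: Callable) -> None:
        self._skills[name] = (embed(description), fn)  # append-only: old skills stay intact

    def retrieve(self, query: str) -> str:
        """Return the name of the skill whose description is closest to the query."""
        q = embed(query)
        return max(self._skills, key=lambda name: cosine(q, self._skills[name][0]))

    def call(self, name: str, *args):
        return self._skills[name][1](*args)


lib = SkillLibrary()
lib.add("mine_wood", "collect wood logs from a tree", lambda: ["log"])
lib.add("craft_planks", "craft planks from wood logs",
        lambda logs: ["plank"] * (4 * len(logs)))


def craft_table() -> str:
    """Composite skill: built from two simpler skills already in the library."""
    planks = lib.call("craft_planks", lib.call("mine_wood"))
    return "table" if len(planks) >= 4 else "nothing"


lib.add("craft_table", "build a crafting table from planks", craft_table)

skill = lib.retrieve("build a crafting table")
print(skill, lib.call(skill))
```

Because the library only grows, the earlier skills behave exactly as before after `craft_table` is added, which is the toy analogue of avoiding catastrophic forgetting.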
The payoff of learning the scaffold is that it dodges the ceilings that weight training runs into. Agents trained only on static expert demonstrations are capped at whatever the curators imagined, because they never interact and never learn from their own failures (see: Can agents learn beyond what their training data shows?). Structure-learning routes around that: self-play loops can manufacture the missing feedback themselves, co-evolving natural-language skills as a Challenger raises difficulty and a Judge issues verdicts, with skills edited in language and no weights touched (see: Can language models learn skills without human supervision?). And the structure can fold and reorganize itself over time, with agents compressing their own interaction history into episodic, working, and tool schemas rather than carrying raw logs (see: Can agents compress their own memory without losing critical details?).
The thing you didn't know you wanted to know: 'weights' and 'scaffolding' aren't two answers to your question — they're two ends of one continuum the field is actively learning to slide along. The same skill can be a prompt, a memory entry, a graph edge, or a LoRA adapter, and the open research question is no longer *whether* structure can be learned but *which substrate* a given competence should be compiled into.
Sources (9 notes)
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
Language agents represented as computational graphs—where nodes are operations and edges define information flow—reveal that CoT, ToT, and Reflexion are formally equivalent structures. This unified view enables automatic optimization of both node prompts and edge connectivity without manual redesign.
Graph-O1 replaces whole-graph ingestion with step-by-step agentic navigation using Monte Carlo Tree Search and reinforcement learning. This approach fits within LLM context windows while learning domain-specific traversal policies, though it trades certainty about the full graph for decision-making under uncertainty.
LatentSkill uses a hypernetwork to convert textual agent skills into plug-and-play LoRA adapters, reducing prefill tokens by 64–72% while maintaining or beating in-context baselines. Weight-space skills form composable semantic structures that can be scaled and combined through parameter arithmetic.
VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.
Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.
Ctx2Skill's three-role self-play loop manufactures missing feedback through internal signals: the Challenger escalates difficulty as curriculum, the Judge gives binary verdicts as reward, and both sides evolve via natural-language skill edits. Success requires balancing adversarial pressure against a generalization safeguard to prevent collapse.
DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.