INQUIRING LINE

Can memory-based adaptation and gradient fine-tuning operate on complementary timescales?

This explores whether two ways of updating an AI — fast, reversible memory (storing experiences or context) versus slow, permanent weight changes (gradient fine-tuning) — can be split so each handles a different speed of learning, instead of competing.


The corpus answers with a fairly confident yes, and the most direct evidence is an architecture that builds the split into the model itself. Titans separates fast, quadratic attention (short-term: what is happening right now) from a compressed neural memory that retains surprising tokens over the long haul (see "Can neural memory modules scale language models beyond attention limits?"). The same two-clock idea shows up explicitly in training: Fast-Slow Training routes task-specific lessons into fast, editable prompts while keeping slow parameter updates minimal, reaching the same performance 1.4–3x faster with much less catastrophic forgetting (see "Can splitting adaptation into two channels reduce forgetting?"). The striking framing there is that forgetting is not an inherent cost of learning; it is a *misallocation* problem, the result of writing into slow weights what belonged in fast context.


Sources (8 notes)

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
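A minimal sketch of surprise-gated memory writes, loosely modeled on the idea summarized above. It is not the Titans implementation: the class name `SurpriseMemory`, the linear memory, and the hyperparameter values are all assumptions chosen for illustration. Surprise is taken to be the prediction error of the memory on an incoming key/value pair, and the write combines a momentum term with a small forgetting (decay) term.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


class SurpriseMemory:
    """Toy linear associative memory: predicts a scalar value as w . key.

    Hypothetical illustration only; all defaults are assumptions.
    """

    def __init__(self, dim, lr=0.1, momentum=0.5, decay=0.01):
        self.w = [0.0] * dim   # long-term memory parameters
        self.s = [0.0] * dim   # running "surprise" (momentum of past gradients)
        self.lr = lr
        self.momentum = momentum
        self.decay = decay

    def read(self, key):
        return dot(self.w, key)

    def write(self, key, value):
        """Update memory in proportion to how surprising (key, value) is.

        Returns the absolute prediction error before the update.
        """
        err = self.read(key) - value
        # Gradient of the squared error (w . key - value)^2 with respect to w.
        grad = [2.0 * err * k for k in key]
        # Momentum carries past surprise forward; the gradient adds new surprise.
        self.s = [self.momentum * s - self.lr * g for s, g in zip(self.s, grad)]
        # Decay slowly forgets old content; the surprise term writes new content.
        self.w = [(1.0 - self.decay) * w + s for w, s in zip(self.w, self.s)]
        return abs(err)
```

Writing the same pair repeatedly should make it progressively less surprising, so later writes change the memory less, while keys that were never written stay untouched. The decay term means the error settles near zero rather than exactly at zero.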

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.
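The misallocation framing can be illustrated with a toy numerical example. This is not Fast-Slow Training itself (which works with optimized prompts and model parameters); it is a sketch under the assumption that each task can be reduced to a single scalar target. The function names, learning rates, and targets are hypothetical. A single-channel learner writes everything into one shared parameter, while the two-channel learner keeps a slowly updated shared parameter plus a fast, per-task context value.

```python
def train_single(tasks, steps=200, lr=0.1):
    """One shared parameter trained on each task in sequence."""
    theta = 0.0
    for target in tasks.values():
        for _ in range(steps):
            theta -= lr * 2.0 * (theta - target)  # gradient of (theta - target)^2
    return theta


def train_fast_slow(tasks, steps=200, slow_lr=0.001, fast_lr=0.1):
    """Shared slow parameter plus a fast, per-task context value.

    The prediction for a task is theta + context[task].
    """
    theta = 0.0
    context = {}
    for name, target in tasks.items():
        c = context.get(name, 0.0)
        for _ in range(steps):
            err = theta + c - target
            theta -= slow_lr * 2.0 * err  # slow channel: tiny shared update
            c -= fast_lr * 2.0 * err      # fast channel: task-specific update
        context[name] = c
    return theta, context


# Illustrative targets (made up): task A wants 3.0, task B wants -1.0.
tasks = {"A": 3.0, "B": -1.0}
single_theta = train_single(tasks)
theta, context = train_fast_slow(tasks)

forgetting_single = abs(single_theta - tasks["A"])
forgetting_fast_slow = abs(theta + context["A"] - tasks["A"])
```

In this toy setup the single-channel learner ends up fitted to task B and has lost task A entirely, while the two-channel learner still predicts task A closely because most of the task-specific signal went into the fast context and the shared parameter barely moved.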

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
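As a rough sketch of what learning through memory operations alone can look like, the following toy case memory stores past episodes and retrieves the most similar ones for a new task. It is an assumption-laden illustration, not AgentFly's design: the `CaseBank` class, the word-overlap similarity, and the example tasks are all hypothetical, and the subtask and tool memories are not modeled.

```python
class CaseBank:
    """Toy case memory: write (task, plan, reward) records, read by similarity."""

    def __init__(self):
        self.cases = []

    def write(self, task, plan, reward):
        """Append one finished episode; no model parameters are touched."""
        self.cases.append({"task": task, "plan": plan, "reward": reward})

    @staticmethod
    def _similarity(a, b):
        """Jaccard overlap between the lowercase word sets of two strings."""
        wa, wb = set(a.lower().split()), set(b.lower().split())
        union = wa | wb
        return len(wa & wb) / len(union) if union else 0.0

    def read(self, task, k=1):
        """Return the k stored cases most similar to the new task."""
        ranked = sorted(
            self.cases,
            key=lambda case: self._similarity(task, case["task"]),
            reverse=True,
        )
        return ranked[:k]


bank = CaseBank()
bank.write("look up the population of Lyon", "search then read infobox", 1.0)
bank.write("sort a list of numbers", "call sorted()", 1.0)
nearest = bank.read("find the population of Paris", k=1)
```

The retrieved case would then be placed in the agent's prompt as guidance, so behavior changes from episode to episode while the underlying model stays frozen.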

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.
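Drift from the base model is commonly quantified with KL divergence between output distributions. The sketch below shows the arithmetic behind a statement like "X% closer to base", assuming drift is measured as KL(tuned || base) over a small discrete distribution. The probability values are made up for illustration and are not taken from the source; the function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def relative_reduction(kl_a, kl_b):
    """Fraction by which drift kl_a is smaller than drift kl_b."""
    return 1.0 - kl_a / kl_b


# Illustrative next-token distributions over three tokens (made-up numbers).
base = [0.5, 0.3, 0.2]
low_drift = [0.55, 0.27, 0.18]    # stands in for a model that stayed near base
high_drift = [0.85, 0.10, 0.05]   # stands in for a model that moved far from base

kl_low = kl_divergence(low_drift, base)
kl_high = kl_divergence(high_drift, base)
closer_by = relative_reduction(kl_low, kl_high)
```

Here `closer_by` is the share of drift avoided by the low-drift model relative to the high-drift one, which is the kind of quantity a "70% closer" claim summarizes.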

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
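The decoding-time shift can be sketched as simple logit arithmetic: the large base model's logits are adjusted by the difference between a small tuned model and its untuned counterpart, and the result is renormalized. The function and argument names below are hypothetical, and the three-token logits in the usage note are made-up numbers.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(base_logits, expert_logits, anti_expert_logits):
    """Shift base logits by (tuned small model - untuned small model).

    The base model's weights are never modified; only its output
    distribution at decoding time is steered.
    """
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, anti_expert_logits)
    ]
    return softmax(shifted)
```

For example, with base logits `[2.0, 1.0, 0.0]`, a tuned small model that raises the third token's logit relative to its untuned version will raise that token's probability in the combined distribution. If the tuned and untuned small models agree, the base distribution is returned unchanged.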

Can models dynamically activate expert skills at inference time?

Transformer2 demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.
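A minimal sketch of the singular-value idea, assuming a weight matrix is already available in decomposed form W = U diag(sigma) V^T. An expert vector z rescales the singular values, and several expert vectors can be blended by a weighted sum before the matrix is rebuilt. The helper names, the 2x2 example, and the mixing weights are all hypothetical; how the weights are chosen at inference is not modeled here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(row) for row in zip(*m)]


def rebuild(u, sigma, v, z):
    """Return U diag(sigma * z) V^T, where z rescales each singular value."""
    n = len(sigma)
    scaled = [[sigma[i] * z[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    return matmul(matmul(u, scaled), transpose(v))


def mix_experts(expert_vectors, weights):
    """Blend expert vectors with a weighted sum, one weight per expert."""
    dim = len(expert_vectors[0])
    return [sum(w * z[i] for w, z in zip(weights, expert_vectors)) for i in range(dim)]


# Toy decomposition (identity U and V keep the arithmetic easy to follow).
identity = [[1.0, 0.0], [0.0, 1.0]]
sigma = [3.0, 1.0]
math_expert = [2.0, 1.0]   # hypothetical expert vectors
code_expert = [1.0, 0.5]

z = mix_experts([math_expert, code_expert], [0.5, 0.5])
adapted = rebuild(identity, sigma, identity, z)
```

Only the vector z changes between tasks, so each expert costs one number per singular value, and setting z to all ones recovers the original matrix.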

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher evaluating claims about fast-slow learning in language models. The question: Can memory-based adaptation and gradient fine-tuning operate on genuinely complementary timescales, or do they ultimately collapse into a single learning dynamic?

What a curated library found — and when (dated claims, not current truth):
Findings span Oct 2023–May 2026. A library reports:
• Titans (2024-12) separates quadratic attention (fast, in-context) from compressed neural memory (slow, persistent across sequences), achieving complementary speeds within a single forward pass.
• Fast-Slow Training (2025-01) routes task-specific lessons into fast textual prompts while keeping slow parameter updates minimal, reaching 1.4–3× speedup with reduced catastrophic forgetting.
• RL fine-tuning (2025-05) modifies only small subnetworks (<5% of parameters typically) while leaving most weights dormant, suggesting a natural speed hierarchy in gradient updates.
• Continuous updating (2026-05) shows "useful memories become faulty when continuously updated," implying fast and slow systems may interfere if not properly isolated.
• Curriculum learning (2025-08) pairs long pre-training with short reasoning steps, suggesting training-time separation of timescales may aid downstream adaptation.

Anchor papers (verify; mind their dates):
• Titans (arXiv:2501.00663, 2024-12)
• Fast-Slow Training (arXiv:2501.06252, 2025-01)
• RL Finetunes Subnetworks (arXiv:2505.11711, 2025-05)
• Learning, Fast and Slow (arXiv:2605.12484, 2026-05)

Your task:
(1) RE-TEST THE ISOLATION CLAIM. For each architecture above, probe whether the claimed separation (fast vs. slow) is enforced by structure, loss weighting, or parameter freezing — or merely emergent. Have newer methods (post-2026-06) shown genuine interference when boundaries relax? Where does interference persist and why?
(2) Surface papers from the last 6 months that show fast and slow mechanisms *converging* or *trading off*, especially under scaling or distribution shift. Does the Farther-the-Shift paper (2026-03) suggest the split breaks down OOD?
(3) Propose two research questions that assume we've moved beyond simple two-clock models: (a) Do three or more timescales emerge naturally in larger models, and how are they discovered? (b) Can we predict *a priori* which task properties demand multi-timescale learning vs. single unified adaptation?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines