SYNTHESIS NOTE

Can length generalization transfer between different related tasks?

Can a model trained on longer sequences in one task learn to handle longer inputs in a related task, without ever being trained on that task at those lengths? This matters for understanding how neural networks reuse computational strategies across problems.

Synthesis note · 2026-02-23 · sourced from Context Engineering

The "Extrapolation by Association" paper demonstrates a specific mechanism for out-of-distribution generalization: length generalization — the ability to handle longer inputs than seen during training — can transfer from one task to another.

The setup: train a model jointly on multiple related tasks, where an "auxiliary task" uses longer inputs and a "main task" uses shorter inputs. The finding: the main task generalizes to the lengths seen in the auxiliary task, even though it was never trained at those lengths. This holds across arithmetic operations, string transformations, and maze navigation, three diverse algorithmic domains whose tasks share an underlying structural similarity.
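The joint-training setup can be sketched as a data mixture in which the two tasks differ only in their length range. This is a minimal illustration, not the paper's pipeline: the toy tasks ("reverse" as the main task, "copy" as the auxiliary task), the prompt format, and the function names are all assumptions made for the example.

```python
import random

def make_example(task, length, rng):
    """Build one (prompt, target) pair for a toy string task (hypothetical tasks)."""
    s = "".join(rng.choice("abcd") for _ in range(length))
    if task == "reverse":
        target = s[::-1]
    elif task == "copy":
        target = s
    else:
        raise ValueError(f"unknown task: {task}")
    return f"{task}:{s}=", target

def build_joint_dataset(n_per_task, main_max_len, aux_max_len, seed=0):
    """Mix a short-input main task with a longer-input auxiliary task."""
    rng = random.Random(seed)
    data = []
    for task, max_len in (("reverse", main_max_len), ("copy", aux_max_len)):
        for _ in range(n_per_task):
            length = rng.randint(1, max_len)
            # Each row: (task, length, prompt, target)
            data.append((task, length) + make_example(task, length, rng))
    rng.shuffle(data)
    return data

def build_eval_set(n, length, seed=1):
    """Main-task examples at a length seen only in the auxiliary task."""
    rng = random.Random(seed)
    return [make_example("reverse", length, rng) for _ in range(n)]

# Main task trained up to length 8, auxiliary task up to length 16.
train = build_joint_dataset(n_per_task=200, main_max_len=8, aux_max_len=16)
# Transfer is then measured on the main task at the auxiliary length.
held_out = build_eval_set(n=50, length=16)
```

Under these assumptions, transfer would show up as main-task accuracy on `held_out` that exceeds what a model trained on the main task alone achieves at that length.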

The mechanistic evidence is specific: length generalization transfer correlates with the reuse of the same attention heads between tasks. This suggests the model does not learn separate length-handling circuitry per task. Instead, it develops shared computational infrastructure for the length dimension, and that infrastructure transfers because the related tasks route through the same attention heads.
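One simple way to quantify head reuse between two tasks is the overlap of their most important heads. The sketch below is an assumption for illustration, not the metric used in the paper: it presumes per-head importance scores are already available (for example from ablating each head and measuring the accuracy drop) and compares the top-k sets with a Jaccard index. The scores shown are made up.

```python
def top_k_heads(importance, k):
    """Return the k highest-scoring heads; importance maps (layer, head) -> score."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return set(ranked[:k])

def head_overlap(importance_a, importance_b, k):
    """Jaccard overlap between the top-k heads of two tasks (0 = disjoint, 1 = identical)."""
    a = top_k_heads(importance_a, k)
    b = top_k_heads(importance_b, k)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical importance scores for a 2-layer, 2-head model.
main_task = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.8, (1, 1): 0.2}
aux_task = {(0, 0): 0.7, (0, 1): 0.6, (1, 0): 0.1, (1, 1): 0.05}

overlap = head_overlap(main_task, aux_task, k=2)
```

Computing such an overlap for many task pairs and comparing it against the measured length transfer is one way the correlation described above could be checked.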

The pretrained-model finding extends this further: pretrained language models already exhibit similar transfer effects, suggesting that pretraining equips models with "reusable computational scaffolding" that facilitates extrapolation in downstream settings. The scaffolding is not task-specific — it is a general capability for processing longer sequences that was acquired during pretraining and can be activated by fine-tuning on related tasks.

This connects to "Do base models already contain hidden reasoning ability?" through a shared principle: pretraining installs capabilities that later training surfaces rather than creates. The base model already has the computational scaffolding for length handling; the auxiliary task merely activates it for the main task.

The connection to "Do neural networks naturally learn modular compositional structure?" is direct: attention head reuse across tasks is a specific instance of modular subnetwork sharing. The decomposition into reusable modules happens naturally, and pretraining encourages it; this is the compositional generalization thesis applied to the length dimension.

"Can neural networks learn compositional skills without symbolic mechanisms?" asks whether scaling alone can enable compositional generalization. If it can, length generalization may follow the same scaling trajectory, with more data and larger models producing more transferable attention head circuits. The practical implication: training on a diverse set of related tasks at varying lengths may be more efficient than training each task independently at the target length.

Inquiring lines that read this note 13

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How do self-generated feedback mechanisms enable effective model learning?

Does extended exoskeleton use eventually produce meaningful skill transfer?

What determines success in training models on multiple tasks?

Does model scaling alone produce compositional generalization without symbolic mechanisms?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

What are the consequences of models training on synthetic data?

Why does the same training data produce different gains across models?

What role does compression play in language model capability and generalization?

Can compression length really indicate how well a model generalizes?

Related concepts in this collection 3

Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
shared principle: pretraining installs capabilities that later training surfaces
Do neural networks naturally learn modular compositional structure? Explores whether neural networks decompose compositional tasks into distinct subroutines without explicit symbolic design. This challenges the longstanding view that neural networks are fundamentally non-compositional.
attention head reuse is a concrete instance of modular subnetwork sharing
Can neural networks learn compositional skills without symbolic mechanisms? Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.
length generalization may share the same scaling dynamics

Original note title

length generalization transfers across related tasks via shared attention head reuse — pretraining provides reusable computational scaffolding
