SYNTHESIS NOTE
Model Architecture and Internals

Can length generalization transfer between different related tasks?

Can a model trained on longer sequences in one task handle longer inputs in a related task, even though it never saw that task at those lengths? This matters for understanding how neural networks reuse computational strategies across problems.

Synthesis note · 2026-02-23 · sourced from Context Engineering

The "Extrapolation by Association" paper demonstrates a specific mechanism for out-of-distribution generalization: length generalization — the ability to handle longer inputs than seen during training — can transfer from one task to another.

The setup: train multiple related tasks jointly, where an "auxiliary task" uses longer inputs and a "main task" uses shorter inputs. The finding: the main task generalizes to the length of the longer auxiliary task, even though it was never trained at that length. This works across arithmetic operations, string transformations, and maze navigation — diverse algorithmic domains sharing an underlying structural similarity.
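The setup above can be sketched as a data-mixing step. This is a minimal illustration under stated assumptions, not the paper's code: the toy tasks (`copy` as the auxiliary task, `reverse` as the main task), the `task:input` prompt format, and every function name are hypothetical choices made for the example.

```python
import random

ALPHABET = "abcd"  # assumption: a tiny toy vocabulary


def make_example(task, length, rng):
    """Build one (prompt, target) pair for a toy string task."""
    s = "".join(rng.choice(ALPHABET) for _ in range(length))
    if task == "reverse":
        target = s[::-1]
    elif task == "copy":
        target = s
    else:
        raise ValueError(f"unknown task: {task}")
    return f"{task}:{s}", target


def build_joint_dataset(n_per_task, main_max_len, aux_max_len, seed=0):
    """Mix short main-task examples with longer auxiliary-task examples."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_per_task):
        # Main task: only ever seen at lengths up to main_max_len.
        data.append(make_example("reverse", rng.randint(1, main_max_len), rng))
        # Auxiliary task: seen at lengths up to the longer aux_max_len.
        data.append(make_example("copy", rng.randint(1, aux_max_len), rng))
    rng.shuffle(data)
    return data


def build_main_eval(n, main_max_len, aux_max_len, seed=1):
    """Main-task examples at lengths never seen for that task in training."""
    rng = random.Random(seed)
    return [
        make_example("reverse", rng.randint(main_max_len + 1, aux_max_len), rng)
        for _ in range(n)
    ]


train_set = build_joint_dataset(n_per_task=50, main_max_len=5, aux_max_len=10)
eval_set = build_main_eval(n=20, main_max_len=5, aux_max_len=10)
```

Transfer, in the sense described here, would show up as high main-task accuracy on `eval_set` even though every `reverse` example in `train_set` has length 5 or less.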

The mechanistic evidence is specific: length generalization transfer correlates with the reuse of the same attention heads between tasks. This suggests that the model does not learn separate length-handling circuitry per task. Instead, it appears to develop shared computational infrastructure that handles the length dimension, and this infrastructure transfers because the related tasks route through the same attention heads.
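One way to make "reuse of the same attention heads" concrete is to compare the sets of heads that matter most for each task. The sketch below is an assumption about how such a comparison could be set up, not the paper's metric: the per-head importance scores are taken as given (for example from head ablation), and `top_heads`, `head_overlap`, and the numbers are hypothetical.

```python
def top_heads(importance, k):
    """Return the k (layer, head) pairs with the highest importance score."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return set(ranked[:k])


def head_overlap(importance_a, importance_b, k):
    """Jaccard overlap between the top-k heads of two tasks (0.0 to 1.0)."""
    a = top_heads(importance_a, k)
    b = top_heads(importance_b, k)
    return len(a & b) / len(a | b)


# Illustrative, made-up scores keyed by (layer, head); not measured data.
main_task_scores = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.8, (1, 1): 0.2}
aux_task_scores = {(0, 0): 0.7, (0, 1): 0.6, (1, 0): 0.1, (1, 1): 0.05}

overlap = head_overlap(main_task_scores, aux_task_scores, k=2)
```

Under this framing, the reported correlation would mean that task pairs with a higher overlap score also tend to show stronger length generalization transfer.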

The pretrained-model finding extends this further: pretrained language models already exhibit similar transfer effects, suggesting that pretraining equips models with "reusable computational scaffolding" that facilitates extrapolation in downstream settings. The scaffolding is not task-specific — it is a general capability for processing longer sequences that was acquired during pretraining and can be activated by fine-tuning on related tasks.

This connects to the note "Do base models already contain hidden reasoning ability?" through a shared principle: pretraining installs capabilities that later training surfaces rather than creates. The base model already has the computational scaffolding for length handling; the auxiliary task merely activates it for the main task.

The connection to the note "Do neural networks naturally learn modular compositional structure?" is direct: attention head reuse across tasks is a specific instance of modular subnetwork sharing. The decomposition into reusable modules happens naturally, and pretraining encourages it. This is the compositional generalization thesis applied to the length dimension.

The note "Can neural networks learn compositional skills without symbolic mechanisms?" points to a scaling trajectory, and length generalization may follow the same one: more data and larger models would produce more transferable attention head circuits. The practical implication: training on a diverse set of related tasks at varying lengths may be more efficient than training each task independently at the target length.

Original note title

length generalization transfers across related tasks via shared attention head reuse — pretraining provides reusable computational scaffolding