Can we predict out-of-distribution generalization without access to downstream tasks?
This explores whether you can forecast how a model behaves outside its training distribution, and where it will fail, from its architecture and training signature alone, rather than by testing it on the downstream tasks. The corpus offers an encouraging answer: in several cases, yes, because failure is structured, not random. The strongest example reframes an LLM as an autoregressive probability machine and predicts, ahead of any task-specific evaluation, that prompts demanding low-probability outputs will be hard even when they are logically trivial. Experiments confirmed this on tasks such as reciting the alphabet backwards or counting letters (see "Can we predict where language models will fail?"). The lever there is the *computational level*: characterize what kind of machine the model is, and you can anticipate the shape of its out-of-distribution behavior without enumerating tasks.
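To make the low-probability argument concrete, here is a minimal sketch that scores a target output by its average log-probability under an autoregressive model. Everything in it is an assumption for illustration: a character bigram model with add-one smoothing stands in for an LLM, and the training text is a made-up corpus containing only the forward alphabet. It is not the method used in the cited work.

```python
import math
from collections import Counter

def train_bigram(text, vocab, alpha=1.0):
    """Return a function giving log P(next | prev) with add-alpha smoothing."""
    pair_counts = Counter(zip(text, text[1:]))
    prev_counts = Counter(text[:-1])
    vocab_size = len(vocab)

    def logprob(prev, nxt):
        numerator = pair_counts[(prev, nxt)] + alpha
        denominator = prev_counts[prev] + alpha * vocab_size
        return math.log(numerator / denominator)

    return logprob

def mean_logprob(seq, logprob):
    """Average per-transition log-probability of seq under the model."""
    steps = list(zip(seq, seq[1:]))
    return sum(logprob(p, n) for p, n in steps) / len(steps)

alphabet = "abcdefghijklmnopqrstuvwxyz"
corpus = (alphabet + " ") * 50  # hypothetical training text: forward alphabet only
vocab = set(corpus)
lp = train_bigram(corpus, vocab)

# The two targets are equally simple logically, but not equally probable.
forward_score = mean_logprob(alphabet, lp)
backward_score = mean_logprob(alphabet[::-1], lp)
print(forward_score > backward_score)
```

The backward alphabet scores lower because none of its transitions were seen in training, even though reversing a sequence is trivial as a rule. With a real model the same comparison would use the model's own token log-probabilities for the target response.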
A second line of work says generalization decays *predictably* once you leave the training distribution. Chain-of-thought reasoning doesn't fall off a cliff at random; it degrades systematically as you shift task, length, or format, producing fluent-but-invalid logic. That regularity could in principle be measured as distributional distance rather than by collecting downstream labels (see "Does chain-of-thought reasoning actually generalize beyond training data?"). That predictability is exactly what a task-free predictor needs: if the degradation curve is a function of how far the input drifts, distance becomes the proxy for performance.
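One way to picture "distance as a proxy" is to compute a divergence between the training data and a candidate input set, with no labels involved. The sketch below uses Jensen-Shannon divergence over unigram token frequencies. The choice of divergence, the whitespace tokenization, and the three toy texts are all assumptions made for illustration; the source does not specify how distance should be measured.

```python
import math
from collections import Counter

def unigram_dist(tokens, vocab):
    """Relative frequency of every vocab token in the token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: counts[t] / total for t in vocab}

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical inputs, at most 1."""
    def kl(a, m):
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)
    mixture = {t: 0.5 * (p[t] + q[t]) for t in p}
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)

# Hypothetical training text and two unlabeled candidate inputs.
train = "add two numbers then add two more numbers".split()
near = "add two more numbers then add numbers".split()
far = "reverse the list then sort the list".split()

vocab = set(train) | set(near) | set(far)
p_train = unigram_dist(train, vocab)
d_near = js_divergence(p_train, unigram_dist(near, vocab))
d_far = js_divergence(p_train, unigram_dist(far, vocab))
print(d_near < d_far)
```

If degradation really is a function of drift, inputs like `far` would be flagged as risky before any task is run on them. Unigram frequencies are a crude stand-in; shifts in length or format would need features that capture those dimensions.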
The more interesting twist is that the corpus disagrees about *what* you'd even measure. One result argues that the internal structure carries the signal: networks decompose compositional tasks into isolated, prunable subnetworks, and pretraining makes that modular scaffolding more consistent, so the capacity to recombine for novel inputs is visible in the weights, not just the outputs (see "Do neural networks naturally learn modular compositional structure?"). Relatedly, length generalization isn't a per-task lottery; it transfers because related tasks reuse shared attention heads already present in the pretrained model (see "Can length generalization transfer between different related tasks?"). If the reusable machinery already exists, you can reason about transfer to unseen lengths without running them.
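The ablation logic behind the subnetwork claim can be shown on a toy network: switch off one group of hidden units and check which outputs move. In this sketch the weights are hypothetical and built by hand to be modular, so it demonstrates only the measurement, not the finding that trained networks end up this way.

```python
def forward(x, w_in, w_out, mask):
    """Tiny two-layer ReLU network; mask[i] = 0 ablates hidden unit i."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x))) * m
        for row, m in zip(w_in, mask)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]

# Hypothetical hand-built weights: hidden units 0-1 feed only output 0,
# hidden units 2-3 feed only output 1.
w_in = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.5, -0.5]]
w_out = [
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]

x = [2.0, 1.0]
full = forward(x, w_in, w_out, mask=[1, 1, 1, 1])
ablated = forward(x, w_in, w_out, mask=[0, 0, 1, 1])

# Per-output change caused by ablating the first subnetwork.
effect = [abs(f - a) for f, a in zip(full, ablated)]
print(effect)
```

An ablation that changes one output and leaves the other untouched is the signature of an isolated subnetwork. In a real model the mask would come from a pruning procedure rather than being chosen by hand.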
But a sharp caution runs the other way. Instruction-tuning experiments show that what a model appears to 'generalize' can be an artifact of learning the output-space distribution rather than task understanding: models trained on semantically empty or wrong instructions match those trained on correct ones (see "Does instruction tuning teach task understanding or output format?"). So any task-free predictor that keys off surface fluency will mistake format-matching for genuine OOD competence. A distribution-level proxy brings its own complication: distance from the training data is not the only distance that matters. Staying close to the base distribution (low KL drift) preserves plasticity and continued adaptability (see "Does staying close to the base model preserve learning ability?"), which suggests proximity-to-base is itself a measurable, training-time predictor of how well a model will keep generalizing, with no downstream task required.
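KL drift from the base model can be estimated from output distributions alone. The sketch below averages KL(tuned || base) over a set of prompts. The direction of the divergence, the three-token vocabulary, and every probability value are assumptions chosen for illustration; the source says only that low KL drift was measured, not how.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q is nonzero wherever p is nonzero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_kl_drift(tuned, base):
    """Average KL(tuned || base) over prompts; each row is one next-token distribution."""
    return sum(kl_divergence(t, b) for t, b in zip(tuned, base)) / len(base)

# Hypothetical next-token distributions for two prompts over a 3-token vocabulary.
base = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
tuned_small_drift = [[0.55, 0.33, 0.12], [0.22, 0.5, 0.28]]
tuned_large_drift = [[0.05, 0.05, 0.9], [0.9, 0.05, 0.05]]

small = mean_kl_drift(tuned_small_drift, base)
large = mean_kl_drift(tuned_large_drift, base)
print(small < large)
```

Because the measurement needs only the two models and a set of prompts, it can be tracked during training. Whether a given drift value actually predicts later adaptability is the empirical claim made by the cited note, not something this sketch establishes.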
The synthesis worth leaving with: the corpus doesn't have a single 'OOD predictor' paper, but it converges on a usable principle. You can predict out-of-distribution behavior without downstream tasks *if* you predict from the right level — the autoregressive computation, the distance-decay curve, the modular subnetworks, the KL distance from base — and *not* from output fluency, which is precisely the thing that lies to you. The open question the corpus surfaces but doesn't close: whether the structural signals (subnetworks, shared heads) and the distributional signals (KL drift, autoregressive probability) are two views of the same predictor or two competing ones.
Sources (6 notes)
By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Models trained jointly on related tasks reuse the same attention heads to handle length generalization, allowing shorter tasks to extrapolate beyond their training length. Pretrained models already contain this reusable computational scaffolding.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, with a reported 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.