INQUIRING LINE

How does upward distillation transfer knowledge from smaller to larger networks?

This asks about "upward distillation," meaning the transfer of knowledge from a small model up into a larger one. One caveat first: the corpus has very little on that exact direction, and what it does have inverts the premise in interesting ways.


This explores whether knowledge can flow *up*, from a smaller network into a larger one, which is the reverse of how distillation normally works. The standard picture in this collection runs the other way: a big, well-informed teacher compresses what it knows into a smaller student. The note "Does richer teacher context hurt student generalization?" shows that even within that conventional setup the transfer is lossy in subtle ways: a teacher that sees the correct answer and verifier output hands the student confident, concise reasoning traces, but that confidence suppresses the student's ability to express uncertainty and quietly degrades its performance on out-of-distribution problems. So even "downward" distillation isn't a clean copy; it transmits style and disposition, not just facts.
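To make the "standard picture" concrete, here is a minimal sketch of the classic soft-target distillation loss, where a student is pulled toward a teacher's temperature-softened output distribution. This is an assumption-laden illustration: the note above concerns distillation through reasoning traces, not logit matching, and the logits, temperature, and function names below are hypothetical. It only shows why a very confident teacher hands over low-entropy targets.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; lower means a more confident distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Illustrative logits only: one teacher is near-certain, the other hedges.
confident_teacher = [8.0, 0.0, 0.0]
hedged_teacher = [1.0, 0.5, 0.0]
student = [1.0, 1.0, 1.0]  # an undecided student (uniform over three options)

# The confident teacher's targets carry less entropy than the hedged teacher's.
print(entropy(softmax(confident_teacher)) < entropy(softmax(hedged_teacher)))  # True
# An undecided student is penalized more for disagreeing with the confident teacher.
print(distillation_loss(confident_teacher, student)
      > distillation_loss(hedged_teacher, student))  # True
```

In this toy setting, minimizing the loss against the confident teacher pushes the student toward a sharp distribution, which is one simple way to picture a student inheriting confidence along with the answer.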

The more provocative thread here is that the collection keeps questioning whether bigger is the thing you'd even want to distill *toward*. A single 7M-parameter two-layer network, recursing on its own latent reasoning state, beats DeepSeek R1, o3-mini, and Gemini 2.5 Pro on ARC puzzles with roughly 0.01% of their parameters (see "Can tiny recursive networks outperform massive language models?"). If a tiny model can out-reason giant ones, the interesting transfer question isn't "how does small teach large" but "what does small *have* that large lacks." The answer there is a structural trick (recursion on latent state), not distilled knowledge.
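The control flow behind "recursion on latent state" can be sketched in a few lines. This is a toy under stated assumptions, not the paper's architecture: the scalar weights are untrained and hypothetical, and the split into an input `x`, a running answer `y`, and a latent `z` is an illustrative choice. What it shows is the one structural point from the paragraph above: the same small set of parameters is reused over many refinement steps instead of adding more parameters.

```python
import math

def refine_latent(x, y, z, w):
    """One pass of a hypothetical tiny network: update latent z from (x, y, z)."""
    return [math.tanh(w[0] * xi + w[1] * yi + w[2] * zi)
            for xi, yi, zi in zip(x, y, z)]

def update_answer(y, z, w):
    """Blend the current answer y with the refined latent z."""
    return [w[3] * yi + w[4] * zi for yi, zi in zip(y, z)]

def recursive_reason(x, w, n_latent=6, n_cycles=3):
    """Reuse the same weights w for n_cycles * n_latent latent updates."""
    y = [0.0] * len(x)  # running answer
    z = [0.0] * len(x)  # latent reasoning state
    for _ in range(n_cycles):
        for _ in range(n_latent):
            z = refine_latent(x, y, z, w)
        y = update_answer(y, z, w)
    return y

# Five hypothetical scalar weights stand in for the whole "network".
weights = [0.3, 0.2, 0.5, 0.5, 0.5]
print(recursive_reason([0.5, -0.2], weights))
```

Effective depth here comes from `n_latent` and `n_cycles`, not from parameter count, which is the sense in which the capability is structural rather than something that could be copied over as weights.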

The closest thing the corpus offers to small-feeding-large is aggregation rather than distillation. Routing queries across a panel of small specialists outperforms a single frontier model: ten 7B models with a router beat GPT-4.1 and GPT-4.5, and Avengers-Pro beats GPT-5-medium by sending each query to its best-suited model (see "Can routing beat building one better model?"). Here the capability of many small networks gets composed into something larger-acting, but through selection at inference time, not by pouring their weights into a bigger net. Selection, the work suggests, is a stronger lever than scale.
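A minimal sketch of routing by selection, assuming queries are already embedded and grouped into semantic clusters. The centroids, model names, and per-cluster accuracy figures below are made up for illustration and are not taken from Avengers-Pro; the point is only that no weights move between models.

```python
import math

# Hypothetical cluster centroids in a 2-D query-embedding space.
CENTROIDS = {
    "math": (1.0, 0.0),
    "code": (0.0, 1.0),
}

# Hypothetical validation accuracy of each small model within each cluster.
CLUSTER_ACCURACY = {
    "math": {"model_a": 0.81, "model_b": 0.64},
    "code": {"model_a": 0.58, "model_b": 0.77},
}

def nearest_cluster(embedding):
    """Return the name of the centroid closest to the query embedding."""
    return min(CENTROIDS, key=lambda name: math.dist(embedding, CENTROIDS[name]))

def route(embedding):
    """Pick the model with the best recorded accuracy in the query's cluster."""
    scores = CLUSTER_ACCURACY[nearest_cluster(embedding)]
    return max(scores, key=scores.get)

print(route((0.9, 0.2)))  # model_a
print(route((0.1, 0.8)))  # model_b
```

A cost-aware variant could rank models by a weighted mix of accuracy and price per query instead of accuracy alone; that weighting is a design choice this sketch leaves out.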

There's also a representational angle on what actually transfers well between systems. Discrete codes move across domains better than raw text embeddings because the discrete bottleneck strips out source-specific bias (see "Can discrete codes transfer better than text embeddings?"), and predicting latent states is exponentially more sample-efficient than predicting tokens because same-level latents are far more correlated than surface tokens (see "Why is predicting latents more sample-efficient than tokens?"). If anyone *were* to build genuine upward distillation, these point at the lever: transfer at the level of compact latent or coded structure, not surface outputs.
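The discrete bottleneck can be pictured with a toy product quantizer: a dense vector is split into sub-vectors, and each sub-vector is replaced by the index of its nearest codeword. The codebooks and the example embedding below are hypothetical values chosen by hand; in practice codebooks are learned (for example with k-means per sub-space), which this sketch does not do.

```python
import math

# Hypothetical codebooks: 2 sub-spaces, each with 3 two-dimensional codewords.
CODEBOOKS = [
    [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)],
    [(0.5, 0.5), (-0.5, -0.5), (2.0, 0.0)],
]

def pq_encode(vector, codebooks=CODEBOOKS):
    """Map a dense vector to one discrete code index per sub-space."""
    sub_dim = len(vector) // len(codebooks)
    codes = []
    for i, book in enumerate(codebooks):
        sub = vector[i * sub_dim:(i + 1) * sub_dim]
        # Index of the codeword closest to this sub-vector.
        codes.append(min(range(len(book)), key=lambda k: math.dist(sub, book[k])))
    return codes

text_embedding = [0.9, 1.2, -0.4, -0.6]  # stand-in for an item's text embedding
print(pq_encode(text_embedding))  # [1, 1]
```

Only the code indices cross the boundary between systems. Each receiving domain can then attach its own embedding table to those indices, which is roughly the per-domain fine-tuning the VQ-Rec note describes.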

So the honest synthesis is that this collection doesn't document upward distillation as a working technique. It documents reasons the premise is shakier than it sounds: small models already out-reason large ones on ARC puzzles; routing across small models beats a single frontier model without fusing any weights; and knowledge in transformers is flowing activation rather than a portable store (see "Do transformer models store knowledge or generate it continuously?"), which is part of why moving it between networks is hard at all. The thing worth walking away knowing: the field's energy is going into *selecting and composing* small capable models rather than distilling them upward into bigger ones.


Sources (6 notes)

Does richer teacher context hurt student generalization?

Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.

Can tiny recursive networks outperform massive language models?

A single 7M-parameter two-layer network recursing on its latent reasoning state achieves 45% on ARC-AGI-1 and 8% on ARC-AGI-2, beating DeepSeek R1, o3-mini, and Gemini 2.5 Pro with 0.01% of their parameters. Recursion on latent state, not scale or hierarchy, drives the generalization gain.

Can routing beat building one better model?

Avengers-Pro achieves 7% higher accuracy than GPT-5-medium by routing queries to optimal models per semantic cluster, or matches its performance at 27% lower cost. Ten 7B models with routing previously surpassed GPT-4.1 and 4.5, suggesting selection is a stronger lever than scaling.

Can discrete codes transfer better than text embeddings?

VQ-Rec demonstrates that mapping item text to discrete codes via product quantization, then to embeddings, improves cross-domain transfer compared to direct text encoding. The discrete intermediate reduces text bias and enables efficient per-domain fine-tuning.

Why is predicting latents more sample-efficient than tokens?

A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponential samples—because same-level latents are far more correlated than raw tokens.

Do transformer models store knowledge or generate it continuously?

Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tracking knowledge transfer in neural networks. The question remains open: *can smaller networks meaningfully teach larger ones, and if so, how?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026; treat these as time-stamped constraints to re-examine:
- Standard downward distillation is lossy: a teacher's confident traces suppress student uncertainty and degrade out-of-distribution robustness (~2026).
- A 7M-parameter two-layer network recursing on its latent reasoning state outperforms DeepSeek R1, o3-mini, and Gemini 2.5 Pro on ARC puzzles (~2025).
- Test-time routing across small models beats frontier ones: ten 7B models with a router surpassed GPT-4.1 and GPT-4.5, and Avengers-Pro beats GPT-5-medium; composition happens via selection, not weight fusion (~2025).
- Discrete codes transfer across domains more robustly than raw embeddings; predicting latent states is exponentially more sample-efficient than token prediction (~2025–2026).
- Knowledge in transformer residual streams flows as activation, not portable storage, making inter-network transfer structurally hard (~2024).

Anchor papers (verify; mind their dates):
- arXiv:2510.04871 (2025-10): "Less is More: Recursive Reasoning with Tiny Networks"
- arXiv:2508.12631 (2025-08): "Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing"
- arXiv:2605.27734 (2026-05): "Learn from your own latents and not from tokens: A sample-complexity theory"
- arXiv:2603.24472 (2026-03): "Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?"

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding, ask: have newer inference harnesses, orchestration patterns (multi-agent, agentic routing), or training methods since relaxed the constraint? Does a tiny model's recursion still outpace scale, or do frontier models now match it? Does routing still beat fusion? Separate the durable question (likely: *what structural property, not weight transfer, moves capability?*) from the perishable limitation. Cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months—any paper arguing upward distillation *does* work, or that fusion outperforms routing, or that token-level transfer is viable.
(3) Propose 2 research questions that ASSUME the regime has shifted: e.g., "If tiny models already reason better via recursion than large models via scale, what is upward distillation trying to solve?" or "Can a large model learn the *algorithmic structure* of a small model's recursion without copying weights?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
