INQUIRING LINE

How does adjacent layer sharing differ from non-adjacent weight reuse?


This explores weight sharing in transformers (reusing the same parameters in more than one place) and specifically whether it matters that the reused layers are neighbors rather than spread across the network. The corpus has a sharp, concrete answer for the adjacent case and treats the non-adjacent case mostly by implication, so it's worth saying that up front.

The clearest finding is that adjacent sharing wins for a reason that has nothing to do with the math of the model and everything to do with the hardware. MobileLLM shares weights between two consecutive transformer blocks by running the same block twice in a row, and on memory-bound mobile devices, recomputing that block costs *less* than fetching a separate block's weights from memory (see "Does recomputing weights cost less than moving them on mobile?"). The weights for the block you just ran are already sitting in fast cache. Reuse them immediately and you skip a slow memory trip entirely. That locality is the whole trick. Non-adjacent reuse, meaning pulling the same weights back in after several other layers have run, breaks it: by then those weights have likely been evicted, and you pay the memory-movement cost you were trying to avoid. So the difference isn't really 'adjacent vs. non-adjacent layers'; it's 'recompute-from-cache vs. re-fetch-from-memory.'
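The locality argument can be made concrete with a toy cache model. The sketch below is an illustration only, not MobileLLM's implementation: it assumes a hypothetical LRU cache that holds a fixed number of whole blocks, and the function name `count_weight_fetches` and the block labels are invented for this example.

```python
from collections import OrderedDict


def count_weight_fetches(schedule, cache_slots=1):
    """Count slow-memory fetches for a layer schedule under an LRU weight cache.

    `schedule` lists which block's weights each layer uses, in execution order.
    `cache_slots` is an assumed cache capacity, measured in whole blocks.
    """
    cache = OrderedDict()
    fetches = 0
    for block_id in schedule:
        if block_id in cache:
            # Hit: weights are still in fast memory, so only recompute.
            cache.move_to_end(block_id)
        else:
            # Miss: weights must be pulled from slow memory.
            fetches += 1
            cache[block_id] = True
            if len(cache) > cache_slots:
                cache.popitem(last=False)  # evict the least recently used block
    return fetches


# Six layers built from three unique blocks, each used twice.
adjacent = ["A", "A", "B", "B", "C", "C"]      # reuse immediately
non_adjacent = ["A", "B", "C", "A", "B", "C"]  # reuse after other blocks ran

print(count_weight_fetches(adjacent))      # 3
print(count_weight_fetches(non_adjacent))  # 6
```

Under the one-block cache assumed here, adjacent reuse halves the number of fetches. If the cache were large enough to hold all three blocks (`cache_slots=3`), both schedules would need only three fetches, which is why the advantage is tied to memory-constrained hardware rather than to the model itself.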

There's a deeper architectural reason adjacency is a natural fit, too. The same MobileLLM work shows that for small models, depth beats width: stacking more layers lets the network compose abstract concepts step by step, which matters more than spreading parameters sideways (see "Does depth matter more than width for tiny language models?"). If consecutive layers are doing closely related compositional work, reusing a block across that short span costs you little in capability while doubling effective depth with no added parameters. Reuse across distant layers asks one set of weights to do two unrelated jobs, which is a harder bargain.
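To show what "doubling effective depth with no added parameters" means structurally, here is a minimal sketch. `ToyBlock` is a hypothetical stand-in for a transformer block (a residual elementwise scaling, nothing like real attention), and `forward`, `param_count`, and `effective_depth` are names invented for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyBlock:
    """Hypothetical stand-in for a transformer block with its own weights."""
    weights: List[float]

    def __call__(self, x: List[float]) -> List[float]:
        # Residual form x + f(x), where f is a trivial elementwise scaling.
        return [xi + wi * xi for xi, wi in zip(x, self.weights)]


def forward(blocks: List[ToyBlock], x: List[float], repeats: int = 1) -> List[float]:
    """Run each unique block `repeats` times back to back (adjacent sharing)."""
    for block in blocks:
        for _ in range(repeats):
            x = block(x)
    return x


def param_count(blocks: List[ToyBlock]) -> int:
    """Parameters are counted once per unique block, however often it runs."""
    return sum(len(b.weights) for b in blocks)


def effective_depth(blocks: List[ToyBlock], repeats: int) -> int:
    """Number of block applications in one forward pass."""
    return len(blocks) * repeats


blocks = [ToyBlock([1.0, 0.0]), ToyBlock([0.0, 1.0])]
print(param_count(blocks))                 # 4, with or without sharing
print(effective_depth(blocks, repeats=1))  # 2
print(effective_depth(blocks, repeats=2))  # 4
```

The parameter count depends only on the unique blocks, while the depth scales with `repeats`. The extra cost is compute, since every repeated application still has to be executed.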

The broader theme the corpus keeps circling is that the real bottleneck in modern models is often the balance between compute and memory movement, not raw capacity. Long-context work makes the same point from a different angle, finding that the limiting factor is the compute needed to consolidate information into internal state rather than the memory to hold it (see "Is long-context bottleneck really about memory or compute?"). Adjacent sharing is one clean exploit of that asymmetry: pay for cheap recomputation to avoid expensive data movement. One honest caveat: any aggressive reuse carries the risk that a model looks fine on metrics while its internal representations are quietly fractured, a failure standard evaluation misses (see "Can models be smart without organized internal structure?"). The corpus doesn't have a dedicated study comparing adjacent and non-adjacent reuse head-to-head, so if that exact comparison is what you're after, this is where the trail goes cold. Still, the locality argument tells you why the adjacent case is the one that pays off.
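A back-of-envelope calculation shows why recomputation can be the cheap side of that trade during single-token decoding. The sketch below uses the common approximation of about 2 FLOPs per parameter per token. Every hardware number in it is a hypothetical placeholder chosen for illustration, not a measurement of any real device or a figure from the MobileLLM paper.

```python
def block_step_times(params, bytes_per_param, mem_bandwidth, peak_flops):
    """Estimate per-token time to fetch a block's weights vs. to compute with them.

    Assumes batch size 1 and roughly 2 FLOPs (one multiply, one add) per parameter.
    Returns (fetch_seconds, compute_seconds).
    """
    fetch_seconds = params * bytes_per_param / mem_bandwidth
    compute_seconds = 2 * params / peak_flops
    return fetch_seconds, compute_seconds


# Hypothetical placeholder values, for illustration only.
fetch_s, compute_s = block_step_times(
    params=10_000_000,    # parameters in one block
    bytes_per_param=2,    # 16-bit weights
    mem_bandwidth=50e9,   # bytes per second from main memory
    peak_flops=1e12,      # floating-point operations per second
)

print(f"fetch:   {fetch_s * 1e6:.0f} microseconds")
print(f"compute: {compute_s * 1e6:.0f} microseconds")
print(f"fetch / compute ratio: {fetch_s / compute_s:.0f}x")
```

Note that the ratio works out to `bytes_per_param * peak_flops / (2 * mem_bandwidth)`, so it does not depend on block size. Whenever a device's compute throughput is high relative to its memory bandwidth, running an already-cached block a second time adds only the compute term, while a block that has to be re-fetched pays the larger fetch term again.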


Sources (4 notes)

Does recomputing weights cost less than moving them on mobile?

MobileLLM shows that on memory-bound mobile hardware, sharing weights between adjacent transformer blocks by computing one block twice incurs less latency than fetching separate weights, and it improves accuracy with no parameter increase.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher auditing weight-sharing strategies in transformers. The question: does adjacent layer sharing outperform non-adjacent reuse, and why? A curated library (2022–2025) found the answer is rooted in hardware memory locality, not model capacity—but those findings are now 1–3 years old.

What a curated library found — and when (dated claims, not current truth):
• Adjacent weight sharing wins on memory-bound devices because recomputing a block costs less than fetching weights from main memory; weights stay in L1/L2 cache (MobileLLM, 2024-02).
• Non-adjacent reuse breaks cache locality—by the time you need those weights again, they've been evicted, forcing expensive re-fetch (~2024).
• For sub-billion models, depth (stacking layers with shared weights) beats width; adjacent sharing trades cheap recomputation for expensive data movement (~2024).
• A deeper risk: models may show identical metrics while internal representations become fractured under aggressive reuse—standard evals miss this failure mode (~2024).
• The broader bottleneck across long-context and small-model work is compute-to-consolidate-state, not memory capacity itself (~2024).

Anchor papers (verify; mind their dates):
• arXiv:2402.14905 — MobileLLM (2024-02): adjacent sharing + depth > width
• arXiv:2511.13653 — Weight-sparse transformers (2025-11): sparsity + interpretability
• arXiv:2512.24601 — Recursive Language Models (2025-12): recursive depth strategies
• arXiv:2507.16784 — Subconscious Threads (2025-07): latent state consolidation

Your task:
(1) RE-TEST THE CACHE LOCALITY CLAIM. Has newer hardware (e.g., unified memory, newer GPU/TPU cache hierarchies, inference-optimized chips) shifted the memory-movement bottleneck? Do recent efficient inference SDKs (vLLM, TensorRT, ExecuTorch) and layer-wise orchestration patterns (token-level batching, speculative decoding) relax the adjacent-only constraint? Cite what has and hasn't changed.
(2) Surface work from the last 6 months on WEIGHT SHARING under different compute regimes (sparse, quantized, MoE, recursive). Does any contradict or supersede the locality thesis?
(3) Propose two research questions: (a) Does non-adjacent reuse improve if you add explicit cache-coherence or prefetch signaling? (b) Can you decouple adjacency (spatial) from temporal reuse patterns—i.e., reuse a weight block at distant layers if you control *when* it's fetched?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
