INQUIRING LINE

How does layer removal affect transformers compared to ResNets?

This explores what happens when you delete layers from a transformer — and whether the corpus supports the well-known finding that transformers (like ResNets) tolerate layer removal because their residual connections make each layer an incremental edit rather than a load-bearing pillar.


This explores what happens when you delete layers from a transformer versus a ResNet — and here's the honest framing up front: the collection has no note that runs the head-to-head layer-removal experiment on both architectures, so I can't hand you that direct comparison. What the corpus *does* hold is the mechanism that makes the comparison interesting in the first place, scattered across several notes that never mention ResNets by name.

The shared ingredient is the residual stream. One note reframes the transformer's residual pathway as a channel where knowledge is a continuous *flow* of activations rather than something stored in any single layer (Do transformer models store knowledge or generate it continuously?). That's the same architectural trick ResNets introduced: each layer reads the running sum, adds a small correction, and writes it back. When every layer is an incremental edit on a shared bus rather than an irreplaceable stage in a pipeline, removing one layer perturbs the sum a little instead of severing it, which is the standard explanation for why both families degrade gracefully under ablation instead of breaking outright.
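The "incremental edit on a shared bus" idea can be made concrete with a toy experiment. The sketch below is illustrative only and is not drawn from any note in the corpus: the layer function (`tanh(W x)`), the step size `0.1`, the weight gain, and every name in it are assumptions chosen for the demonstration. It builds one stack of random layers, runs it once as a residual network and once as a plain pipeline, and measures how far the output moves when layer 3 is skipped.

```python
import math
import random


def make_layer(dim, gain, rng):
    """Random weight matrix with entries ~ N(0, gain^2 / dim). Hypothetical toy layer."""
    std = gain / math.sqrt(dim)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(dim)]


def layer_fn(weights, x):
    """Toy layer computation: tanh(W x)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def run(layers, x, residual, skip=None, step=0.1):
    """Run the stack, optionally skipping (ablating) the layer at index `skip`."""
    for i, weights in enumerate(layers):
        if i == skip:
            continue
        update = layer_fn(weights, x)
        if residual:
            # Residual form: read the running sum, add a small correction.
            x = [xi + step * ui for xi, ui in zip(x, update)]
        else:
            # Plain pipeline: each layer fully replaces its input.
            x = update
    return x


def relative_change(full, ablated):
    """Norm of the output difference, relative to the norm of the intact output."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(full, ablated)))
    return diff / math.sqrt(sum(a * a for a in full))


rng = random.Random(0)
dim, depth = 16, 8
layers = [make_layer(dim, 2.0, rng) for _ in range(depth)]
x0 = [rng.gauss(0.0, 1.0) for _ in range(dim)]

residual_change = relative_change(
    run(layers, x0, residual=True), run(layers, x0, residual=True, skip=3)
)
plain_change = relative_change(
    run(layers, x0, residual=False), run(layers, x0, residual=False, skip=3)
)

print(f"residual stack, layer 3 removed: relative change {residual_change:.3f}")
print(f"plain stack, layer 3 removed:    relative change {plain_change:.3f}")
```

In this toy setting the residual output shifts by a small fraction of its norm, while the plain pipeline's output changes by an amount comparable to the output itself. Random untrained layers say nothing about task accuracy, so this shows only the geometric intuition, not the behavior of a trained transformer or ResNet.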

The corpus also suggests *why* some layers are more deletable than others. Adjacent transformer blocks turn out to be redundant enough that you can share weights between them, recomputing one block twice in place of fetching a second, and gain accuracy without adding parameters (Does recomputing weights cost less than moving them on mobile?). If neighboring blocks are that interchangeable, they are plausibly also the cheapest to remove. Against that, layers do carry distinct jobs: models trained with hidden chain-of-thought tokens compute correct answers in early layers and then suppress them downstream (Do transformers hide reasoning before producing filler tokens?), and multi-hop reasoning depends on structured entity representations that are acquired in developmental phases during training (How do transformers learn to reason across multiple steps?). So removal isn't uniform: deleting a redundant middle block is likely survivable, while deleting the early layers where the computation actually happens is likely not.
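As a rough illustration of what "recomputing one block twice in place of fetching a second" means structurally, here is a minimal sketch. It is not MobileLLM's implementation: the `Block` class, the `build_schedule` helper, the residual step size, and the dimensions are all hypothetical stand-ins. The point is only that the execution order can be twice as deep as the set of stored weights.

```python
import math
import random


class Block:
    """Toy residual block with its own weight matrix (stand-in for a transformer block)."""

    def __init__(self, dim, rng):
        std = 1.0 / math.sqrt(dim)
        self.weights = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(dim)]

    def __call__(self, x):
        update = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in self.weights]
        return [xi + 0.1 * ui for xi, ui in zip(x, update)]

    def num_params(self):
        return sum(len(row) for row in self.weights)


def build_schedule(unique_blocks, repeats):
    """Execution order in which each stored block runs `repeats` times back to back."""
    return [block for block in unique_blocks for _ in range(repeats)]


rng = random.Random(0)
dim = 8
unique_blocks = [Block(dim, rng) for _ in range(3)]
schedule = build_schedule(unique_blocks, repeats=2)

# Forward pass: six block applications, but only three sets of weights in memory.
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
for block in schedule:
    x = block(x)

effective_depth = len(schedule)
stored_params = sum(block.num_params() for block in unique_blocks)
unshared_params = effective_depth * dim * dim  # cost if every position had its own weights

print(f"effective depth: {effective_depth}")
print(f"stored parameters: {stored_params} (vs {unshared_params} without sharing)")
```

Because adjacent positions in the schedule point at the same object, the second application reuses weights that are already loaded, which is the memory-movement saving the note describes for memory-bound hardware.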

The cleanest evidence for *localized* removal effects comes from pruning experiments showing neural networks decompose tasks into modular subnetworks, where ablating one subnetwork knocks out only its specific function and leaves the rest intact (Do neural networks naturally learn modular compositional structure?). That modularity, strengthened by pretraining and observed across architectures, is the structural reason removal tends to produce graceful, targeted degradation rather than collapse.
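To show what "ablating one subnetwork knocks out only its function" looks like mechanically, the sketch below uses a hand-built two-output network and a binary mask over hidden units. This is an assumption-laden toy: the network is constructed to be modular by design, whereas the pruning experiments in the note discover such masks in trained models. The function name `forward` and the unit layout are hypothetical.

```python
def forward(x, mask):
    """Toy two-output net with a binary mask over its four hidden units.

    Hidden units 0-1 implement the 'sum' subroutine, units 2-3 the 'diff'
    subroutine. Setting a mask entry to 0 ablates that hidden unit.
    """
    a, b = x
    hidden = [
        max(0.0, a + b),     # unit 0: positive part of a + b
        max(0.0, -(a + b)),  # unit 1: negative part of a + b
        max(0.0, a - b),     # unit 2: positive part of a - b
        max(0.0, -(a - b)),  # unit 3: negative part of a - b
    ]
    hidden = [h * m for h, m in zip(hidden, mask)]
    out_sum = hidden[0] - hidden[1]
    out_diff = hidden[2] - hidden[3]
    return out_sum, out_diff


x = (3.0, 1.0)
intact = forward(x, mask=(1, 1, 1, 1))
no_sum_module = forward(x, mask=(0, 0, 1, 1))
no_diff_module = forward(x, mask=(1, 1, 0, 0))

print("intact:             ", intact)
print("sum module ablated: ", no_sum_module)
print("diff module ablated:", no_diff_module)
```

Masking units 0-1 zeroes the sum output while the difference output is untouched, and vice versa. In a real network the interesting empirical question is whether such clean masks exist at all, which is what the pruning experiments test.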

The twist worth taking away: depth is not a smooth dial. Scaling self-supervised RL networks toward 1000 layers shows capabilities switching on in discontinuous jumps at *critical thresholds*: walking appears at depth 16, wall-climbing at depth 256 (Does network depth unlock qualitatively new behaviors in RL?). That result comes from training networks at different depths rather than deleting layers from a trained one, but the implication for layer removal is sharper than 'fewer layers, slightly worse': if a behavior only exists above a depth threshold, pulling layers until the network falls below that line doesn't dim the capability, it removes it abruptly. Graceful most of the time, catastrophic right at the threshold: the corpus gives you the residual-stream mechanism to understand the first regime and the depth-threshold result to understand the second.


Sources 6 notes

Do transformer models store knowledge or generate it continuously?

Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.

Does recomputing weights cost less than moving them on mobile?

MobileLLM shows that on memory-bound mobile hardware, sharing weights between adjacent transformer blocks by recomputing one block twice incurs less latency than fetching separate weights, and improves accuracy with no increase in parameters.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

How do transformers learn to reason across multiple steps?

Controlled training reveals transformers learn multi-hop reasoning in three phases: memorization, in-distribution generalization, and cross-distribution reasoning. Successful reasoning correlates with cosine clustering of entity representations, and second-hop generalization requires explicit compositional exposure during training.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Does network depth unlock qualitatively new behaviors in RL?

Scaling to 1000-layer networks in self-supervised RL produces dramatic capability jumps at specific thresholds—depth 16 enables walking, depth 256 enables wall-climbing—driven by synergistic gains in both exploration and expressivity rather than gradual improvement.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst examining layer-removal robustness in deep networks. The question: *Why do transformers and ResNets degrade gracefully when layers are deleted, and does that graceful degradation break down under certain conditions?* This remains open across both architectures.

What a curated library found — and when (dated claims, not current truth):
Findings span 2019–2025 and focus on the residual stream as the common mechanism:
• Both transformers and ResNets route knowledge through a shared residual pathway (running sum) rather than discrete layers; removing one layer perturbs the sum instead of severing computation, enabling graceful degradation (~2023–2024).
• Adjacent transformer blocks are redundant enough to share weights, improving accuracy with no increase in parameters; neighboring layers are thus plausibly among the cheapest to remove (~2024).
• Early layers perform core reasoning tasks (e.g., solving subgoals), while later layers overwrite or refine answers; deleting early layers is more damaging than deleting redundant middle blocks (~2024–2025).
• Neural networks decompose compositional tasks into modular subnetworks; ablating one subnetwork knocks out only its function, leaving others intact (~2023).
• Depth exhibits *critical thresholds*: in self-supervised RL, walking emerges at depth 16, wall-climbing at depth 256, with capabilities switching on discontinuously rather than smoothly; dropping below a threshold *removes* a capability abruptly rather than merely dimming it (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2301.10884 (2023): Break It Down (structural compositionality and modularity).
• arXiv:2402.14905 (2024): MobileLLM (sub-billion-parameter on-device language models; deep-and-thin design and block-wise weight sharing).
• arXiv:2503.14858 (2025): 1000 Layer Networks (critical thresholds in depth scaling).
• arXiv:2412.04537 (2024): Understanding Hidden Computations (layer-wise reasoning roles).

Your task:
(1) RE-TEST the "graceful degradation" claim. For transformer and ResNet families: has recent scaling (>70B param), training method (DPO, RL), or structured pruning (block removal harnesses, LoRA-aware deletion) changed when or how robustness breaks? Does the residual-stream model still hold at extreme depth or under compute constraints (mobile, inference)? Separate the durable principle (residual pathways enable incremental removal) from the perishable boundary (how many layers can you remove before threshold crossing).

(2) Surface the strongest *disagreement*: does the modularity story (arXiv:2301.10884, 2023) hold under adversarial or out-of-distribution removal, or do later papers (2024–2025) show layer interactions that contradict clean decomposition? Name any work in the last 6 months that revises the threshold model (arXiv:2503.14858).

(3) Propose two research questions that assume the regime has shifted:
  – If depth thresholds are sharp, what training procedure (curriculum, multi-task, synthetic scaffolding) *smooths* the cliff, making depth a dial instead of a switch?
  – Under which architectural or task constraints does the residual-stream graceful-degradation story *fail*, and what replaces it?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines