INQUIRING LINE

Why do naive pruning and quantization destroy LLM performance so easily?

This explores why simply chopping out weights (pruning) or lowering numerical precision (quantization) tends to wreck an LLM — and the corpus doesn't address compression methods head-on, so the honest answer is a lateral one built from what the collection *does* say about how models store and use information.


First, a caveat worth stating plainly: this collection has no paper directly testing pruning or quantization, so there's no retrieval that names your failure mode. But several notes circle the same underlying question — *how is capability actually distributed across an LLM's weights?* — and that's the thing naive compression collides with.

The sharpest angle comes from work framing learning itself as compression. One result derives optimal training from a lossless-compression objective and finds a 'Learning Law' where, in the optimal process, every training example contributes equally (see: Does optimal language model learning maximize data compression?). Read that backwards and you get an intuition for why post-hoc compression is brutal: if a well-trained model has already squeezed its data near a compression limit with information spread evenly, there's little redundant slack left to cut. Naive pruning assumes some weights are 'spare.' A model that learned by maximizing compression has comparatively few spare weights to give.
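To make 'naive' concrete, here is a minimal Python sketch of the two operations in their simplest form: magnitude pruning (zero the smallest weights) and round-to-nearest quantization with a single scale per tensor. The function names, the Gaussian toy weights, and the 50% sparsity and 4-bit settings are illustrative assumptions, not something taken from the notes in this collection.

```python
import random

def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)  # number of weights to drop
    if k == 0:
        return list(weights)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:k])      # indices of the k smallest |w|
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def quantize_rtn(weights, bits):
    """Symmetric round-to-nearest quantization, one scale for the whole tensor."""
    levels = 2 ** (bits - 1) - 1      # e.g. 7 positive levels for 4-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / levels          # the largest weight maps to the top level
    return [max(-levels, min(levels, round(w / scale))) * scale for w in weights]

def relative_error(original, compressed):
    """L2 norm of the change, relative to the L2 norm of the original."""
    num = sum((a - b) ** 2 for a, b in zip(original, compressed)) ** 0.5
    den = sum(a * a for a in original) ** 0.5
    return num / den

# Toy stand-in for a weight tensor (assumption: i.i.d. Gaussian values).
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(1000)]

pruned = magnitude_prune(weights, sparsity=0.5)
quantized = quantize_rtn(weights, bits=4)
print("pruning error:     ", relative_error(weights, pruned))
print("quantization error:", relative_error(weights, quantized))
```

Both functions treat every weight the same way and never look at what the model does with it, which is the sense of 'naive' used here. The error measured is on the weights themselves; it says nothing yet about how the change propagates to outputs.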

The MobileLLM result pushes this further. At sub-billion scale, deep-and-thin architectures beat wide ones because capability comes from *composing* abstract concepts across layers, not from spreading parameters across width (see: Does depth matter more than width for tiny language models?). If a capability is a chain through many layers rather than a localized lump, then knocking out weights or coarsening precision anywhere along the chain can break the whole composition, which is exactly the 'falls off a cliff' behavior naive compression produces. There's nothing graceful to degrade; you're snapping a link.
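The chain intuition can be checked on a toy model. The sketch below pushes one input through a stack of random linear layers twice, once at full precision and once with every layer rounded to 4 bits, and records the relative output error after each layer. Everything here is an assumption made for illustration: the layers are random and purely linear, the width and depth are arbitrary, and `error_by_depth` is a hypothetical helper. It shows only that independent per-layer rounding errors accumulate along a chain, not how a real LLM behaves.

```python
import random

def quantize_rtn(matrix, bits):
    """Symmetric round-to-nearest quantization with one scale per matrix."""
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for row in matrix for w in row)
    scale = max_abs / levels
    return [[round(w / scale) * scale for w in row] for row in matrix]

def matvec(matrix, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def norm(vec):
    return sum(x * x for x in vec) ** 0.5

def error_by_depth(width=24, depth=12, bits=4, seed=0):
    """Relative output error after each layer of a quantized linear chain."""
    rng = random.Random(seed)
    std = width ** -0.5  # keeps activation norms roughly stable across layers
    layers = [
        [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        for _ in range(depth)
    ]
    x_full = [rng.gauss(0.0, 1.0) for _ in range(width)]
    x_quant = list(x_full)
    errors = []
    for layer in layers:
        x_full = matvec(layer, x_full)                        # reference path
        x_quant = matvec(quantize_rtn(layer, bits), x_quant)  # compressed path
        diff = [a - b for a, b in zip(x_full, x_quant)]
        errors.append(norm(diff) / norm(x_full))
    return errors

errors = error_by_depth()
for depth, err in enumerate(errors, start=1):
    print(f"after layer {depth:2d}: relative error {err:.3f}")
```

In this toy setting the error after the last layer is larger than after the first, because each layer adds its own rounding error on top of an input that is already off. A real network has nonlinearities, normalization, and residual connections that this sketch ignores.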

There's also a clue that models already manage their own sparsity dynamically. Hidden states sparsify in a localized, systematic way as tasks get harder or unfamiliar, and this acts as a *stabilizing* selective filter, not a defect (see: Do language models sparsify their activations under difficult tasks?). That reframes the whole problem: the model already decides what to zero out, conditioned on the input. Naive pruning overrides that with a fixed, input-blind mask. You're freezing a decision the model needed to make per-example, which is most damaging precisely on the rare, hard, out-of-distribution inputs.
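The difference between a per-input mask and a fixed one can be sketched in a few lines. Below, `static_mask` keeps the units that matter most on average over a small calibration set, while `topk_mask` chooses the units to keep separately for each input. The activation vectors are hand-made toy numbers (an assumption, not measurements from any model), and `retained_energy` is a hypothetical metric: the share of squared activation that survives the mask.

```python
def topk_mask(activations, k):
    """Per-input mask: keep the k units with the largest magnitude."""
    order = sorted(range(len(activations)),
                   key=lambda i: abs(activations[i]), reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(len(activations))]

def static_mask(calibration, k):
    """Fixed mask: keep the k units with the largest mean |activation|
    over a calibration set, then reuse that mask for every input."""
    n = len(calibration[0])
    mean_abs = [sum(abs(row[i]) for row in calibration) / len(calibration)
                for i in range(n)]
    return topk_mask(mean_abs, k)

def retained_energy(activations, mask):
    """Fraction of squared activation that survives the mask."""
    total = sum(a * a for a in activations)
    kept = sum(a * a for a, m in zip(activations, mask) if m)
    return kept / total

# Toy data: common inputs light up units 0-3, a rare input lights up units 6-7.
calibration = [
    [3.0, 2.5, 2.0, 1.5, 0.1, 0.1, 0.0, 0.1],
    [2.8, 2.9, 1.7, 1.6, 0.2, 0.0, 0.1, 0.0],
]
rare = [0.1, 0.2, 0.0, 0.1, 0.1, 0.3, 2.5, 3.0]

fixed = static_mask(calibration, k=4)
print("common input, fixed mask:", retained_energy(calibration[0], fixed))
print("rare input, fixed mask:  ", retained_energy(rare, fixed))
print("rare input, own mask:    ", retained_energy(rare, topk_mask(rare, 4)))
```

With the same budget of four units, the fixed mask keeps almost all of the signal on the common input and almost none on the rare one, while the per-input mask keeps almost all of it in both cases. Removing a hidden unit by weight pruning has the same effect as masking that unit for every input, which is why the toy uses activation masks.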

The quiet kicker is that these effects hide until they don't. Models can corrupt a quarter of a document over a long workflow without ever plateauing or signaling trouble (see: Do frontier LLMs silently corrupt documents in long workflows?), and they routinely fail on low-probability targets that are logically trivial (see: Can we predict where language models will fail?). By extension, a compressed model could look fine on common prompts and quietly collapse on the rare ones, the same long tail where that adaptive sparsity and layer-wise composition mattered most. So the deeper lesson the corpus points to: 'destroys performance so easily' may be less about compression being violent and more about capability being more distributed, compositional, and input-conditional than a one-size mask or fixed bit-width can respect.
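One practical consequence is that a single aggregate score can hide tail damage. The sketch below reports accuracy per bucket next to the overall number. The bucket labels and the counts are made-up toy values chosen only to show the arithmetic; they are not results from any paper or model, and `accuracy_by_bucket` is a hypothetical helper.

```python
from collections import defaultdict

def overall_accuracy(records):
    """records: list of (bucket, correct) pairs."""
    return sum(int(correct) for _, correct in records) / len(records)

def accuracy_by_bucket(records):
    """Accuracy computed separately for each bucket label."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for bucket, correct in records:
        totals[bucket] += 1
        hits[bucket] += int(correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}

# Synthetic evaluation log (assumption): 95 common prompts, 5 rare ones.
records = (
    [("common", True)] * 93 + [("common", False)] * 2
    + [("rare", True)] * 1 + [("rare", False)] * 4
)

print("overall:", overall_accuracy(records))
print("by bucket:", accuracy_by_bucket(records))
```

In this toy log the overall accuracy is 94% while the rare bucket sits at 20%, so a check that looks only at the aggregate would miss the collapse.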


Sources (5 notes)

Does optimal language model learning maximize data compression?

Research shows that optimal LM training can be derived from a lossless compression objective, yielding a Learning Law where all examples contribute equally in the optimal process. This approach improves scaling law coefficients, not just constants.

Does depth matter more than width for tiny language models?

MobileLLM reports accuracy gains of 2.7% and 4.3% over prior state-of-the-art models at 125M and 350M scale, respectively, and finds that deep-and-thin architectures outperform wider ones at this scale by composing abstract concepts through layers rather than spreading parameters across width.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher re-evaluating the claim that naive pruning and quantization destroy LLM performance because capability is distributed across weights rather than localized. A curated library (2023–2026) reported the findings below, each with an approximate date; treat them as dated claims, not current truth:

• Training that maximizes compression may leave little redundant 'slack' to prune post-hoc; in the optimal learning process, every training example contributes equally (~2024).
• Sub-billion models encode capability through *layer-wise composition* of abstract concepts, not parameter width, so fixed-mask pruning snaps capability chains rather than shaving redundancy (~2024).
• Models dynamically sparsify hidden states per-input as an adaptive stabilizing filter; naive pruning replaces this input-conditional sparsity with a static, one-size mask (~2026).
• Models silently corrupt documents over long workflows and fail on low-probability targets; by extension, compressed models may degrade on long-tail inputs without signaling trouble on common prompts (~2026).

Anchor papers (verify; mind their dates): arXiv:2402.17759 (Towards Optimal Learning, 2024), arXiv:2402.14905 (MobileLLM, 2024), arXiv:2603.03415 (Farther the Shift, Sparser the Representation, 2026), arXiv:2604.15597 (LLMs Corrupt Your Documents, 2026).

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer inference engines (quantized SDKs, low-rank adapters, dynamic/adaptive compression, structured pruning), training methods (compression-aware pre-training, mixed-precision tuning), or evaluation harnesses (long-tail benchmarks, multi-turn fidelity metrics) have since *relaxed* or *overturned* it. Separate the durable question—why are LLMs sensitive to weight perturbation?—from the perishable claim—that this is because compression already saturates redundancy. Cite what resolved any constraint; say plainly where it still appears to hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has anyone demonstrated graceful, lossless sub-4-bit quantization or 70% pruning on frontier models without cliff-drop?
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., *Can input-conditional, layer-wise compression masks trained end-to-end beat fixed pruning?* or *Do newer architectures (MoE, mixture of submodels) distribute capability differently, making them more robust to naive compression?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
