INQUIRING LINE

Bigger AI models predict text better — but once you count the model itself as overhead, they actually start compressing data worse.

Why does adjusted compression performance degrade as models scale larger?

This explores why, once you count the cost of the model's own parameters as part of the compressed message, bigger models start compressing worse rather than better — the two-part code view of language-modeling-as-compression.

This question reads against the well-known framing that language modeling *is* compression: a model that predicts the next token well can drive an arithmetic coder to squeeze text (or even images and audio) into very few bits. The catch is what "adjusted" means. Raw compression keeps improving with scale, but *adjusted* compression charges you for transmitting the compressor itself: model plus encoded data, the classic minimum-description-length account. Under that accounting, a larger model has to justify every extra parameter with a matching reduction in the data's encoded size, and past a point it can't. The parameter cost grows faster than the savings in encoded data, so the total message gets longer even as per-token prediction tightens. The corpus's anchor here is the result that text-trained Chinchilla models can out-compress PNG and FLAC purely through in-context adaptation (see "Can text-trained models compress images better than specialized tools?"). That win comes from *in-context* conditioning, not parameter count, which is exactly why the adjusted curve and the raw curve part ways.
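The two-part accounting described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any cited paper: the function names are hypothetical, the model cost is approximated as `n_params * bytes_per_param` with an assumed 2 bytes per parameter, and the byte counts at the bottom are made-up numbers chosen only to show the shape of the effect.

```python
import math


def ideal_code_length_bits(token_probs):
    """Ideal code length in bits that an arithmetic coder approaches when
    driven by the probability the model assigned to each actual next token."""
    return sum(-math.log2(p) for p in token_probs)


def compression_rates(raw_bytes, compressed_bytes, n_params, bytes_per_param=2):
    """Return (raw_rate, adjusted_rate) as fractions of the uncompressed size.

    raw_rate counts only the encoded data. adjusted_rate also charges for the
    model itself, approximated here as n_params * bytes_per_param bytes
    (an assumption made for this sketch).
    """
    model_bytes = n_params * bytes_per_param
    raw_rate = compressed_bytes / raw_bytes
    adjusted_rate = (compressed_bytes + model_bytes) / raw_bytes
    return raw_rate, adjusted_rate


# Illustrative, made-up numbers (not measurements): 1 GB of data, two models.
RAW_BYTES = 1_000_000_000
small = compression_rates(RAW_BYTES, compressed_bytes=200_000_000, n_params=10_000_000)
large = compression_rates(RAW_BYTES, compressed_bytes=100_000_000, n_params=1_000_000_000)

print("small model: raw=%.2f adjusted=%.2f" % small)
print("large model: raw=%.2f adjusted=%.2f" % large)
```

With these invented inputs the larger model wins on the raw rate but loses on the adjusted rate, because its parameter bytes alone exceed the size of the data it compresses.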

What's quietly interesting is that the corpus keeps surfacing the same shape from unrelated directions: capability that flattens or even reverses with scale once you account for cost. LLMs plateau around 55–60% on genuine constraint-satisfaction tasks regardless of parameter count (see "Do larger language models solve constrained optimization better?"), and instruction-following degrades in predictable patterns that scale doesn't rescue (see "How does instruction density affect model performance?"). The adjusted-compression dip is the information-theoretic cousin of these ceilings: evidence that raw size buys less than the scaling-law story implies once you measure the right quantity.

There's also a representational reason bigger isn't freely better. LLMs already compress *aggressively* relative to humans, capturing broad category structure while discarding the fine-grained, situation-specific distinctions people preserve (see "Do LLMs compress concepts more aggressively than humans do?"). A model tuned to maximize statistical compression is, in a sense, already at the efficient frontier for the bulk of the distribution; extra parameters mostly chase the long tail, where each marginal bit saved costs disproportionately many parameters to store. That's the same economics that makes the adjusted curve turn.

The corpus also hints at where the gains have migrated. If parameters are an expensive way to buy the last bits, then architecture and inference compute become the cheaper levers. Depth-over-width at small scale composes abstract concepts through layers instead of paying for width (see "Does depth matter more than width for tiny language models?"), retrieval distributions can be folded into a small parametric decoder that preserves long-tail facts without a giant datastore (see "Can retrieval knowledge compress into a tiny parametric model?"), and test-time compute can substitute for parameter scaling outright on hard prompts (see "Can inference compute replace scaling up model size?"). Each is a way of getting compression-equivalent capability without paying the parameter tax that drags the adjusted curve down.

So the short answer: adjusted compression degrades at scale not because big models predict worse, but because you're now billing them for their own size, and the bits each extra parameter saves shrink until they no longer cover the cost of storing that parameter. The deeper takeaway (the thing you didn't know you wanted to know) is that this is the same wall showing up as constraint-satisfaction plateaus, instruction-following decay, and the recent pivot toward depth, retrieval distillation, and inference-time compute as the places where capability is now actually cheaper to buy.
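As a rough sketch of that billing argument, suppose the total message is the parameter cost plus the encoded data. The symbols are illustrative assumptions rather than notation from the source: $P$ is the parameter count, $b$ the bits charged per parameter, $N$ the number of tokens in the data, and $\ell(P)$ the average bits per token achieved by a $P$-parameter model.

```latex
\[
  L_{\text{total}}(P) = b\,P + N\,\ell(P)
\]
\[
  L_{\text{total}}(P_2) < L_{\text{total}}(P_1)
  \iff
  N\,\bigl[\ell(P_1) - \ell(P_2)\bigr] > b\,(P_2 - P_1),
  \qquad P_2 > P_1
\]
```

Under this simplified accounting, growing the model pays off only while the bits saved across all $N$ tokens exceed the bits added in parameters. Since $\ell$ cannot fall below zero, the left side is bounded for a fixed dataset while the right side keeps growing with $P_2$, so the inequality must eventually fail.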

Sources (7 notes)

Can text-trained models compress images better than specialized tools?

Chinchilla models trained primarily on text achieve better compression rates on images and audio than PNG and FLAC, respectively, by using their context window to adapt as task-specific compressors. This suggests that generalization operates through compression, not specialization.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

How does instruction density affect model performance?

IFScale benchmark shows three degradation patterns: linear (small models), exponential (mid-range), and threshold decay (reasoning models maintain ~150 instructions then fail steeply). Even best models reach only 68% accuracy at maximum density.

Do LLMs compress concepts more aggressively than humans do?

Using Rate-Distortion Theory on cognitive datasets, LLMs capture broad category structure but lose fine-grained distinctions humans preserve. LLMs maximize compression efficiency; humans trade compression for contextual meaning that enables situated action.

Does depth matter more than width for tiny language models?

MobileLLM shows that a deep-and-thin architecture, combined with embedding sharing and grouped-query attention, yields 2.7%/4.3% accuracy gains over preceding state-of-the-art models at 125M/350M scale, composing abstract concepts through layers rather than spreading parameters across width.

Can retrieval knowledge compress into a tiny parametric model?

Memory Decoder successfully compresses kNN-LM retrieval distributions into a small transformer that plugs into any LLM via output interpolation. It preserves long-tail factual knowledge while maintaining semantic coherence, reducing perplexity by 6.17 points across domains.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.
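The "output interpolation" mentioned in the Memory Decoder note can be illustrated with the kNN-LM-style mixing rule, in which two next-token distributions over the same vocabulary are blended by a weight. This is a minimal sketch, not the Memory Decoder implementation: the function name `interpolate`, the weight `lam`, and its default value are assumptions introduced here for illustration.

```python
def interpolate(p_llm, p_mem, lam=0.3):
    """Blend two next-token distributions over the same vocabulary.

    p_llm: probabilities from the base language model.
    p_mem: probabilities from the small memory/retrieval decoder.
    lam:   weight on the memory distribution (hypothetical default).
    """
    if len(p_llm) != len(p_mem):
        raise ValueError("distributions must cover the same vocabulary")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # Convex combination, so the result is still a valid distribution.
    return [lam * m + (1.0 - lam) * b for b, m in zip(p_llm, p_mem)]


# Toy two-token vocabulary with invented probabilities.
mixed = interpolate([0.5, 0.5], [1.0, 0.0], lam=0.5)
print(mixed)
```

Because the blend happens only at the output distribution, the base model's parameters are untouched, which is what lets a small decoder be attached to different base models in this style of design.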

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs (2.54 match)
From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning (2.53 match)
Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor (2.41 match)
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning (1.71 match)
CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning (1.64 match)
Can Large Language Models Reason and Optimize Under Constraints? (0.89 match)
Generalization through Memorization: Nearest Neighbor Language Models (0.88 match)
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training (0.88 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher evaluating whether adjusted compression constraints have shifted since late 2024. The question: Does adjusted compression performance (bits per token + model size cost) still degrade as models scale, or have newer architectures, training methods, inference strategies, or evaluation practices since relaxed this tradeoff?

What a curated library found — and when (dated claims, not current truth):
Findings span 2019–2026; note these are perishable claims:

• Raw compression improves with scale, but adjusted compression (parameter cost + encoded bits) reverses past a point; savings in encoded bits grow more slowly than the parameter bill (~2023–2024).
• Capability plateaus around 55–60% on constraint-satisfaction tasks *regardless of parameter count*, and instruction-following degrades predictably with density—both unrelated domains showing the same scaling limit (~2024–2025).
• LLMs already compress aggressively, discarding fine-grained distinctions humans preserve; extra parameters chase the long tail at disproportionate cost (~2025).
• Architecture and inference-time compute (depth over width, retrieval distillation, test-time reasoning) emerge as cheaper levers than parameter scaling (~2024–2025).
• Recent post-training methods (RL vs. SFT) and long-horizon reasoning tasks may reveal whether the adjusted curve has shifted (~2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2309.10668 (2023) — Language Modeling is Compression
• arXiv:2402.14905 (2024) — MobileLLM: sub-billion scaling
• arXiv:2505.17117 (2025) — Tokens to Thoughts: compression vs. meaning
• arXiv:2603.23004 (2026) — Reasoning and optimization under constraints

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether post-2024 models (e.g., o1, o3, newer dense/MoE variants), new training paradigms (RL post-training, test-time scaling), orchestration (multi-turn, tool use, agentic workflows), or evaluation shifts (long-horizon tasks, constraint reasoning) have since relaxed or overturned it. Separate the durable question (still open) from the perishable limitation (resolved). Cite what resolved it plainly.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months (late 2025 onward). Where does the adjusted-compression penalty *not* appear, or appear differently?

(3) Propose 2 research questions that assume the regime *has* moved—e.g., "Does test-time compute delegation (o3-style) make parameter scaling rational again under adjusted-compression accounting?" or "Can RL post-training escape the compression-vs.-generalization tradeoff?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
