What empirical evidence supports the Learning Law on real language models?
The corpus contains no paper named 'the Learning Law,' so if you have a specific named result in mind, it isn't here. I'll treat the question as: what hard empirical evidence exists for lawful, predictable patterns in how LLMs actually learn? On that reading, the corpus has something more useful for a curious reader: several pieces of measured evidence that learning in LLMs follows predictable, quantifiable patterns rather than magic.
The sharpest example is a measured phase transition. One study finds GPT-family models memorize their training data up to a fixed capacity, roughly 3.6 bits per parameter, and only once that capacity fills does 'grokking' kick in: the abrupt switch from memorizing to genuinely generalizing (see 'When do language models stop memorizing and start generalizing?'). That's about as close to a 'learning law' as the corpus offers: a number you can measure, a threshold you can predict, and a regime change you can watch happen.
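To make the capacity figure concrete, here is a minimal arithmetic sketch. It assumes only the approximate 3.6 bits-per-parameter value quoted above; the function names, the 125M-parameter model, and the `dataset_bits` input are hypothetical and chosen purely for illustration.

```python
# Back-of-the-envelope capacity arithmetic. The 3.6 bits-per-parameter figure
# is the approximate value quoted above; everything else is illustrative.
BITS_PER_PARAM = 3.6


def capacity_bits(n_params: int, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimated memorization capacity of a model, in bits."""
    return n_params * bits_per_param


def capacity_is_full(n_params: int, dataset_bits: float) -> bool:
    """True once the data's information content meets or exceeds capacity.

    `dataset_bits` is a hypothetical input: measuring the information
    content of a real dataset is its own problem and is not shown here.
    """
    return dataset_bits >= capacity_bits(n_params)


if __name__ == "__main__":
    n = 125_000_000  # a hypothetical 125M-parameter model
    bits = capacity_bits(n)
    print(f"{n:,} params -> ~{bits / 8 / 1e6:.2f} MB of memorization capacity")
```

Under these assumptions a 125M-parameter model would hold about 450 million bits, or roughly 56 MB, before the threshold described above is reached. The sketch says nothing about how sharply the transition happens; it only shows where the estimated ceiling sits.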
But the corpus also documents where learning is lawfully bounded, with limits that hold no matter how you train. Self-improvement is formally capped by a 'generation-verification gap': a model can't reliably fix itself without something external to check the fix, so metacognition alone hits a hard ceiling (see 'What stops large language models from improving themselves?'). And what looks like learned reasoning often isn't: RL-fine-tuned models (even GRPO-trained ones) collapse on slightly out-of-distribution variants, revealing that the 'learning' sharpened memorized templates rather than installing a procedure (see 'Do fine-tuned language models actually learn optimization procedures?'). A related result shows models reason by semantic association, not symbolic logic: decouple meaning from the rules and performance falls apart, so what's learned is tied to training-distribution semantics (see 'Do large language models reason symbolically or semantically?').
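The role of the external check can be sketched as a simple accept-only-if-verified filter. This is a toy illustration, not the formal definition of the generation-verification gap from the cited note: `first_verified`, the proposal list, and the multiplication verifier are all hypothetical stand-ins.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def first_verified(candidates: Iterable[T], verify: Callable[[T], bool]) -> Optional[T]:
    """Return the first candidate the external verifier accepts, else None.

    The generator's own confidence plays no part: a candidate counts as a
    fix only if the separate `verify` check passes.
    """
    for candidate in candidates:
        if verify(candidate):
            return candidate
    return None


# Toy stand-ins (hypothetical): a "model" proposes answers to 17 * 23, and an
# external verifier recomputes the product independently of the proposals.
proposals = [381, 401, 391]
accepted = first_verified(proposals, lambda x: x == 17 * 23)

if __name__ == "__main__":
    print("accepted:", accepted)
```

The design point is that the verifier is a separate function the generator cannot influence. If `verify` were replaced by the model grading its own output, the loop would have no independent signal to stop on, which is the situation the note describes as a hard ceiling.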
There's also empirical evidence that learning behavior is *predictable from first principles*. Treating an LLM as an autoregressive probability machine lets researchers forecast which tasks it will fail (low-probability targets like reversing the alphabet) before running them, and the predictions held (see 'Can we predict where language models will fail?'). On the architecture side, scaling isn't a single uniform law either: for sub-billion-parameter models, depth beats width, contradicting the balanced-scaling prescription (see 'Does depth matter more than width for tiny language models?'), while latent-thought models open scaling dimensions entirely separate from parameter count (see 'Can latent thought vectors scale language models beyond parameters?').
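The 'autoregressive probability machine' framing comes down to the chain rule: a sequence's probability is the product of its per-token conditional probabilities. The sketch below shows that arithmetic only. The per-token probabilities are made-up numbers, not measurements from any model, and the variable names are hypothetical.

```python
import math
from typing import Sequence


def sequence_log_prob(token_probs: Sequence[float]) -> float:
    """Log-probability of a sequence under the chain rule.

    Each entry is P(token_t | tokens before t); the sequence probability is
    their product, so the log-probability is the sum of their logs.
    """
    return sum(math.log(p) for p in token_probs)


# Made-up per-token probabilities, for illustration only (not measured).
forward_alphabet = [0.9, 0.9, 0.9, 0.9]   # familiar order: each next letter is likely
backward_alphabet = [0.3, 0.3, 0.3, 0.3]  # rare order: each next letter is unlikely

if __name__ == "__main__":
    print("forward :", sequence_log_prob(forward_alphabet))
    print("backward:", sequence_log_prob(backward_alphabet))
```

Because the per-token terms multiply, even a modest drop in each conditional probability compounds quickly over a long target. That compounding is one way to read the prediction that logically simple but rare outputs are systematically harder.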
The thing you didn't know you wanted to know: the most law-like, reproducible learning result in this collection isn't about getting smarter at all — it's the memorization-capacity threshold, a fixed per-parameter constant that has to *fill up* before generalization can even begin. Learning here looks less like steady improvement and more like a container filling until it tips over.
Sources (7 notes)
GPT-family models have a measurable memorization capacity of approximately 3.6 bits-per-parameter. When this capacity fills, a phase transition triggers grokking—the shift from memorization to genuine generalization. This capacity is a property of individual models, not training algorithms.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed these predictions on tasks such as reciting the alphabet backwards and counting letters.
MobileLLM reports accuracy gains of 2.7% and 4.3% over prior state-of-the-art models at the 125M and 350M scales, respectively, with deep-and-thin architectures that compose abstract concepts through layers rather than spreading parameters across width.
Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.