INQUIRING LINE

There may be a measurable law governing how AI models learn — memorize up to a hard limit, then suddenly generalize.

What empirical evidence supports the Learning Law on real language models?

This reads as a question about whether a measured, law-like regularity governs how real language models learn. It's worth being direct: the collection has no paper called 'the Learning Law,' so if you have a specific named result in mind, it isn't in this corpus. I'll treat the question as: what hard empirical evidence exists for lawful, predictable patterns in how LLMs actually learn? On that reading the corpus has something more useful for a curious reader: several pieces of measured evidence that learning in LLMs follows predictable, quantifiable patterns rather than magic.

The sharpest example is a measured phase transition. One study finds GPT-family models memorize their training data up to a fixed capacity, roughly 3.6 bits per parameter, and only once that capacity fills does 'grokking' kick in, the abrupt switch from memorizing to genuinely generalizing (see: When do language models stop memorizing and start generalizing?). That's about as close to a 'learning law' as the corpus offers: a number you can measure, a threshold you can predict, and a regime change you can watch happen.
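To make the capacity figure concrete, here is a back-of-the-envelope sketch in Python. It assumes only the roughly 3.6 bits-per-parameter number quoted above; the function names and the idea of passing in a dataset's information content as `dataset_bits` are hypothetical conveniences for illustration, not the study's method.

```python
# Rough arithmetic only: 3.6 bits/parameter is the approximate figure quoted
# in the text; everything else here is an illustrative assumption.
BITS_PER_PARAM = 3.6


def capacity_bits(n_params: int, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimated total memorization capacity of a model, in bits."""
    return n_params * bits_per_param


def capacity_filled(n_params: int, dataset_bits: float) -> bool:
    """True when the dataset carries more bits than the estimated capacity.

    `dataset_bits` is a hypothetical input: how to measure a dataset's
    information content is outside the scope of this sketch.
    """
    return dataset_bits > capacity_bits(n_params)


def bits_to_megabytes(bits: float) -> float:
    """Convert bits to megabytes (1 MB = 8e6 bits)."""
    return bits / 8 / 1e6


for n in (125_000_000, 1_000_000_000):
    cap = capacity_bits(n)
    print(f"{n:,} params: about {cap:.2e} bits ({bits_to_megabytes(cap):.1f} MB)")
```

Under this estimate, capacity grows linearly with parameter count, so the point at which a given dataset exceeds it can be worked out before training rather than discovered afterwards.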

But the corpus also documents where learning is lawfully bounded, with limits that hold no matter how you train. Self-improvement is formally capped by a 'generation-verification gap': a model can't reliably fix itself without something external to check the fix, so metacognition alone hits a hard ceiling (see: What stops large language models from improving themselves?). And what looks like learned reasoning often isn't: RL-fine-tuned models (even GRPO-trained ones) collapse on slightly out-of-distribution variants, revealing that the 'learning' sharpened memorized templates rather than installing a procedure (see: Do fine-tuned language models actually learn optimization procedures?). A related result shows models reason by semantic association, not symbolic logic: decouple meaning from the rules and performance falls apart, so what's learned is tied to training-distribution semantics (see: Do large language models reason symbolically or semantically?).

There's also empirical evidence that learning behavior is *predictable from first principles*. Treating an LLM as an autoregressive probability machine lets researchers forecast which tasks it will fail, such as low-probability targets like reversing the alphabet, before running them, and the predictions held (see: Can we predict where language models will fail?). On the architecture side, scaling isn't a single uniform law either: for sub-billion-parameter models, depth beats width, contradicting the balanced-scaling prescription (see: Does depth matter more than width for tiny language models?), while latent-thought models open scaling dimensions entirely separate from parameter count (see: Can latent thought vectors scale language models beyond parameters?).
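The intuition behind the failure forecast can be sketched with the chain rule. An autoregressive model scores a sequence as the sum of per-token conditional log-probabilities, so an output built from individually unlikely continuations gets a low total score even when the task is logically simple. The per-token probabilities below are made-up illustrative numbers, not measurements from any model, and `sequence_log_prob` is a hypothetical helper.

```python
import math


def sequence_log_prob(token_probs):
    """Chain rule: log P(x_1..x_n) = sum over i of log P(x_i | x_<i)."""
    return sum(math.log(p) for p in token_probs)


# Hypothetical conditional probabilities for each of the 25 letter-to-letter
# steps. Forward order ("a b c ...") is assumed to be a common continuation,
# backward order ("z y x ...") a rare one.
forward_alphabet = [0.9] * 25
backward_alphabet = [0.3] * 25

lp_forward = sequence_log_prob(forward_alphabet)
lp_backward = sequence_log_prob(backward_alphabet)

print(f"forward alphabet log-probability:  {lp_forward:.2f}")
print(f"backward alphabet log-probability: {lp_backward:.2f}")
```

Because the per-step gaps add up in log space, a modest per-token disadvantage compounds into a large difference over the whole sequence, which is one way to read the prediction that low-probability targets are systematically harder.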

The thing you didn't know you wanted to know: the most law-like, reproducible learning result in this collection isn't about getting smarter at all — it's the memorization-capacity threshold, a fixed per-parameter constant that has to *fill up* before generalization can even begin. Learning here looks less like steady improvement and more like a container filling until it tips over.

Sources (7 notes)

When do language models stop memorizing and start generalizing?

GPT-family models have a measurable memorization capacity of approximately 3.6 bits-per-parameter. When this capacity fills, a phase transition triggers grokking—the shift from memorization to genuine generalization. This capacity is a property of individual models, not training algorithms.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can latent thought vectors scale language models beyond parameters?

Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey (1.73 match)
Large Language Diffusion Models (1.70 match)
Bigger is not always better: The importance of human-scale language modeling for psycholinguistics (1.69 match)
Scaling Laws for Neural Language Models (1.68 match)
Scalable Language Models with Posterior Inference of Latent Thought Vectors (0.93 match)
Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models (0.90 match)
How much do language models memorize? (0.89 match)
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (0.88 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about learning regularity in language models. The question remains open: do real LLMs obey predictable, law-like learning patterns—and if so, which constraints have since loosened?

What a curated library found—and when (dated claims, not current truth): Findings span 2023–2025.
• Memorization saturates at ~3.6 bits/parameter before grokking (generalization) begins—a hard phase transition (~2025).
• Self-improvement hits a generation-verification gap: models cannot reliably self-correct without external validation (~2024).
• RL fine-tuning (GRPO, etc.) resharpens memorized templates rather than installing robust procedures; out-of-distribution variants collapse (~2025).
• LLMs reason by semantic association, not symbolic logic; decouple meaning from rules and performance degrades (~2023).
• Task failure is predictable from autoregressive-loss first principles before empirical test (~2023).

Anchor papers (verify; mind their dates):
• arXiv:2305.14825 (2023-05): In-Context Semantic Reasoners
• arXiv:2412.02674 (2024-12): Mind the Gap (self-improvement limits)
• arXiv:2504.07912 (2025-04): Echo Chamber (RL amplifies pretraining)
• arXiv:2505.24832 (2025-05): Memorization capacity thresholds

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For memorization capacity, self-improvement ceilings, RL-template collapse, and semantic-association bounds: have newer training paradigms (constitutional AI, mixture-of-experts, test-time scaling, retrieval-augmented generation), model scale, or better evals since *relaxed* or *overturned* any? Separate the durable question (likely still open: what is learnable?) from the perishable limitation (possibly resolved: only memorization was learnable). Cite what resolved it.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months—especially anything claiming robust generalization, symbolic reasoning, or self-correction *without* external validation.
(3) **Propose 2 research questions** that assume the regime may have shifted: e.g., does post-pretraining intervention (instruction-tuning, RL, or agent scaffolding) break the memorization-phase-transition model? Can latent-thought models escape the semantic-association bound?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
