INQUIRING LINE

Why do optimal learning dynamics improve scaling law coefficients specifically?

This explores why the *way* a model is trained (its learning dynamics) shifts the constants inside a scaling law, moving the whole curve, rather than simply requiring more data or parameters to climb the same curve.


This explores why optimizing how a model learns improves the *coefficients* of a scaling law — the multipliers and exponents that govern how loss falls as you add compute — instead of just buying more scale. The honest starting point: the corpus doesn't contain a single paper that proves this claim head-on, but several notes converge on the territory and explain the mechanism. The key reframe is that a scaling law isn't a law of nature. Its coefficients aren't fixed constants — they encode how efficiently a given training recipe converts compute into capability. Improve the dynamics and you change the constant, which looks like getting more out of the same budget.
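To make the "change the constant" idea concrete, here is a minimal sketch. It assumes a simple power-law form, L(C) = a * C^(-alpha) + irreducible, which the notes do not specify; the function names and the numbers in the usage lines are hypothetical and chosen only for illustration. It shows how a lower coefficient at a fixed exponent can be restated as a compute multiplier.

```python
import math


def loss(compute, a, alpha, irreducible=0.0):
    """Assumed power-law scaling curve: L(C) = a * C**(-alpha) + irreducible."""
    return a * compute ** (-alpha) + irreducible


def compute_multiplier(a_base, a_better, alpha):
    """Factor k by which the baseline recipe must scale compute to match the
    better recipe, assuming both share the same exponent and irreducible term.

    Solves a_base * (k * C)**(-alpha) == a_better * C**(-alpha) for k.
    """
    return (a_base / a_better) ** (1.0 / alpha)


if __name__ == "__main__":
    # Hypothetical values: a 10% lower coefficient with a shallow exponent.
    k = compute_multiplier(a_base=10.0, a_better=9.0, alpha=0.1)
    print(f"baseline needs {k:.2f}x the compute to reach the same loss")
```

With a shallow exponent, even a modest drop in the coefficient corresponds to a large multiple of compute, which is why a better recipe can look like getting more out of the same budget.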

The clearest lens comes from the idea that deep learning theory is consolidating around 'learning mechanics': modeling training the way physics models gases, prioritizing average-case behavior, training dynamics, and aggregate statistics over worst-case bounds (see "Can deep learning theory unify around training dynamics?"). Under this frame, scaling coefficients stop being mysterious givens and become *outputs* of the trajectory a model takes through training. If the dynamics are the thing that produces the coefficient, then better dynamics (smoother optimization, less wasted movement, preserved capacity to keep learning) should produce a better coefficient.

You can see this concretely in work that folds architectural choices directly into the scaling law. By adding variables like hidden size, MLP-to-attention ratio, and attention grouping (GQA configuration), the optimized models turn the *same* training budget into up to 2.1% higher accuracy and 42% greater inference throughput than LLaMA-3.2 (see "Can architecture choices improve inference efficiency without sacrificing accuracy?"). Nothing was scaled up; the curve itself moved, because structural and training choices change the constant in front of it. That's the coefficient story in miniature.
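One way to check whether a curve "moved" rather than being climbed is to fit the coefficient and exponent separately and compare them across recipes. The sketch below assumes a pure power law with no irreducible-loss term, so that a straight-line fit in log-log space applies; the function name `fit_power_law` and the sample values are hypothetical, not taken from the cited work.

```python
import math


def fit_power_law(compute, losses):
    """Least-squares fit of log(L) = log(a) - alpha * log(C).

    Returns (a, alpha). Assumes every compute and loss value is positive
    and that the data follow L = a * C**(-alpha) with no additive floor.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


if __name__ == "__main__":
    # Synthetic points generated from a known curve, for illustration only.
    budgets = [1e18, 1e19, 1e20, 1e21]
    observed = [5.0 * c ** -0.2 for c in budgets]
    a, alpha = fit_power_law(budgets, observed)
    print(f"coefficient={a:.3f}, exponent={alpha:.3f}")
```

If two recipes yield similar exponents but different coefficients, the improvement is a shift of the curve; if the fitted values agree and only the budget differs, the model has just moved along it.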

The corpus also suggests *why* good dynamics matter so much: they protect the model's ability to keep improving. Staying close to the base distribution (low KL drift) preserves 'plasticity': models trained that way keep adapting, where parameter-only methods stall when the task shifts (see "Does staying close to the base model preserve learning ability?"). And reinforcement learning turns out to update only 5–30% of parameters, in structured, nearly full-rank subnetworks, rather than thrashing the whole network (see "Does reinforcement learning update only a small fraction of parameters?"). Both hint that 'optimal' dynamics are disciplined ones: they move the right things and leave capacity intact, which is exactly what keeps the scaling curve steep instead of flattening early.
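The two quantities in this paragraph, drift from the base distribution and the fraction of parameters an update touches, can both be measured directly. The sketch below is illustrative only: it treats distributions and parameters as plain Python lists, and the KL direction (tuned relative to base) and the change tolerance are assumptions, not the cited papers' exact definitions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists.

    Terms with p_i == 0 contribute nothing; q_i must be positive wherever
    p_i is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def update_fraction(base_params, tuned_params, tol=0.0):
    """Fraction of parameters whose value changed by more than tol."""
    changed = sum(
        1 for b, t in zip(base_params, tuned_params) if abs(b - t) > tol
    )
    return changed / len(base_params)


if __name__ == "__main__":
    # Hypothetical next-token distributions for a base and a tuned model.
    base_dist = [0.5, 0.3, 0.2]
    tuned_dist = [0.6, 0.25, 0.15]
    print("drift (assumed direction, tuned vs base):",
          kl_divergence(tuned_dist, base_dist))

    # Hypothetical flattened weights: one of five entries was updated.
    base_w = [0.1, -0.4, 0.7, 0.0, 0.2]
    tuned_w = [0.1, -0.4, 0.9, 0.0, 0.2]
    print("fraction updated:", update_fraction(base_w, tuned_w))
```

In practice the drift would be averaged over many prompts and the update fraction computed over real weight tensors; the lists here only stand in for those.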

The most surprising adjacent finding is that better learning schemes don't just shift one curve: they can open *new* scaling axes entirely. Latent-thought models couple fast local learning with slow global learning and get better sample and parameter efficiency, creating a scaling dimension independent of parameter count (see "Can latent thought vectors scale language models beyond parameters?"), and deep-research agents reveal a search-budget axis that scales like reasoning tokens do (see "Do search steps follow the same scaling rules as reasoning tokens?"). The takeaway a curious reader might not expect: 'improving the coefficients' and 'finding a new axis to scale on' are two faces of the same insight, namely that the dynamics of learning, not raw size, are what the scaling laws are quietly measuring.


Sources (6 notes)

Can deep learning theory unify around training dynamics?

Research shows learning mechanics is consolidating as a unified frame for deep learning, modeled on classical and statistical mechanics. It prioritizes average-case predictions, training dynamics, and aggregate statistics over worst-case bounds, mirroring how physics addresses macroscopic systems.

Can architecture choices improve inference efficiency without sacrificing accuracy?

Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Can latent thought vectors scale language models beyond parameters?

Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.

Do search steps follow the same scaling rules as reasoning tokens?

Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher re-testing whether scaling law coefficients genuinely improve under optimized learning dynamics, or whether claimed improvements dissolve under newer methods, models, or evaluation.

What a curated library found — and when (dated claims, not current truth): Findings span 2023–2026, tracking learning mechanics as the frame that explains coefficient shifts.
• Scaling coefficients are outputs of training *dynamics*, not fixed constants; architectural + training choices moved accuracy +2.1% and throughput +42% in same budget (2025–10).
• Low KL drift from base preserves 'plasticity' — models keep adapting where parameter-only methods stall, maintaining steep scaling curves (2026–05).
• RL updates only 5–30% of parameters in structured, nearly full-rank subnetworks, leaving capacity intact (2025–05).
• Latent-thought models introduce scaling axes independent of parameter count, decoupling fast local from slow global learning (2025–02).
• Deep-research agents follow search-budget scaling laws analogous to reasoning-token scaling (2025–06).

Anchor papers (verify; mind their dates):
• arXiv:2510.18245 (2025–10): Scaling Laws Meet Model Architecture
• arXiv:2505.11711 (2025–05): RL Finetunes Small Subnetworks
• arXiv:2605.12484 (2026–05): Learning, Fast and Slow
• arXiv:2506.18959 (2025–06): Deep Research & Agentic Search

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o3, Claude 4, Gemini 3), training methods (DPO, online RL, mixture-of-experts), orchestration (multi-agent, persistent memory, tree-search), or evaluation harnesses have since relaxed or overturned it. Separate the durable claim — learning dynamics shape coefficients — from perishable limitations (e.g., specific plasticity thresholds, parameter sparsity bounds). Cite what resolved it; flag where constraints still hold.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — papers arguing scaling coefficients are *NOT* primarily shaped by dynamics, or that newer scaling regimes bypass the coefficient story entirely.

(3) Propose 2 research questions that assume the regime may have moved: e.g., Do emergent multi-scale orchestration (nested agents, hierarchical memory) create *composite* scaling laws that make the single-coefficient model obsolete? Can inverse-scaling or phase-transition discoveries in 2026+ models show coefficient improvability has hard limits tied to model width or depth?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
