INQUIRING LINE


Not every AI skill needs a massive model — reasoning scales with size, but helpfulness can be trained cheaply.

What capabilities actually require massive scale versus specialized training regimes?

This line of inquiry asks which AI abilities genuinely depend on sheer model size and which are better produced (or even only producible) through targeted training methods such as fine-tuning, RL, or smarter data. The corpus suggests the dividing line falls between deep capabilities and surface behaviors, not where you'd expect.

Read as a question about where scale is load-bearing versus where it's wasted, the corpus draws a surprisingly clean line: scale matters most for the deep substrate (factual knowledge and genuine reasoning), while behaviors, style, and helpfulness are cheaply purchased through specialized training. A 12-skill decomposition found that reasoning and knowledge keep improving as models grow, but metacognition saturates around 7B parameters and stylistic competence plateaus even earlier; tellingly, smaller open models can imitate the *form* of a frontier model's answers while failing at the reasoning underneath (see "Do all AI skills improve equally as models scale?"). That split is echoed at the training-regime level: scaling pretraining buys factuality (knowledge stored in lower layers), while scaling fine-tuning buys helpfulness (behavior expressed in upper layers). These are two independent dials, not one (see "Do pretraining and fine-tuning scale independently in language models?").

But before crediting scale with too much, it's worth knowing that some of its most famous gifts may be illusions. The dramatic 'emergent abilities' that seem to switch on at a certain size largely vanish when you measure with continuous metrics instead of pass/fail ones: the underlying improvement was smooth all along, and the cliff was an artifact of how we scored it (see "Are LLM emergent abilities real or measurement artifacts?"). So the question 'what requires massive scale?' partly dissolves into 'what did we only *think* required scale because of our yardstick?'
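The scoring effect can be shown with a few lines of arithmetic. The sketch below is illustrative only: it assumes a hypothetical per-token accuracy that rises linearly across eight model sizes, and it assumes tokens are independent, so that an all-or-nothing exact-match score on a 20-token answer equals the per-token accuracy raised to the 20th power. None of the numbers come from the cited work.

```python
# Toy model: the same smooth improvement viewed through two metrics.
# All values are hypothetical; token independence is an assumption.

N_SCALES = 8      # number of model sizes, smallest to largest
ANSWER_LEN = 20   # tokens that must all be right for exact match


def per_token_accuracy(scale_index: int, n_scales: int = N_SCALES) -> float:
    """Continuous metric: rises linearly from 0.50 to 0.99 across scales."""
    return 0.5 + 0.49 * scale_index / (n_scales - 1)


def exact_match(p_token: float, answer_len: int = ANSWER_LEN) -> float:
    """Pass/fail metric: every token of the answer must be correct."""
    return p_token ** answer_len


token_acc = [per_token_accuracy(i) for i in range(N_SCALES)]
em_score = [exact_match(p) for p in token_acc]

for i, (p, em) in enumerate(zip(token_acc, em_score)):
    print(f"scale {i}: per-token={p:.3f}  exact-match={em:.4f}")
```

Under these assumptions the per-token column climbs in equal steps, while the exact-match column stays near zero for most sizes and then rises steeply at the largest ones, which is the shape usually read as emergence.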

The more interesting frontier is what specialized training can and can't manufacture. Reinforcement learning turns out to be domain-conditional: for standard reasoning it mostly *activates* abilities already latent in the base model, but for complex multi-step planning it can generate genuinely novel strategies the base model can't reach even with heavy sampling (see "Does reinforcement learning create new reasoning abilities or activate existing ones?"). That maps onto a consistent two-phase dynamic: RL first consolidates procedural execution, then hits a wall where strategic planning becomes the bottleneck (see "Does RL training follow a predictable two-phase learning sequence?"). In other words, specialized regimes can *create* the high-level planning capability that scale alone leaves dormant.

The catch is that specialization is a narrowing tool, not an additive one. Domain fine-tuning reliably raises in-domain accuracy while degrading general reasoning quality, and RL improves domain reasoning by *pruning* rather than adding; every technique has a sweet spot past which it hurts (see "How do you specialize LLMs without losing general reasoning?"). RL also quietly collapses the diversity of formats a pretrained model knows, converging on a single dominant style within the first epoch (see "Does RL training collapse format diversity in pretrained models?"), and training on samples that are too hard actively corrupts existing skills by rewarding degenerate shortcuts (see "Do overly hard RLVR samples actually harm model capabilities?"). Order matters too: training structured tasks before creative ones prevents entropy collapse from damaging open-ended ability (see "Does training order reshape how models handle different task types?").

The quiet punchline is that the scale-versus-regime tradeoff isn't fixed: clever training can substitute for raw size. Augmenting pretraining data with generated reasoning traces delivers a 3x data-efficiency gain and gives a 3B model an outsized reasoning bump, essentially importing test-time compute scaling into training (see "Can training data augmentation match test-time compute scaling benefits?"). And once you leave the lab, raw capability stops being the sole constraint: agentic systems with strong benchmarks complete only ~30% of real workplace tasks, where standardization, trust, and interaction design decide success as much as model performance does (see "What breaks when specialized AI models reach real users?"). So the honest answer is that deep knowledge and reasoning depth track scale, while planning and behavior are forgeable through specialized regimes; beyond a point, neither is what's actually holding the system back.

Sources 11 notes

Do all AI skills improve equally as models scale?

FLASK's 12-skill decomposition reveals metacognition saturates at 7B parameters while logical efficiency plateaus at 30B, but reasoning and knowledge skills improve continuously. Open-source models successfully imitate surface-level style but fail at reasoning—confirming that distillation copies form not substance.

Do pretraining and fine-tuning scale independently in language models?

Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.
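One way to picture the two dials is as arithmetic on next-token log-probabilities. The sketch below assumes an EFT-style combination: the behavioral shift learned by a small fine-tuned model (its log-probabilities minus those of its own small base) is added to a larger base model's log-probabilities, and the result is renormalized. The function names and toy logits are hypothetical, and the formula is an assumption about the method rather than something stated in this note.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one next-token distribution."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def emulate_fine_tuning(
    base_large: List[float],
    base_small: List[float],
    tuned_small: List[float],
) -> List[float]:
    """Combine large-scale pretraining with a small-scale fine-tuning delta.

    Assumed form: log p = log p_base_large + (log p_tuned_small - log p_base_small),
    followed by renormalization. Inputs are raw logits over a shared vocabulary.
    """
    lp_large = log_softmax(base_large)
    lp_small = log_softmax(base_small)
    lp_tuned = log_softmax(tuned_small)
    combined = [bl + (ts - bs) for bl, bs, ts in zip(lp_large, lp_small, lp_tuned)]
    return log_softmax(combined)


# Toy three-token vocabulary; values are made up for illustration.
large_base_logits = [2.0, 0.5, -1.0]
small_base_logits = [1.0, 1.0, 0.0]
small_tuned_logits = [0.0, 2.0, 0.0]   # fine-tuning favors token 1

emulated = emulate_fine_tuning(large_base_logits, small_base_logits, small_tuned_logits)
print([round(math.exp(lp), 3) for lp in emulated])
```

If the fine-tuned and base small models agree, the delta is zero and the large base distribution is returned unchanged, which is the sense in which the knowledge dial and the behavior dial can be turned separately in this picture.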

Are LLM emergent abilities real or measurement artifacts?

Sharp, unpredictable capability transitions vanish when using continuous metrics instead of discontinuous ones. The same model outputs show smooth predictable improvement with scale, suggesting emergence is a measurement choice rather than a real behavioral change.

Does reinforcement learning create new reasoning abilities or activate existing ones?

For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.


How do you specialize LLMs without losing general reasoning?

Research shows supervised fine-tuning raises domain benchmarks but degrades reasoning by 38%, while reinforcement learning prunes inaccurate knowledge rather than adding capability. Every specialization technique has a domain-specific optimal point beyond which performance declines.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
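The normalization effect is easy to see numerically. The sketch below assumes the GRPO-style form of group-relative normalization, where each sampled trajectory's advantage is its reward minus the group mean, divided by the group standard deviation. The group size of eight, the binary rewards, and the function name are illustrative assumptions, not details from the study.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Advantage of each trajectory relative to its own sampled group.

    Assumed form: (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Nearly impossible prompt: one accidental success out of eight samples.
hard_group = [1.0] + [0.0] * 7
# Moderately hard prompt: four successes out of eight samples.
moderate_group = [1.0] * 4 + [0.0] * 4

hard_adv = group_relative_advantages(hard_group)
moderate_adv = group_relative_advantages(moderate_group)

print(f"advantage of the lone success:   {hard_adv[0]:.2f}")
print(f"advantage of a typical success:  {moderate_adv[0]:.2f}")
```

Under these assumptions the single lucky success on the near-impossible prompt receives a much larger advantage than any success on the moderate prompt, so whatever that trajectory happened to do, sound or not, is reinforced most strongly.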

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Can training data augmentation match test-time compute scaling benefits?

Augmenting pretraining data with LLM-generated reasoning traces improves data efficiency 3x and reasoning benchmark performance 10%+ for 3B models. Harder tokens automatically receive longer traces, creating a natural compute-allocation mechanism analogous to test-time scaling.

What breaks when specialized AI models reach real users?

Agentic systems complete only 30% of real workplace tasks despite strong capability, while routing decisions outperform individual frontier models and generative interfaces outperform chat 70% of the time. Success depends on standardization, trust, and interaction design as much as raw model performance.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about which LLM capabilities require massive scale versus specialized training. The question remains open: as of now, what is load-bearing versus wasted investment?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat all as perishable checkpoints:

• Emergent abilities (dramatic cliff-like gains at specific scales) are largely metric artifacts; underlying improvement is smooth (2023, arXiv:2304.15004).
• Factual knowledge and reasoning improve consistently with scale; metacognition saturates ~7B params, style much earlier; a 12-skill decomposition showed independent scaling curves (2023, arXiv:2307.10928).
• Scaling pretraining buys factuality; scaling fine-tuning buys helpfulness — two orthogonal dials (2023–2024).
• RL activates latent abilities in standard reasoning but can generate *novel* strategies for multi-step planning; two-phase dynamic: procedural consolidation, then strategic planning bottleneck (2024–2025, arXiv:2412.16849, arXiv:2507.14783).
• RL post-training converges on single dominant style within epoch; hard samples induce degenerate shortcuts; task order (structured before creative) prevents entropy collapse (2025, arXiv:2504.07912, arXiv:2605.28388).
• Thinking-augmented pretraining: 3x data efficiency; 3B model achieves outsized reasoning via reasoning-trace augmentation (2025, arXiv:2509.20186).
• Real-world: agentic systems complete ~30% of workplace tasks; standardization and interaction design, not raw capability, are bottlenecks (2025–2026).

Anchor papers (verify; mind their dates):
– arXiv:2304.15004 (Apr 2023): emergent abilities as metric artifacts
– arXiv:2307.10928 (Jul 2023): skill-level scaling decomposition
– arXiv:2509.20186 (Sep 2025): thinking-augmented pretraining
– arXiv:2605.28388 (May 2026): sample difficulty mechanistic interpretation

Your task:

(1) RE-TEST EACH CONSTRAINT. For "emergent abilities vanish under continuous metrics" — have evaluations matured in the last 6 months to *confirm* or *refute* this? For "RL activates vs. creates" — does newer evidence (e.g., arXiv:2512.07783, arXiv:2605.06614 on skill curation and self-evolution) collapse or sharpen that distinction? For "3x data efficiency from reasoning traces" — has follow-up work reproduced, scaled, or refined this? Separate: Is the underlying tension (scale vs. specialization) still unresolved, or has a recent paper *unified* them?

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months. If a 2026 paper argues RL *does* require scale, or that the two-phase dynamic doesn't hold at frontier scale, flag it plainly.

(3) Propose 2 research questions that ASSUME the regime may have shifted: e.g., "Do scaling laws for reasoning change under curriculum-aware RL?" or "Can agentic deployment reach >50% task completion if interaction design is co-optimized with model training?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
