What capabilities actually require massive scale versus specialized training regimes?
This note explores which AI abilities genuinely depend on model size and which are better produced, or can only be produced, through targeted training methods such as fine-tuning, RL, or better data. The corpus suggests the dividing line runs between deep capabilities and surface behaviors.
Read as a question about where scale is load-bearing versus where it's wasted, the corpus draws a surprisingly clean line: scale matters most for the deep substrate (factual knowledge and genuine reasoning), while behaviors, style, and helpfulness are cheaply purchased through specialized training. A 12-skill decomposition found that reasoning and knowledge keep improving as models grow, but metacognition saturates around 7B parameters and logical efficiency plateaus around 30B; tellingly, smaller open models can imitate the *form* of a frontier model's answers while failing at the reasoning underneath (Do all AI skills improve equally as models scale?). That split is echoed at the training-regime level: scaling pretraining buys factuality (knowledge stored in lower layers), while scaling fine-tuning buys helpfulness (behavior expressed in upper layers). These are two independent dials, not one (Do pretraining and fine-tuning scale independently in language models?).
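The "two independent dials" idea can be made concrete with log-probability arithmetic. The sketch below assumes the up-scaling form of emulated fine-tuning: take next-token log-probabilities from a large base model and add the behavioral shift that fine-tuning produced in a small model. The function names and the three-token toy logits are hypothetical, chosen only for illustration.

```python
import math

def log_softmax(logits):
    """Normalize raw logits into log-probabilities (numerically stable)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def emulated_logprobs(large_base, small_base, small_tuned):
    """Large-model knowledge plus the small model's fine-tuning delta.

    Assumption: combined = log p_large_base + (log p_small_tuned - log p_small_base),
    renormalized over the vocabulary.
    """
    lb = log_softmax(large_base)
    sb = log_softmax(small_base)
    st = log_softmax(small_tuned)
    combined = [b + (t - s) for b, s, t in zip(lb, sb, st)]
    return log_softmax(combined)

# Hypothetical logits over a 3-token vocabulary.
large_base = [2.0, 1.0, 0.0]   # large pretrained model prefers token 0
small_base = [1.0, 1.0, 0.0]   # small pretrained model
small_tuned = [1.0, 1.0, 2.0]  # fine-tuning shifted the small model toward token 2

probs = [math.exp(lp) for lp in emulated_logprobs(large_base, small_base, small_tuned)]
print([round(p, 3) for p in probs])
```

If the fine-tuned and base small models are identical, the delta is zero and the result falls back to the large base model's distribution, which is what makes the two dials separable in this sketch.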
But before crediting scale with too much, it's worth knowing that some of its most famous gifts may be illusions. The dramatic 'emergent abilities' that seem to switch on at a certain size largely vanish when you measure with continuous metrics instead of pass/fail ones: the underlying improvement was smooth all along, and the cliff was an artifact of how we scored it (Are LLM emergent abilities real or measurement artifacts?). So the question 'what requires massive scale?' partly dissolves into 'what did we only *think* required scale because of our yardstick?'
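A toy calculation shows how a pass/fail metric can manufacture a cliff. The sketch assumes, purely for illustration, that each token of a fixed-length answer is correct independently with the same probability, so exact-match accuracy is that probability raised to the answer length. The accuracy values are hypothetical, not measurements.

```python
def exact_match_rate(per_token_acc, answer_len):
    """Probability that every token is right, assuming independent token errors."""
    return per_token_acc ** answer_len

# Hypothetical per-token accuracies for successively larger models.
per_token_accs = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
answer_len = 10  # hypothetical answer length in tokens

for acc in per_token_accs:
    em = exact_match_rate(acc, answer_len)
    print(f"per-token accuracy {acc:.2f} -> exact match {em:.4f}")
```

The continuous metric rises in even steps, while the exact-match score stays near zero for the smaller models and then climbs steeply. The apparent jump here comes from the scoring rule, not from any change in the underlying trend.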
The more interesting frontier is what specialized training can and can't manufacture. Reinforcement learning turns out to be domain-conditional: for standard reasoning it mostly *activates* abilities already latent in the base model, but for complex multi-step planning it can generate genuinely novel strategies the base model can't reach even with heavy sampling (Does reinforcement learning create new reasoning abilities or activate existing ones?). That maps onto a consistent two-phase dynamic: RL first consolidates procedural execution, then hits a wall where strategic planning becomes the bottleneck (Does RL training follow a predictable two-phase learning sequence?). In other words, specialized regimes can *create* the high-level planning capability that scale alone leaves dormant.
The catch is that specialization is a narrowing tool, not an additive one. Domain fine-tuning reliably raises in-domain accuracy while degrading general reasoning quality, and RL improves domain reasoning by *pruning* rather than adding; every technique has a sweet spot past which it hurts (How do you add domain expertise without losing general reasoning?). RL also quietly collapses the diversity of formats a pretrained model knows, converging on a single dominant style within the first epoch (Does RL training collapse format diversity in pretrained models?), and training on samples that are too hard actively corrupts existing skills by rewarding degenerate shortcuts (Do overly hard RLVR samples actually harm model capabilities?). Order matters too: training structured tasks before creative ones prevents entropy collapse from damaging open-ended ability (Does training order reshape how models handle different task types?).
The quiet punchline is that the scale-versus-regime tradeoff isn't fixed: clever training can substitute for raw size. Augmenting pretraining data with generated reasoning traces delivers a 3x data-efficiency gain and gives a 3B model a 10%+ gain on reasoning benchmarks, essentially importing test-time compute scaling into training (Can training data augmentation match test-time compute scaling benefits?). And once you leave the lab, raw capability stops being the only constraint: agentic systems with strong benchmarks complete only ~30% of real workplace tasks, where standardization, trust, and interaction design matter as much as model performance (What breaks when specialized AI models reach real users?). So the honest answer is that deep knowledge and reasoning depth track scale, planning and behavior are forgeable through specialized regimes, and beyond a point neither is what's actually holding the system back.
Sources (11 notes)
FLASK's 12-skill decomposition reveals metacognition saturates at 7B parameters while logical efficiency plateaus at 30B, but reasoning and knowledge skills improve continuously. Open-source models successfully imitate surface-level style but fail at reasoning—confirming that distillation copies form not substance.
Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.
Sharp, unpredictable capability transitions vanish when using continuous metrics instead of discontinuous ones. The same model outputs show smooth predictable improvement with scale, suggesting emergence is a measurement choice rather than a real behavioral change.
For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
SFT raises domain accuracy but reduces reasoning quality, measured as a 38% loss in InfoGain. RL improves domain reasoning by pruning rather than adding capability. Every technique has a domain-specific sweet spot beyond which performance degrades.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
Augmenting pretraining data with LLM-generated reasoning traces improves data efficiency 3x and reasoning benchmark performance 10%+ for 3B models. Harder tokens automatically receive longer traces, creating a natural compute-allocation mechanism analogous to test-time scaling.
Agentic systems complete only 30% of real workplace tasks despite strong capability, while routing decisions outperform individual frontier models and generative interfaces outperform chat 70% of the time. Success depends on standardization, trust, and interaction design as much as raw model performance.