INQUIRING LINE

What structural differences emerge between early generic skills and later meta-strategy skills?

This explores how the 'early skills' a model masters first (the nuts and bolts of executing a procedure correctly) differ in shape and behavior from the 'later skills' it develops (choosing a strategy, planning, deciding when to do what) — and what the corpus says about why these are structurally distinct rather than just harder versions of the same thing.


The corpus points to a fairly consistent picture: early, generic execution skills and later meta-strategy skills aren't a smooth continuum; they're two regimes with different mechanics. The clearest evidence comes from a study of RL training across eight models, which found a two-phase dynamic: a first phase where getting the execution right is the bottleneck, followed by a second phase where strategic planning becomes the thing that actually limits performance (see: Does RL training follow a predictable two-phase learning sequence?). The structural tell is in the entropy: planning-token entropy rises as the model keeps exploring, while execution-token entropy settles down and stabilizes. So 'early generic skill' looks like convergence (one right way to execute a step), and 'later meta-strategy skill' looks like sustained branching (many possible plans, the model keeps its options open).
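To make the entropy contrast concrete, here is a minimal sketch of how per-token entropy could be averaged separately for planning and execution tokens. The role labels, the function names, and the toy distributions are all assumptions made for illustration; none of them come from the study.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy_by_role(steps):
    """Average entropy per role; `steps` is a list of (role, probs) pairs."""
    totals, counts = {}, {}
    for role, probs in steps:
        totals[role] = totals.get(role, 0.0) + token_entropy(probs)
        counts[role] = counts.get(role, 0) + 1
    return {role: totals[role] / counts[role] for role in totals}


# Toy, made-up distributions (hypothetical, not data from the study).
toy_steps = [
    ("planning", [0.25, 0.25, 0.25, 0.25]),   # many plausible plans
    ("planning", [0.4, 0.3, 0.2, 0.1]),
    ("execution", [0.97, 0.01, 0.01, 0.01]),  # one clearly right step
    ("execution", [1.0, 0.0, 0.0, 0.0]),
]

result = mean_entropy_by_role(toy_steps)
print(result)
```

In this toy setup the planning average comes out higher than the execution average, which is the shape of the split described above. How a real pipeline would decide which tokens count as planning is left open here.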

That split shows up again when you decompose skills and watch how they scale. A 12-skill breakdown (FLASK) found that metacognition saturates early, at around 7B parameters, and logical efficiency plateaus at around 30B, while reasoning and knowledge skills keep improving with scale (see: Do all AI skills improve equally as models scale?). In other words, different skill families have different growth curves, which is exactly what you'd expect if they're structurally distinct rather than one capability scaled up. The same note makes a sharper point: open-source models can imitate surface *style* convincingly but fail at *reasoning*; distillation copies the form, not the substance. A separate finding sharpens the knife: on BIG-Bench Hard, chain-of-thought exemplars built from logically *invalid* steps matched the performance of valid ones, because the model is learning the *shape* of reasoning, not genuine inference (see: Does logical validity actually drive chain-of-thought gains?). Early skills are often this kind of learned form; meta-strategy is where the structural content has to actually be there.

Here's the part you might not expect: several notes argue the meta-strategy layer isn't really *built* during training at all; it's *selected*. Base models already carry latent reasoning strategies in their activations, and post-training mostly elicits what's there rather than creating it (see: Do base models already contain hidden reasoning ability?). Pushed further, one analysis frames RL post-training as teaching a model *when* to reason, not *how*: the strategies pre-exist as activation vectors, and training optimizes deployment timing (see: Does RL post-training create reasoning or just deploy it?). If that's right, the structural difference between early and late skills is partly a difference between *acquiring* a procedure and *learning to route to* a strategy you already had.

The meta-strategy layer also has a distinctive geometry. One method shows reasoning works best when exploration goes breadth-first through diverse abstractions rather than drilling depth-first down a single chain; depth-only reasoning hits an 'underthinking' failure mode that structured breadth avoids (see: Can abstractions guide exploration better than depth alone?). That maps cleanly onto the entropy story: meta-strategy *is* the breadth, the deliberate keeping-open of multiple plans. And the deep substrate underneath both phases seems to be procedural knowledge: reasoning generalizes from broad, transferable procedural patterns picked up across many documents, unlike factual recall, which leans on narrow memorization (see: Does procedural knowledge drive reasoning more than factual retrieval?).

One last wrinkle worth knowing: training *order* mechanically reshapes which skills survive. Structured tasks pull output entropy down while creative tasks push it up, so how the two are scheduled matters: a BWT-guided schedule that trains structured tasks first is reported to gain 6.2% over joint training by keeping entropy collapse from damaging open-ended capability (see: Does training order reshape how models handle different task types?). So the early-generic-vs-late-meta distinction isn't only about what the model learns; it's about sequence. Let the entropy-lowering pull of convergent execution skills go unmanaged and you can crush the high-entropy exploration that meta-strategy depends on. The two regimes don't just differ; they can be in tension.
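One way to watch for the entropy collapse mentioned above is to track mean output entropy across checkpoints and flag the first point where it drops below some fraction of its starting value. The sketch below assumes such a check; the `collapse_step` helper, the 0.5 threshold, and the entropy traces are all hypothetical and are not taken from the Omni-Thinker work.

```python
def collapse_step(entropies, floor_ratio=0.5):
    """Return the first checkpoint index whose entropy falls below
    floor_ratio * the starting entropy, or None if it never does."""
    if not entropies:
        return None
    floor = entropies[0] * floor_ratio
    for step, value in enumerate(entropies):
        if value < floor:
            return step
    return None


# Made-up entropy traces (hypothetical, for illustration only).
structured_heavy = [2.0, 1.6, 1.1, 0.8, 0.5]  # entropy keeps falling
mixed_schedule = [2.0, 1.9, 1.8, 1.8, 1.7]    # entropy stays near its start

print(collapse_step(structured_heavy))  # 3
print(collapse_step(mixed_schedule))    # None
```

The threshold is a design choice rather than a known constant; a real run would need to calibrate it against how open-ended task quality actually degrades.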


Sources (8 notes)

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Do all AI skills improve equally as models scale?

FLASK's 12-skill decomposition reveals metacognition saturates at 7B parameters while logical efficiency plateaus at 30B, but reasoning and knowledge skills improve continuously. Open-source models successfully imitate surface-level style but fail at reasoning—confirming that distillation copies form not substance.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: **Do early generic skills and later meta-strategy skills represent structurally distinct regimes, or do they form a continuum?**

What a curated library found — and when (findings span 2023–2025; dated claims, not current truth):
- Two-phase RL dynamics: execution convergence (low entropy, one right way) precedes planning divergence (high entropy, sustained exploration) (~2025, arXiv:2507.14783).
- Metacognition saturates ~7B parameters; logical efficiency plateaus ~30B, while reasoning and knowledge skills keep improving. Different growth curves suggest structural distinctness (~2023, arXiv:2307.10928).
- Logically invalid chains-of-thought perform nearly as well as valid ones, implying early skills learn *form* not substance (~2023, arXiv:2307.10573).
- Base models already possess latent reasoning strategies before any post-training; post-training *selects* rather than builds them (~2025, arXiv:2507.04742).
- Breadth-first exploration across abstractions outperforms depth-first chains; meta-strategy *is* the deliberate branching (~2025, arXiv:2505.20296).

Anchor papers (verify; mind their dates):
- arXiv:2307.10573 (2023): Invalid logic equivalence in CoT prompting.
- arXiv:2307.10928 (2023): FLASK skill-set evaluation framework.
- arXiv:2507.14783 (2025): Omni-Thinker multi-task RL entropy dynamics.
- arXiv:2411.12580 (2024): Procedural knowledge driving reasoning.

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For each finding above—especially the claim that early and late skills are *distinct regimes*—judge whether newer training methods (e.g., RLP as pretraining, arXiv:2510.01265), model scale, or evaluation harnesses have since blurred or sharpened the boundary. Does the entropy split still hold? Can smaller models now bridge the form–substance gap? Cite what moved the needle.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Does any recent paper argue for a *unified* continuum rather than two regimes? Does activation steering (arXiv:2507.04742) or stepwise judging (arXiv:2508.19229) collapse the distinction?
(3) **Propose 2 research questions that assume the regime may have shifted:** e.g., "If RL-as-pretraining erases the two-phase dynamic, what replaces it?" or "Does instruction-tuning order now matter more than task composition for preserving meta-strategy entropy?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
