SYNTHESIS NOTE

Can training data augmentation match test-time compute scaling benefits?

Can generating thinking trajectories during pretraining unlock the same efficiency gains that test-time scaling provides at inference? This note explores whether the compute-allocation principle holds across the training-inference boundary.

Synthesis note · 2026-02-22 · sourced from LLM Architecture

Thinking Augmented Pre-Training (TPT, arXiv:2509.20186) starts from a simple insight: some valuable tokens are too hard to learn in a single next-token prediction step because they are the output of complex, multi-step human reasoning. Rather than modifying the architecture, TPT augments the training data itself: it generates thinking trajectories with open-source LLMs and interleaves them with the original text.
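The augmentation step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the prompt wording, the `AugmentedSample` type, the `augment` helper, and the choice to append the trajectory after the document are all hypothetical, and `stub_generator` stands in for a call to a real open-source LLM.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical prompt wording; the note does not give the actual prompt.
THINKING_PROMPT = (
    "Simulate an expert's in-depth thought process while analyzing "
    "the following text:\n\n{document}"
)


@dataclass
class AugmentedSample:
    document: str  # original pretraining text, unchanged
    thinking: str  # generated thinking trajectory

    def to_training_text(self, separator: str = "\n\n") -> str:
        # Assumption: the trajectory is appended after the document, and
        # ordinary next-token prediction is applied to the whole string.
        return f"{self.document}{separator}{self.thinking}"


def augment(document: str, generate: Callable[[str], str]) -> AugmentedSample:
    """Attach a generated thinking trajectory to one document."""
    prompt = THINKING_PROMPT.format(document=document)
    return AugmentedSample(document=document, thinking=generate(prompt).strip())


def stub_generator(prompt: str) -> str:
    # Stand-in for an LLM call; returns a fixed placeholder trajectory.
    return "Thinking: the statement holds because each step follows from the last."


sample = augment("2 + 2 = 4.", stub_generator)
training_text = sample.to_training_text()
```

Passing the generator in as a plain callable keeps the sketch independent of any particular model-serving library.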

The key finding: a 3x improvement in data efficiency, with gains of more than 10% on reasoning benchmarks for a 3B-parameter model. No architecture changes. No human annotation. The thinking trajectories simulate an expert's analysis of the text, decomposing hard tokens into learnable intermediate steps.

The mechanism has a natural self-organizing property. Thinking trajectories come out longer in domains such as mathematics, where reasoning is more intensive: thinking length correlates positively with the reasoning intensity of the original text. Harder tokens therefore automatically receive more training compute through longer trajectories, which functions as a natural up-sampling mechanism for high-value data.
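A toy calculation shows how longer trajectories shift the training mix. The domain names and token counts below are made up purely for illustration (they are not figures from the paper), and `token_share` and `augmented_counts` are hypothetical helpers.

```python
from typing import Dict


def token_share(counts: Dict[str, int]) -> Dict[str, float]:
    """Fraction of all training tokens contributed by each domain."""
    total = sum(counts.values())
    return {domain: n / total for domain, n in counts.items()}


def augmented_counts(doc: Dict[str, int], thinking: Dict[str, int]) -> Dict[str, int]:
    """Tokens per domain once thinking trajectories are added to the documents."""
    return {domain: doc[domain] + thinking[domain] for domain in doc}


# Hypothetical numbers, for illustration only: equal amounts of raw text,
# but math documents elicit three times as many thinking tokens as news.
doc_tokens = {"math": 1000, "news": 1000}
thinking_tokens = {"math": 3000, "news": 1000}

before = token_share(doc_tokens)
after = token_share(augmented_counts(doc_tokens, thinking_tokens))
```

With these assumed counts, math's share of training tokens rises from one half to two thirds without any explicit re-weighting of the data.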

This is the training-time analog of test-time scaling. Where "Can inference compute replace scaling up model size?" considers spending extra compute at inference, TPT shows the same principle operating during training: allocate more compute to harder tokens. The difference is the intervention point, training rather than inference.

TPT is complementary to "Can next-token prediction become a reasoning task with RL?". RPT (Reinforcement Pre-Training) changes the training objective: RL instead of next-token prediction (NTP). TPT changes the training data: text augmented with thinking. Both target the same problem, that standard NTP is insufficient for learning complex reasoning from data, but intervene at different levels.
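The contrast can be written schematically. The notation is assumed here for illustration and is not taken from either paper: let d be a document, t a generated thinking trajectory, and x = [d; t] their concatenation. TPT keeps the ordinary NTP loss but applies it to x. An RL objective in the spirit of RPT instead samples a reasoning chain c and a prediction ŷ for the next token, and rewards the prediction when it matches the true token d_i.

```latex
% TPT: unchanged next-token-prediction loss, applied to augmented data
\mathcal{L}_{\mathrm{TPT}}(\theta)
  = -\sum_{i=1}^{|x|} \log p_\theta\left(x_i \mid x_{<i}\right),
  \qquad x = [d;\, t]

% RL-style objective (schematic): reward a correct next-token prediction
J_{\mathrm{RL}}(\theta)
  = \mathbb{E}_{(c,\,\hat{y}) \sim \pi_\theta(\cdot \mid d_{<i})}
    \left[ \mathbb{1}\left[\hat{y} = d_i\right] \right]
```

In the first expression only the data changes; in the second only the objective changes, which is why the two approaches can in principle be combined.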

If base models already contain latent reasoning ability (see "Do base models already contain hidden reasoning ability?"), TPT provides a pretraining-time mechanism for strengthening it. The thinking trajectories may serve as the training-time equivalent of the "minimal signals" that activate reasoning, making reasoning patterns more available for later post-training to refine.

A notable finding: the model trained on augmented data can surpass the performance of the LLM that generated the thinking trajectories. Explanation is easier than generation from scratch, so the student benefits from the teacher's explanatory labor even when the teacher's own generation capabilities are limited.

Original note title

thinking-augmented pre-training increases data efficiency 3x by applying test-time scaling principles at training time