Can embedding future information in training data improve planning?
This note explores whether inserting lookahead tokens that contain future goals into training sequences helps models learn long-range planning without changing their architecture. The question matters because it tests whether data-level changes can produce architectural-level reasoning improvements.
TRELAWNEY (2504.11336) identifies a structural mismatch in causal language model training: each token is predicted only from the preceding context, whereas in human writing and reasoning the goal is typically known before the exact arguments or phrasing. Teacher forcing compounds this. It accelerates training by supplying the correct previous tokens as context, but models trained this way latch onto local patterns and surface-level correlations rather than learning long-range dependencies.
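To make the mismatch concrete, here is a minimal sketch of how teacher forcing turns one sequence into (context, target) training pairs. The function name and the toy sentence are illustrative assumptions, not taken from the paper.

```python
def teacher_forcing_pairs(tokens):
    """Return (context, target) pairs as used in next-token training.

    Each target is predicted from the ground-truth prefix only, so
    nothing to the right of the target is ever visible as context.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Toy sequence (hypothetical): the goal appears only at the very end.
tokens = ["plan", "the", "route", "then", "reach", "goal"]
pairs = teacher_forcing_pairs(tokens)

for context, target in pairs:
    print(context, "->", target)
```

In this toy case the token "goal" is the target of the final pair and never part of any context, which is the forward-only limitation the note describes.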
The fix is data-centric rather than architectural. TRELAWNEY augments training data by interleaving special lookahead tokens (<T> and </T>) that encapsulate future information. The placement and content of these tokens can be random or task-specific. The model learns from modified training data using the standard training infrastructure — no architecture changes, no additional training tricks.
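The augmentation can be sketched roughly as follows, assuming the random-placement variant: a span copied from later in the sequence is wrapped in `<T>` and `</T>` and spliced in at an earlier position. The function name `insert_lookahead`, the `span_len` parameter, and the sampling scheme are assumptions made for illustration; the paper's actual procedure may differ.

```python
import random

T_OPEN, T_CLOSE = "<T>", "</T>"


def insert_lookahead(tokens, span_len=2, rng=None):
    """Splice '<T> future-span </T>' into a token list.

    The future span is copied from a position at or after the
    insertion point, so it always describes upcoming content.
    """
    rng = rng or random.Random()
    if len(tokens) < span_len + 2:
        return list(tokens)  # too short to augment; return a copy
    # Keep at least one token of ordinary context before the block.
    insert_at = rng.randrange(1, len(tokens) - span_len)
    # Choose where the copied future span starts.
    future_start = rng.randrange(insert_at, len(tokens) - span_len + 1)
    future = tokens[future_start:future_start + span_len]
    return tokens[:insert_at] + [T_OPEN] + future + [T_CLOSE] + tokens[insert_at:]


tokens = "start at A go to B then C and finish at D".split()
aug = insert_lookahead(tokens, span_len=2, rng=random.Random(0))
print(" ".join(aug))
```

Because the output is just another token sequence, it can be fed to an unmodified next-token training loop, which is the point of the data-centric design.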
The results span planning, algorithmic reasoning, and story generation. The model's ability to generate goals, a natural byproduct of the training augmentation, can further improve planning and reasoning when used at inference time. This training-time goal conditioning complements the approach in "Does planning direction affect how hard problems become?", which provides goal information at inference time by reversing the search direction; TRELAWNEY instead internalizes the benefits of backward planning during training.
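One way to picture inference-time use is a prompt that either ends with an open `<T>`, inviting the model to state its own goal first, or carries a goal supplied by the caller. This is a hypothetical sketch: `build_prompt` and the exact string layout are assumptions, and the source does not specify a prompt format.

```python
T_OPEN, T_CLOSE = "<T>", "</T>"


def build_prompt(context, goal=None):
    """Build a prompt for a model trained with lookahead tokens.

    goal=None: end with an open <T> so the model generates a goal itself.
    goal=str:  insert a caller-specified goal wrapped in <T> ... </T>.
    (Both layouts are illustrative assumptions.)
    """
    if goal is None:
        return f"{context} {T_OPEN}"
    return f"{context} {T_OPEN} {goal} {T_CLOSE}"


self_generated = build_prompt("Find a path from A to D.")
user_specified = build_prompt("Find a path from A to D.", goal="arrive at D via C")
print(self_generated)
print(user_specified)
```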
This is a different intervention than multi-token prediction (Bachmann & Nagarajan, 2024; Gloeckle et al., 2024), which forces simultaneous prediction of multiple future tokens. Multi-token prediction modifies the training objective and often the architecture. TRELAWNEY modifies only the training data, making it compatible with existing infrastructure and scalable to any model size.
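The contrast can be sketched with toy training targets. The helper names and the hand-made augmented sequence below are illustrative assumptions, and the cited papers' actual implementations (for example, any extra prediction heads) are not reproduced here.

```python
def next_token_targets(tokens):
    """Standard objective: one target token per position."""
    return [(tokens[:i], [tokens[i]]) for i in range(1, len(tokens))]


def multi_token_targets(tokens, k):
    """Multi-token objective: k target tokens per position."""
    return [(tokens[:i], tokens[i:i + k]) for i in range(1, len(tokens) - k + 1)]


plain = ["a", "b", "c", "d", "e"]
# Hand-made lookahead-augmented version of `plain` (hypothetical example).
augmented = ["a", "<T>", "d", "e", "</T>", "b", "c", "d", "e"]

# Multi-token prediction: same data, changed objective.
mtp = multi_token_targets(plain, k=2)
# Lookahead augmentation: changed data, same objective.
data_centric = next_token_targets(augmented)

print(mtp[0])
print(data_centric[1])
```

In the second case every target is still a single next token, so an unmodified training pipeline can consume the augmented sequence.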
Read alongside "Does training data format shape reasoning strategy more than domain?", TRELAWNEY is evidence that a format intervention at the training-data level can have architectural-level effects. The lookahead tokens create a new "format" that teaches the model to condition generation on future goals, shifting its reasoning strategy from purely autoregressive to goal-conditioned.
The connection to "Can backward reasoning during training improve forward reasoning?" is complementary: backward reasoning provides consistency checking from the end state, while lookahead tokens provide goal information from the future. Both address the forward-only limitation of standard next-token prediction (NTP) from different angles.
Inquiring lines that use this note as a source 22
This note is a source for these synthesized inquiries.
- Can explicit goal state scaffolding at inference time transfer to autonomous tracking through training?
- What distinguishes planning knowledge from an executable plan that works?
- What memory and planning capabilities do AI companions need for evolving user needs?
- How does token-by-token generation constrain a model's ability to plan ahead?
- How can diffusion models predict future tokens without completing prior blocks?
- Can architecture changes and early stopping combine to close the diffusion inference gap?
- Can episodic and semantic memory improve long-horizon task reasoning?
- How can weak-to-strong progressive training target planning without interfering with grounding?
- Can architectural changes like decoupling intent understanding help overcome next-turn reward limitations?
- Why do aha moments emerge specifically during the planning phase?
- Do thought anchors correspond mechanistically to planning tokens in RL?
- Can training models on backward reasoning improve their forward planning ability?
- Can LLM-synthesized behavioral heuristics compete with learned policy improvements?
- How does planning-before-execution compare to iterative reasoning and action loops?
- Can goal information injected at inference time replace goal-conditioned training?
- How does post-training shift models from passive prediction to on-policy action?
- What data properties enable transformers to learn sequential decision-making in context?
- How does predictive accuracy on future tokens differ from correctness on labeled answers?
- How do thought actions represent policy improvement steps in practice?
- Can standard next-token prediction capture complex multi-step human reasoning directly?
- Can backward planning reduce search difficulty when multiple goal state paths exist?
- How do compact latent dynamics enable planning without explicit chain of thought?
Related concepts in this collection 5
- Does training data format shape reasoning strategy more than domain?
  What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
  data-level format intervention with architectural-level effects
- Can backward reasoning during training improve forward reasoning?
  Does training models to reason backward (generating inverse questions and solutions) build internal consistency checking that transfers to forward-only inference? This explores whether backward capacity internalized during training without test-time deployment can enhance reasoning quality.
  complementary future-information injection
- Can training data augmentation match test-time compute scaling benefits?
  Can generating thinking trajectories during pretraining unlock the same efficiency gains that test-time scaling provides at inference? This explores whether the compute-allocation principle works across the training-inference boundary.
  both are data-centric training augmentations
- Does planning direction affect how hard problems become?
  Planning research typically goes forward only. But some problems get easier when you work backward from the goal. What makes direction matter, and can language models exploit this?
  both address the forward-only limitation of autoregressive generation: TRELAWNEY injects goal/future information into training data so the model learns to condition on goals, while backward planning reverses the search direction at inference time; TRELAWNEY could be seen as training the model to internalize the benefits backward planning provides at test time
- Which sentences actually steer a reasoning trace?
  Can we identify which sentences in a reasoning trace have outsized influence on the final answer? Three independent methods converge on a surprising answer about planning and backtracking.
  thought anchors (especially planning sentences) may be the behavioral manifestation of TRELAWNEY-like goal conditioning: the model generates planning sentences that function as self-imposed lookahead tokens, conditioning subsequent generation on anticipated goals
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Looking beyond the next token
- Emergent Hierarchical Reasoning In LLMs Through Reinforcement Learning
- Chain of Thoughtlessness? An Analysis of CoT in Planning
- On the Limits of Innate Planning in Large Language Models
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
- Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens
- Position: LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks
- Reasoning Language Models: A Blueprint
Original note title
data-centric lookahead tokens enable planning without architectural changes by embedding future information in training sequences