Can models learn reasoning from predicting any text?
Does training a model to generate rationales at every token position of arbitrary internet text enable general reasoning without task-specific supervision? A positive answer would challenge the assumption that reasoning requires curated QA datasets.
STaR showed that LMs can bootstrap reasoning by training on rationales that led to correct answers on curated QA datasets. Quiet-STaR generalizes this in one critical way: rather than generating a rationale per problem, it generates a rationale at every token position to explain future text. The training corpus is arbitrary internet text, not curated reasoning tasks.
The mechanism: at each token, the model generates a thought, mixes the thought-conditioned next-token prediction with the raw next-token prediction via a learned mixing head, and uses REINFORCE to favor thoughts that make the following text more likely. Custom meta-tokens mark where each thought starts and ends, allowing the model to learn when to generate rationales and when to commit to predictions.
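The mixing step can be illustrated with a minimal sketch. It assumes the mix is a linear interpolation of logits and takes the weight as a plain number rather than producing it from a learned head; the function names (`softmax`, `mix_predictions`) and the toy logits are hypothetical, not taken from the Quiet-STaR code.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_predictions(base_logits, thought_logits, weight):
    """Interpolate raw and thought-conditioned next-token logits.

    weight = 0.0 keeps the raw prediction, weight = 1.0 uses only the
    thought-conditioned one. ASSUMPTION: in a real system the weight
    would come from a learned mixing head; here it is passed in.
    """
    if len(base_logits) != len(thought_logits):
        raise ValueError("logit vectors must cover the same vocabulary")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    mixed = [
        (1.0 - weight) * b + weight * t
        for b, t in zip(base_logits, thought_logits)
    ]
    return softmax(mixed)


# Toy three-token vocabulary (illustrative values only).
raw = [2.0, 0.5, 0.1]
after_thought = [0.2, 3.0, 0.1]
mixed_probs = mix_predictions(raw, after_thought, 0.5)
```

A weight near zero lets the model ignore an unhelpful thought, which is one way to keep early, low-quality thoughts from degrading the base prediction.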
The key shift: from task-specific reasoning ("do this type of math problem") to text-general reasoning ("what reasoning helps predict what comes next in any text?"). STaR's ceiling was its dependence on curated QA datasets, which are high-quality but inherently narrow. Quiet-STaR's ceiling is the diversity of the pretraining corpus.
Because rationale quality is judged by predictive accuracy on future text rather than correctness on labeled answers, the method generalizes across the tasks present in language rather than the tasks present in annotation pipelines. The "task" is prediction itself.
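A rough sketch of how predictive accuracy can serve as the reward: each sampled thought is scored by the log-likelihood it assigns to the true future text, relative to the other thoughts sampled at the same position. The mean baseline and the names `thought_rewards` and `reinforce_surrogate_loss` are assumptions made for illustration, and the numbers are made up.

```python
def thought_rewards(future_log_likelihoods):
    """Reward each thought by how much it helps predict the future text.

    future_log_likelihoods[i] is the log-probability of the true future
    tokens given thought i. ASSUMPTION: the baseline is the mean over
    the thoughts sampled at the same position.
    """
    baseline = sum(future_log_likelihoods) / len(future_log_likelihoods)
    return [ll - baseline for ll in future_log_likelihoods]


def reinforce_surrogate_loss(thought_log_probs, rewards):
    """REINFORCE surrogate: minimizing it raises the log-probability of
    thoughts with positive reward and lowers it for negative reward.

    thought_log_probs[i] is the log-probability the model assigned to
    thought i; rewards are treated as constants.
    """
    if len(thought_log_probs) != len(rewards):
        raise ValueError("need one reward per thought")
    total = sum(r * lp for r, lp in zip(rewards, thought_log_probs))
    return -total / len(rewards)


# Three sampled thoughts at one token position (illustrative values).
future_ll = [-2.0, -1.0, -3.0]      # how well each thought predicts the future
thought_lp = [-5.0, -4.0, -6.0]     # how likely the model found each thought
rewards = thought_rewards(future_ll)
loss = reinforce_surrogate_loss(thought_lp, rewards)
```

No labeled answer appears anywhere in this signal: the only supervision is the text that actually follows.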
The approach remains constrained by its training distribution: rationales that help predict common internet text patterns may not generalize to hard reasoning that requires novel inferences rarely seen in the corpus. But it suggests that general reasoning competence may be trainable as a side effect of improved language modeling, rather than as a separate supervised objective.
Inquiring lines that use this note as a source (27)
This note is a source for these synthesized inquiries.
- How do transformers perform multi-hop reasoning across distant training documents?
- How much does training data format shape what reasoning strategy emerges?
- Does the DeepSeek R1 single token insertion represent genuine reasoning?
- Why do open-source models trained on proprietary outputs still fail at reasoning?
- How does cross-domain reasoning transfer differ from domain-specific knowledge transfer?
- Can fine-tuning ever teach semantic inference instead of amplifying training shortcuts?
- How does inductive reasoning from partial evidence enable hypothesis formation?
- Can reasoning catalyst data serve as a stable foundation for test-time training?
- How do single training examples activate reasoning capabilities in language models?
- Do base models contain latent reasoning that minimal training can unlock?
- Why does eliminating proxy-model filtering improve reasoning emergence in pretraining?
- Does latent reasoning capability exist in base models before any training?
- Can models reason at inference without specialized internal training?
- Can smaller amounts of diverse reasoning demonstrations replace exhaustive factual training data?
- How do meta-tokens help models learn when to generate reasoning versus commit predictions?
- Why might rationales that predict common text patterns fail on hard novel reasoning?
- Can reasoning learned from language modeling actually transfer to knowledge-intensive domains?
- Does token-level reasoning during pretraining improve general reasoning without task-specific supervision?
- What kinds of reasoning tasks reveal the ceiling of text-only training?
- Does the token prediction framing actually capture what human reasoning does?
- How much does training data format influence reasoning strategy versus domain content?
- Can standard next-token prediction capture complex multi-step human reasoning directly?
- How does training data structure shape reasoning strategy more than domain content?
- Can base models spontaneously produce reasoning traces without any RL training?
- Can articulating latent reasoning processes improve transfer across domains?
- Why does latent-level prediction beat token-level prediction for reasoning?
- Can small demonstration sets unlock general reasoning without large question data?
Related concepts in this collection (7)
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  complements: Quiet-STaR offers a pretraining-time mechanism for the same underlying capability
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  contrasts: Quiet-STaR bakes reasoning into the forward pass at every token; RL teaches when to engage reasoning mechanisms at deployment
- Why doesn't mathematical reasoning transfer to medicine?
  Can models trained to reason well about math apply those skills to medical domains through fine-tuning? This explores whether reasoning ability is truly domain-agnostic or constrained by domain-specific knowledge requirements.
  extends: Quiet-STaR's ceiling is training distribution diversity; this note explains why general reasoning competence, however trained, hits a floor when domain-specific knowledge is absent
- Can training data augmentation match test-time compute scaling benefits?
  Can generating thinking trajectories during pretraining unlock the same efficiency gains that test-time scaling provides at inference? This explores whether the compute-allocation principle works across the training-inference boundary.
  parallel token-level reasoning during pretraining: Quiet-STaR modifies the training objective to learn rationales at each token, while TPT augments the training data with externally-generated thinking trajectories; different intervention points (objective vs. data) targeting the same problem of making pretraining reasoning-aware
- Can models learn to internalize search algorithms through training?
  Can chain-of-thought reasoning be taught as an explicit search process that models learn to implement internally? This matters because it could unlock algorithmic optimization rather than just output optimization.
  complementary internalization: Quiet-STaR trains token-level rationale generation via predictive accuracy, while Meta-CoT trains trace-level search strategies via linearized MCTS/A*; together they suggest reasoning internalization is possible at multiple granularities, from individual predictions to complete search procedures
- Can next-token prediction become a reasoning task with RL?
  Does reinforcement learning applied to next-token prediction during pretraining encourage genuine reasoning rather than surface memorization? This matters because it could unlock reasoning capability without requiring labeled data or human feedback.
  parallel approach: RPT uses next-token verification as RL reward signal at the same token-level granularity; Quiet-STaR generates rationales via REINFORCE while RPT reasons about predictions via RL, both treating the pretraining corpus as the training signal for reasoning
- Can models learn to evaluate their own work during training?
  Explores whether language models can internalize reward function computation as part of training, transforming external feedback into internal self-assessment capability without slowing inference.
  complementary training-time reasoning augmentation: Quiet-STaR generates rationales at every token position, PCL generates self-evaluations in post-EOS space; both add auxiliary reasoning during training that shapes the model without inference cost, but at different positions (pre-token vs. post-answer)
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
- RL-STaR: Theoretical Analysis of Reinforcement Learning Frameworks for Self-Taught Reasoner
- Looking beyond the next token
- Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning
- On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
- Efficient Tool Use with Chain-of-Abstraction Reasoning
- Reasoning to Learn from Latent Thoughts
- Chain-of-Thought Reasoning Without Prompting
Original note title
quiet-star learns rationale generation at the token level not the task level enabling general reasoning without task-specific supervision