Can reconstructing expert thinking improve reasoning transfer?
Expert texts show only the final result of complex thinking. Can we reverse-engineer those hidden thought processes and use them to train models that reason better across different domains?
Standard reasoning training uses supervised fine-tuning or reinforcement learning, both of which require task-specific signals (math correctness, code execution) and therefore do not scale to domains where verifiable feedback is unavailable. Continual pretraining (CPT) avoids this constraint but provides no reasoning signal: the model just sees more text. Reasoning CPT proposes a third path. Every expert text (a math proof, a legal opinion) is the visible result of an underlying thought process involving trial, hypothesis, recall, and verification, and that hidden thought process can be reconstructed as synthetic data. The related note "Why do language models need so much more text than humans?" rests on the same surface-versus-process distinction.
The reconstruction targets four characteristic aspects of expert thinking: human-like spontaneous expressions ("Hmm... ", "Aha!"), background knowledge recall (internally retrieving relevant rules), decision-making (considering an action), and self-verification (checking for omissions). The synthetic training sequence concatenates the original text with its reconstructed hidden thoughts, giving the model both the visible result and the implicit process behind it.
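The concatenation step can be sketched as follows. This is a minimal illustration, not the method's actual data format: the delimiter strings, the `ExpertSample` type, the text-first ordering, and the toy sample are all assumptions introduced here, since the note only says that the original text and its reconstructed hidden thoughts are joined into one sequence.

```python
from dataclasses import dataclass

# Hypothetical delimiters: the note does not say how the two parts are marked.
THOUGHT_OPEN = "<hidden_thought>"
THOUGHT_CLOSE = "</hidden_thought>"

# The four aspects of expert thinking named in the note.
ASPECTS = (
    "human-like spontaneous expressions",
    "background knowledge recall",
    "decision-making",
    "self-verification",
)


@dataclass(frozen=True)
class ExpertSample:
    """One expert text paired with its reconstructed hidden thought."""
    expert_text: str
    hidden_thought: str


def build_training_sequence(sample: ExpertSample) -> str:
    """Join the visible text and the hidden thought into one training string.

    Assumed order: expert text first, then the delimited hidden thought.
    """
    return "\n".join([
        sample.expert_text.strip(),
        THOUGHT_OPEN,
        sample.hidden_thought.strip(),
        THOUGHT_CLOSE,
    ])


# Toy sample, invented for illustration only.
toy = ExpertSample(
    expert_text="The contract is void because consideration was absent.",
    hidden_thought=(
        "Hmm... what makes a contract enforceable? "
        "Recall: offer, acceptance, consideration. "
        "Check each element, then verify nothing was omitted."
    ),
)

if __name__ == "__main__":
    print(build_training_sequence(toy))
```

Keeping the delimiters as named constants makes it easy to swap in whatever markers a real pipeline uses without touching the assembly logic.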
Three findings distinguish this from standard CPT. First, cross-domain transfer: training on hidden thoughts reconstructed from legal texts improves not just MMLU social sciences but also MMLU-STEM, by 4.3 points, because the reasoning skill, not the domain knowledge, transfers. Second, the gap widens with difficulty: on the hardest MMLU problems, Reasoning CPT reaches 51.8-52.5% accuracy versus 43.9-44.6% for CPT, a roughly 8-point advantage. Third, models automatically adjust reasoning length to problem difficulty (short for easy, long for hard) without explicit instruction.
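The "roughly 8-point" figure can be checked from the two accuracy ranges quoted above. The note does not say which Reasoning CPT run is compared with which CPT run, so this sketch only brackets the gap (smallest and largest possible difference) and reports the midpoint difference; the variable and function names are illustrative.

```python
# Accuracy ranges on the hardest MMLU problems, as quoted in the note (percent).
reasoning_cpt = (51.8, 52.5)
standard_cpt = (43.9, 44.6)


def gap_bounds(high, low):
    """Smallest and largest possible gap between two (min, max) ranges."""
    smallest = high[0] - low[1]
    largest = high[1] - low[0]
    return round(smallest, 1), round(largest, 1)


def midpoint_gap(high, low):
    """Difference between the midpoints of the two ranges."""
    return round(sum(high) / 2 - sum(low) / 2, 1)


if __name__ == "__main__":
    print(gap_bounds(reasoning_cpt, standard_cpt))    # (7.2, 8.6)
    print(midpoint_gap(reasoning_cpt, standard_cpt))  # 7.9
```

Whichever pairing is used, the advantage falls between 7.2 and 8.6 points, consistent with the rounded figure in the text.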
A plausible mechanism for the adaptive reasoning length: the training corpus shows a positive correlation between original-text length and hidden-thought length (Spearman ρ = 0.348 for STEM, 0.486 for Law). The model plausibly learns a heuristic, namely to continue thinking until enough evidence accumulates to confidently predict the next token, which produces short chains for easy questions and long chains for hard ones. The implication is that overthinking and underthinking are both consequences of training on text that does not reveal how much thinking effort went into it.
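For readers who want to reproduce a length correlation of this kind on their own corpus, here is a minimal pure-Python sketch of Spearman's ρ (Pearson correlation computed on average ranks). The length lists are invented toy values, not the corpus statistics behind the 0.348 and 0.486 figures, and the function names are illustrative.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy token counts, invented for illustration only.
text_lengths = [120, 340, 560, 90, 800]
thought_lengths = [200, 310, 900, 150, 700]

if __name__ == "__main__":
    print(spearman_rho(text_lengths, thought_lengths))
```

A rank-based measure is a natural choice here because length distributions are typically skewed, and ρ only asks whether longer texts tend to come with longer thoughts, not whether the relationship is linear.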
Inquiring lines that use this note as a source (12)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How does removing thinking labor affect expert understanding of their field?
- Can extended thinking genuinely improve reasoning or just increase variance?
- Why do monological explanations fail to transfer understanding compared to dialogical ones?
- Why does general reasoning not transfer to knowledge-intensive medical domains?
- How does cross-domain reasoning transfer differ from domain-specific knowledge transfer?
- Can we transfer reasoning structure without copying surface form?
- Why does polished presentation substitute for deeper expert judgment?
- How do reasoning training methods sacrifice some thinking skills while improving others?
- What makes expert writing harder to learn from than surface text alone?
- Can explicit reflection during AI-assisted work improve transfer of learning?
- Why does decomposition ability transfer across domains but solving ability does not?
- Can articulating latent reasoning processes improve transfer across domains?
Related concepts in this collection (5)
- Why do language models need so much more text than humans?
  Language models train on the surface of written text, but humans learn by inferring the underlying thoughts behind what they read. Does this explain why models need vastly more data to reach human-level understanding?
  extends: companion piece; same compressed-surface diagnosis applied at the pretraining-data level instead of the inference level
- Can chain-of-thought reasoning be learned during pretraining itself?
  Explores whether reasoning emerges more effectively when models treat thinking as an exploratory action during next-token prediction, rather than only after pretraining through reinforcement learning.
  complements: RPT and Reasoning CPT both train reasoning at pretraining time but with different signals (information-gain reward vs reconstructed hidden thoughts)
- Can next-token prediction become a reasoning task with RL?
  Does reinforcement learning applied to next-token prediction during pretraining encourage genuine reasoning rather than surface memorization? This matters because it could unlock reasoning capability without requiring labeled data or human feedback.
  complements: RPT generalizes reasoning to any domain via RL on next-token; this note generalizes via reconstructed thoughts; both attack domain-specificity of reasoning training
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  complements: hidden-thought reconstruction as a way of activating latent capability without RLVR's verifiability requirement
- Does AI text generation unfold through temporal reflection?
  Explores whether the sequential ordering of tokens in LLM generation constitutes genuine temporal thought or merely probabilistic computation without reflective duration.
  tension: reconstructed thoughts add a quasi-temporal trace ("Hmm... Aha!") to training data, but surface markers of temporal cognition do not actually install temporality
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning
- Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
- Reasoning to Learn from Latent Thoughts
- Reverse Thinking Makes LLMs Stronger Reasoners
- Implicit Chain of Thought Reasoning via Knowledge Distillation
- Thinking Augmented Pre-training
- Base Models Know How to Reason, Thinking Models Learn When
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
Original note title
expert texts are surface residues of hidden thought processes — and reconstructing those processes for pretraining produces cross-domain reasoning transfer impossible in standard CPT