PretrainZero: Reinforcement Active Pretraining

Paper · arXiv 2512.03442 · Published December 3, 2025
Reinforcement Learning

Mimicking the human ability to learn actively from general experience, and thereby achieving artificial general intelligence, has long been a human dream. Recent reinforcement learning (RL) based large-thinking models demonstrate impressive expert-level abilities in domains such as software and math, but still rely heavily on verifiable rewards in specific domains, which creates a significant bottleneck for extending the performance boundary of general reasoning capabilities. In this work, we propose PretrainZero, a reinforcement active learning framework built on the pretraining corpus to extend RL from domain-specific post-training to general pretraining. PretrainZero features the following characteristics: 1) Active pretraining: inspired by the active learning ability of humans, PretrainZero learns a unified reasoning policy to actively identify reasonable and informative content in the pretraining corpus, and to reason to predict this content via RL. 2) Self-supervised learning: without any verifiable labels, pretrained reward models, or supervised fine-tuning, we directly pretrain reasoners from 3B to 30B base models on the general Wikipedia corpus using RL, significantly breaking the verification data-wall for general reasoning.

Introduction. Recent large language models (LLMs) have achieved human-level expertise in specific domains, particularly through large-scale self-supervised learning in pretraining [KMH+20, AAA+23] and Reinforcement Learning (RL) [GYZ+25, YZZ+25, CZY+25] in post-training. During pretraining, self-supervised learning with a fixed next-token prediction paradigm allows models to leverage large-scale, low-cost data to improve general capabilities effectively. In contrast, post-training RL faces a severe data-wall: Reinforcement Learning with Verifiable Rewards (RLVR) [GYZ+25, YCL+] requires domain-specific verifiers to label training samples, and Reinforcement Learning from Human Feedback (RLHF) [OWJ+22, BJN+22], which relies on reward models and human annotators, can train for only a limited number of steps to avoid reward hacking. This motivates a natural direction: performing reinforcement learning [DDT+25, LLX+25] in a self-supervised pretraining manner [BMR+20], in order to use inexpensive pretraining data to extend RLVR and overcome this data-wall.

Discussion / Conclusion. This work introduces PretrainZero, a stand-alone reinforcement pretraining method that operates on a real-world pretraining corpus. With PretrainZero, we propose a new reinforcement active pretraining framework that seeks out informative, verifiable, and not-yet-mastered content in noisy pretraining data. Thanks to this active learning ability, PretrainZero significantly surpasses previous fixed learning patterns, such as continued pretraining, supervised fine-tuning, and random or entropy-based reinforcement pretraining. We show that even Wikipedia, which base models have already been trained on during pretraining, can still improve end-task performance when combined with reinforcement and active learning methods. We believe there is great potential in exploring more efficient learning patterns to uncover latent information in the pretraining corpus.
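
Illustrative sketch. To make the self-supervised setting concrete, the snippet below shows one way a verifiable reward could be derived from raw text alone: hide a span of a passage, ask a model to predict it, and score the prediction against the hidden span. This is a minimal sketch under our own assumptions, not the PretrainZero implementation. The names `MaskedExample`, `mask_span`, `exact_match_reward`, and `random_span`, the `[MASK]` token, and the exact-match scoring rule are all hypothetical. The `random_span` helper corresponds only loosely to the random baseline mentioned above; an active method would replace it with a learned policy that chooses which span to hide.

```python
import random
from dataclasses import dataclass


@dataclass
class MaskedExample:
    prompt: str  # passage with the chosen span replaced by a mask token
    answer: str  # the hidden span, used only to compute the reward


def mask_span(words, start, length, mask_token="[MASK]"):
    """Hide words[start:start + length] and return the prompt/answer pair.

    Hypothetical helper: the mask token and word-level spans are assumptions.
    """
    answer = " ".join(words[start:start + length])
    prompt = " ".join(words[:start] + [mask_token] + words[start + length:])
    return MaskedExample(prompt=prompt, answer=answer)


def exact_match_reward(prediction, answer):
    """Binary reward: 1.0 if the normalized prediction equals the hidden span.

    Assumed scoring rule; normalization is lowercasing plus whitespace collapse.
    """
    def normalize(text):
        return " ".join(text.lower().split())

    return 1.0 if normalize(prediction) == normalize(answer) else 0.0


def random_span(words, length, rng):
    """Baseline span selection: choose the start index uniformly at random."""
    start = rng.randrange(0, len(words) - length + 1)
    return start, length


if __name__ == "__main__":
    words = "Paris is the capital of France".split()
    example = mask_span(words, start=5, length=1)
    print(example.prompt)
    print(exact_match_reward("france", example.answer))
    print(random_span(words, length=2, rng=random.Random(0)))
```

In this sketch the reward needs no human label, reward model, or domain-specific verifier, since the corpus itself supplies the answer. Which spans are worth hiding is exactly the question the active pretraining policy is meant to address.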