VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

Paper · arXiv 2606.16140 · Published June 15, 2026

Abstract. This technical report introduces VibeThinker-3B, a compact dense model with 3B parameters developed to investigate how far verifiable reasoning can be pushed within a strictly small-model regime. Building upon the Spectrum-to-Signal post-training paradigm, we systematically enhance the model through an optimized pipeline that includes curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation. Experimental evaluations demonstrate that VibeThinker-3B achieves frontier-level performance on highly demanding verifiable tasks. Specifically, it attains a score of 94.3 on AIME26 (improving to 97.1 with claim-level test-time scaling) and a Pass@1 of 80.2 on LiveCodeBench v6, and it exhibits strong out-of-distribution generalization with a 96.1% acceptance rate on recent unseen LeetCode contests. These results place it in the performance band of first-tier reasoning systems, matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. Furthermore, a score of 93.4 on IFEval confirms that this extreme reasoning enhancement does not compromise strict instruction controllability.
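
For readers unfamiliar with the Pass@1 metric reported above, the following is a minimal sketch of the widely used unbiased pass@k estimator for code benchmarks. Whether the report computes its LiveCodeBench score in exactly this way is an assumption; the function names and the sample counts in the usage note are hypothetical and are not taken from the report.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n sampled solutions, c of them correct.

    Uses the standard unbiased estimator 1 - C(n - c, k) / C(n, k).
    Assumption: this is the common definition, not necessarily the report's.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k draw contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) pairs (hypothetical data)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to the fraction of correct samples per problem, c / n, averaged over all problems. For example, `mean_pass_at_k([(4, 2), (4, 4)], k=1)` averages 0.5 and 1.0 across two hypothetical problems.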

Introduction. As reinforcement learning [1–4] has become increasingly integrated into the post-training stage of language models, the complex logical reasoning abilities of large models have improved substantially. At present, the field commonly relies on increasing parameter scale, following scaling laws, to cross the threshold required by difficult reasoning tasks. As a result, frontier reasoning ability is often concentrated in models with tens or hundreds of billions of parameters. In contrast, small language models (SLMs) with 3B parameters or fewer offer clear advantages in deployment cost, inference efficiency, and broader accessibility for academic research, but they are generally considered to face inherent bottlenecks when handling difficult mathematical derivations or complex programming tasks. Our previous work on VibeThinker-1.5B [5] demonstrated that even models with extremely small parameter counts can be elicited to produce stable and basic chains of logic. This was an initial attempt to challenge the common belief that small models struggle with long-horizon reasoning.

Discussion / Conclusion. In this report, we present VibeThinker-3B, a compact reasoning model comprising only 3 billion parameters. On challenging verifiable reasoning benchmarks, including AIME26, HMMT25, IMO-AnswerBench, and LiveCodeBench v6, it delivers strong results and further demonstrates robust generalization on out-of-distribution LeetCode evaluations. Taken together, these evaluations show that VibeThinker-3B reaches a performance band comparable to representative frontier LLMs, such as GLM-5, Kimi K2.5, Gemini 3 Pro, and Claude Opus 4.5, providing evidence that small language models can effectively approximate frontier reasoning capabilities on highly complex verifiable tasks despite much smaller parameter scales.