The Hallucination Tax of Reinforcement Finetuning

Paper · arXiv 2505.13988 · Published May 20, 2025
Training and Fine-Tuning · LLM Failure Modes

Reinforcement finetuning (RFT) has become a standard approach for enhancing the reasoning capabilities of large language models (LLMs). However, its impact on model trustworthiness remains underexplored. In this work, we identify and systematically study a critical side effect of RFT, which we term the hallucination tax: a degradation in refusal behavior that causes models to confidently produce hallucinated answers to unanswerable questions. To investigate this, we introduce SUM (Synthetic Unanswerable Math), a high-quality dataset of unanswerable math problems designed to probe models’ ability to recognize an unanswerable question by reasoning from insufficient or ambiguous information. Our results show that standard RFT training can reduce model refusal rates by more than 80%, which significantly increases models’ tendency to hallucinate. We further demonstrate that incorporating just 10% SUM during RFT substantially restores appropriate refusal behavior, with minimal accuracy trade-offs on solvable tasks. Crucially, this approach enables LLMs to leverage inference-time compute to reason about their own uncertainty and knowledge boundaries, improving generalization not only to out-of-domain math problems but also to factual question answering tasks.
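To make the "10% SUM" mixing step concrete, the following is a minimal sketch of how a training pool might be assembled. It is not the authors' pipeline: the function name `mix_with_unanswerable`, the dictionary fields, and the choice to keep the total dataset size fixed (replacing rather than appending examples) are all assumptions made for illustration.

```python
import random


def mix_with_unanswerable(answerable, unanswerable, ratio=0.10, seed=0):
    """Build a shuffled training set of size len(answerable) in which
    roughly `ratio` of the examples are unanswerable.

    Hypothetical helper: the replace-in-place policy is an assumption,
    not something the source specifies.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    rng = random.Random(seed)
    total = len(answerable)
    # Cap by the number of unanswerable problems actually available.
    n_unanswerable = min(round(total * ratio), len(unanswerable))
    n_answerable = total - n_unanswerable
    mixed = rng.sample(answerable, n_answerable) + rng.sample(
        unanswerable, n_unanswerable
    )
    rng.shuffle(mixed)
    return mixed


# Toy data standing in for solvable math problems and SUM-style problems.
solvable = [{"id": i, "answerable": True} for i in range(100)]
sum_like = [{"id": i, "answerable": False} for i in range(50)]
train_set = mix_with_unanswerable(solvable, sum_like, ratio=0.10)
```

Fixing the seed keeps the mixture reproducible across runs, which matters when comparing refusal rates between mixing ratios.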

Introduction. Reinforcement finetuning (RFT), a method that aligns large language models’ (LLMs) behavior with verifiable objectives through reinforcement learning, has become increasingly popular as a post-training strategy to enhance the reasoning capabilities of LLMs (OpenAI, 2024; Guo et al., 2025). Recent research on RFT has largely focused on improving its efficiency (Yu et al., 2025; Li et al., 2025b; Shi et al., 2025; Wang et al., 2025b) and enhancing model performance on mathematics and code generation (Luo et al., 2025; Hu et al., 2025; Zhao et al., 2025). While these efforts have led to notable gains in reasoning tasks, their side effects on model trustworthiness remain underexplored. One particularly concerning phenomenon is the tendency of models to be overconfident after RFT: they provide answers even when questions are ambiguous, under-specified, or fundamentally unanswerable.

Discussion / Conclusion. Our results highlight a key unintended consequence of RFT: the erosion of refusal behavior when models face unanswerable questions, a phenomenon we term the hallucination tax. This arises from reward functions that fail to discourage overconfident answers in ambiguous settings. We show that introducing synthetic unanswerable math (SUM) offers a simple and effective way to mitigate this issue. In summary, we identify the hallucination tax of RFT, where models increasingly hallucinate by answering unanswerable questions with unjustified confidence. To study and mitigate this phenomenon, we introduce SUM, a dataset of implicitly unanswerable math problems. Our experiments show that standard RFT amplifies hallucination, while incorporating just 10% SUM data enables models to leverage inference-time compute to reason about uncertainty and recognize their knowledge boundaries, with minimal impact on accuracy.
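The point about reward functions can be illustrated with a small sketch of a verifiable reward that also credits refusals on unanswerable problems, together with a refusal-rate metric. Everything here is an assumption for illustration: the `\boxed{...}` answer convention, the refusal string "I don't know", the binary 0/1 reward, and the function names are hypothetical and are not taken from the source.

```python
import re

# Assumed convention: the final answer appears inside \boxed{...}.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_final_answer(response):
    """Return the content of the last boxed answer, or None if absent."""
    matches = BOXED.findall(response)
    return matches[-1].strip() if matches else None


def is_refusal(final):
    """True if the final answer is the (assumed) refusal phrase."""
    if final is None:
        return False
    normalized = final.replace("\u2019", "'").lower().rstrip(". ")
    return normalized == "i don't know"


def reward(response, gold_answer, answerable):
    """Binary reward: correct answer on answerable problems,
    explicit refusal on unanswerable ones, zero otherwise."""
    final = extract_final_answer(response)
    if final is None:
        return 0.0
    if not answerable:
        return 1.0 if is_refusal(final) else 0.0
    return 1.0 if final == gold_answer else 0.0


def refusal_rate(responses):
    """Fraction of responses whose final answer is a refusal."""
    if not responses:
        return 0.0
    refused = sum(is_refusal(extract_final_answer(r)) for r in responses)
    return refused / len(responses)
```

Under a reward that only scores answerable problems, a refusal can never earn credit, so nothing in training discourages guessing. Adding the unanswerable branch gives the policy a case where refusing is the rewarded action.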