100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models

Paper · arXiv 2505.00551 · Published May 1, 2025
RL with Verifiable Rewards (RLVR) · LLM Evaluations and Benchmarks · Reasoning Architectures

The recent development of reasoning language models (RLMs) represents a novel evolution in large language models. In particular, the recent release of DeepSeek-R1 has had widespread social impact and sparked enthusiasm in the research community for exploring the explicit reasoning paradigm of language models. However, DeepSeek has not fully open-sourced the implementation details of the released models, including DeepSeek-R1-Zero, DeepSeek-R1, and the distilled small models. As a result, many replication studies have emerged that aim to reproduce the strong performance of DeepSeek-R1, reaching comparable performance through similar training procedures and fully open-source data resources. These works have investigated feasible strategies for supervised finetuning (SFT) and reinforcement learning from verifiable rewards (RLVR), focusing on data preparation and method design and yielding various valuable insights. In this report, we provide a summary of recent replication studies to inspire future research. We primarily focus on SFT and RLVR as the two main directions, introducing the details of data construction, method design, and training procedure in current replication studies.

Introduction. Reasoning language models (RLMs), such as the OpenAI o-series (Jaech et al., 2024; OpenAI, 2025), DeepSeek-R1 (Guo et al., 2025), and the QwQ series (Qwen-Team, 2024; 2025b), have emerged as a transformative advancement in the evolution of large language models (LLMs). Unlike conventional LLMs that merely generate unstructured responses, these models incorporate an explicit chain-of-thought process, providing step-by-step reasoning that mimics human cognitive processes such as self-verification and reflection. This shift quickly attracted the attention of the LLM research community, as it meets the growing demand for better explainability in complex tasks like mathematical problem solving, code generation, and logical reasoning, as well as the pursuit of steadily increasing accuracy. The significance of RLMs lies in their potential to improve the accuracy of language models’ responses while supporting them with trustworthy rationales.

Discussion / Conclusion. In this survey, we present a comprehensive overview of the replication efforts inspired by DeepSeek-R1, with a particular emphasis on the methodologies and insights underpinning supervised finetuning and reinforcement learning approaches. We explore how open-source projects have curated instruction-tuning datasets, implemented outcome-reward-based reinforcement learning strategies, and designed reward systems aimed at enhancing models’ reasoning capabilities. Beyond synthesizing trends from current initiatives, we also offer our perspective on promising future directions for the field. These include the expansion of reasoning skills beyond mathematical and coding tasks, the advancement of model safety and interpretability, and the refinement of reward mechanisms to foster more sophisticated reasoning behaviors. We hope this survey not only captures recent progress but also provides a solid foundation for ongoing research and marks a step toward achieving artificial general intelligence.
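As a rough illustration of the outcome-reward idea mentioned above, the following minimal sketch scores a model response only by whether its final answer matches a known reference. It is not taken from any of the surveyed projects: the function names `extract_boxed_answer` and `outcome_reward`, the `\boxed{...}` answer convention, and the exact-string comparison are all assumptions made for illustration, and real reward systems typically use more robust answer normalization.

```python
import re
from typing import Optional


def extract_boxed_answer(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the response, or None.

    Hypothetical helper: assumes the final answer is wrapped in \\boxed{}
    and contains no nested braces.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def outcome_reward(response: str, reference: str) -> float:
    """Binary outcome reward: 1.0 if the extracted answer equals the reference.

    Only the final answer is checked; the reasoning steps that precede it
    are ignored. A response with no extractable answer receives 0.0.
    """
    answer = extract_boxed_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if answer == reference.strip() else 0.0


if __name__ == "__main__":
    good = r"First, 6 * 7 = 42. The answer is \boxed{42}."
    bad = r"I think the answer is \boxed{41}."
    print(outcome_reward(good, "42"), outcome_reward(bad, "42"))
```

Because the reward depends only on a checkable final answer, it can be computed by a simple rule rather than a learned reward model, which is the property that "verifiable" refers to in this sketch.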