DeepSeek-R1 Thoughtology: Let's think about LLM Reasoning

Paper · arXiv 2504.07128 · Published April 2, 2025
Inference-Time Scaling · Reasoning Critiques

Abstract. Large Reasoning Models like DeepSeek-R1 mark a fundamental shift in how LLMs approach complex problems. Instead of directly producing an answer for a given input, DeepSeek-R1 creates detailed multi-step reasoning chains, seemingly “thinking” about a problem before providing an answer. This reasoning process is publicly available to the user, creating endless opportunities for studying the reasoning behaviour of the model and opening up the field of Thoughtology. Starting from a taxonomy of DeepSeek-R1’s basic building blocks of reasoning, our analyses on DeepSeek-R1 investigate the impact and controllability of thought length, management of long or confusing contexts, cultural and safety concerns, and the status of DeepSeek-R1 vis-à-vis cognitive phenomena, such as human-like language processing and world modelling. Our findings paint a nuanced picture. Notably, we show DeepSeek-R1 has a ‘sweet spot’ of reasoning, where extra inference time can impair model performance. Furthermore, we find a tendency for DeepSeek-R1 to persistently ruminate on previously explored problem formulations, obstructing further exploration. We also note strong safety vulnerabilities of DeepSeek-R1 compared to its non-reasoning counterpart, which can also compromise safety-aligned LLMs.

Introduction. Recent advancements in building large language models (LLMs) have shifted the focus towards developing models capable of complex multi-step reasoning (DeepSeek-AI et al., 2025a; OpenAI, 2024). While initial work on LLMs focused on eliciting reasoning using chain-of-thought (CoT) prompting (Wei et al., 2022; Zhou et al., 2023), we see a fundamental shift where reasoning is embedded into models such that they reason before they arrive at an answer. We call this class of models Large Reasoning Models (LRMs) and refer to their reasoning chains as thoughts. LRMs generate thoughts step-by-step that can accumulate progress towards a solution, self-verify, or explore alternative approaches until the model is confident about a final answer. Figure 1.1 shows a comparison of the outputs of an LLM versus an LRM. Although the output of the LLM can include some intermediate reasoning steps, there is often no exploration. Furthermore, if the model fails, it is unable to backtrack and explore alternatives.

Discussion / Conclusion. In this work, we take the first step in studying the chains of thought of DeepSeek-R1. We introduce a new taxonomy to describe LRM reasoning chains, and then use this taxonomy to identify key strengths and weaknesses of DeepSeek-R1 across various tasks. Our analyses focus on the effects and controllability of thought length (Sections 4 and 11); model behavior in long or confusing contexts (Sections 5 and 6); LRM cultural and safety concerns (Sections 7 and 8); and the status of LRMs vis-à-vis cognitive phenomena (Sections 9 and 10). Through our analyses, several key patterns emerge, which we highlight below.

Reasoning behaviours. We show in Section 3 that, across a wide range of tasks, DeepSeek-R1 exhibits a consistent pattern in its reasoning process where, after briefly defining a goal (‘Problem Definition’), it lets a problem ‘Bloom’ by decomposing the given problem into subcomponents which it immediately solves.