Diagnostic Reasoning Prompts Reveal the Potential for Large Language Model Interpretability in Medicine
One of the major barriers to using large language models (LLMs) in medicine is the perception that they use uninterpretable methods to make clinical decisions, methods that are inherently different from the cognitive processes of clinicians. In this manuscript, we develop novel diagnostic reasoning prompts to study whether LLMs can perform clinical reasoning to accurately form a diagnosis. We find that GPT-4 can be prompted to mimic the common clinical reasoning processes of clinicians without sacrificing diagnostic accuracy. This is significant because an LLM that can use clinical reasoning to provide an interpretable rationale offers physicians a means to evaluate whether LLMs can be trusted for patient care. Novel prompting methods have the potential to expose the “black box” of LLMs, bringing them one step closer to safe and effective use in medicine.
Introduction. Large Language Models (LLMs) have received widespread attention for their human-like performance on a wide variety of text-based tasks. Within medicine, initial efforts have demonstrated that LLMs can write clinical notes [1], pass standardized medical exams [2], and draft responses to patient questions [3,4]. In order to integrate LLMs more directly into clinical care, it is imperative to better understand their clinical reasoning capabilities. Clinical reasoning is a set of problem-solving processes specifically designed for diagnosis and management of a patient’s medical condition. Commonly used diagnostic techniques include differential diagnosis formation, intuitive reasoning, analytical reasoning, and Bayesian inference.
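To make one of the diagnostic techniques listed above concrete, the following is a minimal sketch of Bayesian inference as it is commonly taught for diagnosis: a pre-test probability is converted to odds, multiplied by the likelihood ratio of a test result, and converted back to a post-test probability. The function names and all numeric values are hypothetical and chosen only for illustration; they are not taken from this study, its prompts, or its dataset.

```python
def positive_likelihood_ratio(sensitivity: float, specificity: float) -> float:
    """Likelihood ratio of a positive test: P(test+ | disease) / P(test+ | no disease)."""
    if not (0.0 < sensitivity <= 1.0 and 0.0 <= specificity < 1.0):
        raise ValueError("sensitivity must be in (0, 1] and specificity in [0, 1)")
    return sensitivity / (1.0 - specificity)


def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Update a disease probability with one test result using Bayes' rule in odds form."""
    if not 0.0 < pre_test_probability < 1.0:
        raise ValueError("pre_test_probability must be strictly between 0 and 1")
    if likelihood_ratio <= 0.0:
        raise ValueError("likelihood_ratio must be positive")
    pre_test_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1.0 + post_test_odds)


# Hypothetical illustration values (assumptions, not study data):
# a 20% pre-test probability and a test with 90% sensitivity and 80% specificity.
lr_positive = positive_likelihood_ratio(sensitivity=0.90, specificity=0.80)
updated = post_test_probability(pre_test_probability=0.20, likelihood_ratio=lr_positive)
print(f"LR+ = {lr_positive:.2f}, post-test probability = {updated:.2f}")
```

With these illustrative inputs, a positive result raises the estimated probability of disease from 0.20 to roughly 0.53. A clinician (or a prompted model) reasoning in this style makes each update explicit, which is what makes the reasoning open to inspection.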
Discussion / Conclusion. Our investigation has two main strengths. The first is a novel design that leverages chain-of-thought (CoT) prompting to gain insight into LLM clinical reasoning capabilities. The second is the use of free response clinical case questions, whereas previous studies have been limited to multiple-choice questions or simple open-ended fact retrieval that do not challenge LLM clinical reasoning abilities. We designed our evaluation with free response questions, including questions from the USMLE, to facilitate rigorous comparison between prompting strategies. A limitation of our study is that, while our prompt engineering process surveyed a wide range of prompt styles, we were not able to test all possible diagnostic reasoning CoT prompts. We hope that future studies can iterate on our diagnostic reasoning prompts and use our open dataset as a benchmark for additional evaluation.
Figure 3. A) Current LLM workflow. B) Proposed LLM workflow.