Abstract
The breakthrough of OpenAI o1 highlights the potential of enhancing reasoning to improve LLMs. Yet most research in reasoning has focused on mathematical tasks, leaving domains like medicine underexplored. The medical domain, though distinct from mathematics, also demands robust reasoning to provide reliable answers, given the high standards of healthcare. However, verifying medical reasoning is challenging, unlike that in mathematics. To address this, we propose verifiable medical problems with a medical verifier to check the correctness of model outputs. This verifiable nature enables advancements in medical reasoning through a two-stage approach: (1) using the verifier to guide the search for complex reasoning trajectories for fine-tuning LLMs, and (2) applying reinforcement learning (RL) with verifier-based rewards to enhance complex reasoning further. Finally, we introduce HuatuoGPT-o1, a medical LLM capable of complex reasoning, which outperforms general and medical-specific baselines using only 40K verifiable problems. Experiments show that complex reasoning improves medical problem-solving and benefits more from RL. We hope our approach inspires advancements in reasoning across medical and other specialized domains.
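For readers curious how the verifier-based reward in stage (2) might look in practice, here is a minimal, illustrative sketch. It assumes the verifier is an external judge (an LLM-as-judge or rule-based checker) that compares the model's extracted final answer against a reference answer; the function names, the answer-extraction marker, and the sparse 0/1 reward are assumptions for illustration, not the paper's exact implementation.

```python
# Illustrative sketch only: a sparse verifier-based reward for RL fine-tuning.
# `judge` is an assumed external checker (LLM-as-judge or rule-based) that
# returns True when the extracted final answer matches the reference.

def extract_final_answer(model_output: str, marker: str = "## Final Response") -> str:
    """Take the text after the reasoning trace; the marker is an assumed output format."""
    return model_output.split(marker)[-1].strip()

def verifier_reward(question: str, reference: str, model_output: str, judge) -> float:
    """Return 1.0 if the verifier accepts the model's final answer, else 0.0."""
    answer = extract_final_answer(model_output)
    return 1.0 if judge(question, reference, answer) else 0.0
```

In an RL loop (e.g., PPO-style fine-tuning), this scalar would be assigned to each sampled completion in place of a learned reward model.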
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning (2024)
- Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning (2024)
- Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions (2024)
- AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning (2024)
- From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond (2024)
- Improving Physics Reasoning in Large Language Models Using Mixture of Refinement Agents (2024)
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization (2024)
Basing the "verifiable" problems on existing benchmarks (MedQA, MedMCQA) that have been shown to be incomplete, incorrect, and of bad quality sadly limits the conclusions of this work.
I am very surprised to see yet another paper on medical reasoning without the involvement of medical doctors. So I went and looked at the dataset. Most questions are about information recall, not reasoning. About 30% of the questions I looked at contained the multiple-choice options within the question text, which greatly reduces the search space during reasoning. I also found questions on advanced mathematics that are not medical at all. For some odd reason, the "reasoning" steps use childish vocabulary.
The paper does not address the issue of improper reasoning that still reaches correct conclusions, which has been shown in https://pmc.ncbi.nlm.nih.gov/articles/PMC11266508/
The idea is interesting, but the execution is severely limited by the lack of quality control and the absence of medical experts to assess the benchmarks and the reasoning produced. I would have liked to see an agreement metric between the verifier model and medical doctors. Having medical doctors rate a random sample of the reasoning on Likert scales along multiple dimensions (correctness, efficiency...) would also give a clearer picture of the quality of the reasoning produced.
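Concretely, the agreement check I have in mind could be as simple as the sketch below: have the verifier and a physician label the same random sample of model outputs as correct or incorrect and report chance-corrected agreement. The labels here are placeholders for illustration only.

```python
# Illustrative sketch: chance-corrected agreement between the verifier and a
# physician on the same randomly sampled outputs (placeholder labels).
from sklearn.metrics import cohen_kappa_score

verifier_labels = [1, 1, 0, 1, 0, 1, 1, 0]  # verifier verdicts: 1 = correct, 0 = incorrect
doctor_labels   = [1, 0, 0, 1, 0, 1, 1, 1]  # physician verdicts on the same items

kappa = cohen_kappa_score(verifier_labels, doctor_labels)
raw = sum(v == d for v, d in zip(verifier_labels, doctor_labels)) / len(doctor_labels)
print(f"Cohen's kappa: {kappa:.2f}, raw agreement: {raw:.0%}")
```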
Finally, testing on tasks that align more closely with real medical reasoning, such as AgentClinic or clinical vignettes from the NEJM, would help determine whether the work is limited to the MultiMedQA benchmarks or can generalize to other tasks.
Wow, after reading the paper and running the model myself, I'm actually very surprised by your negative review. Your review exhibits a profound lack of understanding of the fundamental principles of AI reasoning, medical domain expertise, and the complexities of verifiable medical problems.
Your assertion that the dataset is 'incomplete, incorrect, and of bad quality' is a gross misrepresentation of the authors' methodology and the rigor with which they curated the dataset. Have you actually tried evaluating or benchmarking the model yourself, or is this just you publicly venting?
Your claim that most questions in the dataset are about information recall, rather than reasoning, is a misinterpretation of the authors' design choices. If you properly review the paper, you'll see that the dataset was intentionally constructed to test the model's ability to generalize and apply complex reasoning to novel, unseen scenarios, which is a hallmark of human-like intelligence. The presence of multiple-choice options and simplified language in some questions is actually a deliberate design choice to simulate the way humans reason in everyday situations, where language is often simplified and context-dependent. Pretending you don't reason in a similar fashion is simply arrogance.
Your criticism of the model's use of 'childish vocabulary' shows your lack of understanding in this context, as it is a well-documented phenomenon in natural language processing (NLP) that humans and AI models alike often rely on simpler language to convey complex ideas. This is particularly true in the medical domain, where technical terms and jargon can be opaque to non-experts and throw off the model itself.
Not to mention, your citation of a different study on improper reasoning in medicine is completely irrelevant and borderline misleading. The authors' approach is specifically designed to address the challenges of verifying medical reasoning, and their results demonstrate the effectiveness of their method in leveraging verifiable medical problems to enhance complex reasoning. Your failure to recognize this makes me wonder whether you have your own gaps in understanding the nuances of medical reasoning and the limitations of existing approaches. According to your username you're an MD; you should know these things.
Your review is a prime example of an attack on a researcher's work, motivated by a personal agenda rather than a genuine interest in improving the field of medical reasoning. It is not only unprofessional and lacking in evidence; I would also encourage you to revisit your understanding of AI reasoning, medical domain expertise, and the complexities of verifiable medical problems before engaging in such a misinformed and misleading critique.