Asking the Right Questions: Benchmarking Large Language Models in the Development of Clinical Consultation Templates
A recent study by Stanford's eConsult team assessed how effectively large language models (LLMs) generate structured clinical consultation templates. Benchmarking against 145 expert-crafted templates, the researchers found that models such as o3 and GPT-4o achieved comprehensiveness scores as high as 92.2%. However, these models frequently produced excessively long templates and failed to prioritize the most clinically significant questions, particularly in narrative-driven specialties such as psychiatry and pain medicine. The results suggest that while LLMs could enhance structured clinical information exchange between physicians, more robust evaluation methods are needed to ensure these models surface clinically salient information first. The findings underscore the importance of refining LLM capabilities before such tools can reliably serve the healthcare sector.
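To make the benchmarking idea concrete, the sketch below shows one plausible way a comprehensiveness score could be computed: the fraction of questions in an expert-crafted template that are matched by some question in an LLM-generated template. The study's actual metric and matching method are not described in this summary; the token-overlap matching, the `threshold` parameter, and all function names here are illustrative assumptions, not the study's methodology.

```python
import re

# Hypothetical sketch of a template "comprehensiveness" metric: what fraction
# of the expert template's questions does the generated template cover?
# Matching here is crude token overlap; a real evaluation would likely use
# expert judgment or semantic similarity instead.

def _tokens(text: str) -> set[str]:
    """Lowercased word tokens, stripped of punctuation, for rough matching."""
    return set(re.findall(r"[a-z']+", text.lower()))

def question_covered(expert_q: str, generated_qs: list[str],
                     threshold: float = 0.5) -> bool:
    """True if any generated question shares enough tokens with the expert one."""
    expert = _tokens(expert_q)
    for gen_q in generated_qs:
        overlap = len(expert & _tokens(gen_q)) / max(len(expert), 1)
        if overlap >= threshold:
            return True
    return False

def comprehensiveness(expert_qs: list[str], generated_qs: list[str]) -> float:
    """Fraction of expert-template questions matched by the generated template."""
    if not expert_qs:
        return 0.0
    covered = sum(question_covered(q, generated_qs) for q in expert_qs)
    return covered / len(expert_qs)

# Toy example: the generated template covers 2 of 3 expert questions (~66.7%).
expert = [
    "What is the duration of symptoms?",
    "Has the patient tried any medications?",
    "Are there red-flag neurological findings?",
]
generated = [
    "How long have the symptoms been present, i.e. duration of symptoms?",
    "List any medications the patient has tried.",
]
print(f"Comprehensiveness: {comprehensiveness(expert, generated):.1%}")
```

Note that a metric like this rewards coverage alone: a model can score highly by generating many questions, which is consistent with the study's observation that high-comprehensiveness templates were often excessively long and poorly prioritized.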
— via World Pulse Now AI Editorial System

