Why Chain of Thought Fails in Clinical Text Understanding

arXiv — cs.CL · Tuesday, December 9, 2025 at 5:00:00 AM
  • A systematic study has found that chain-of-thought (CoT) prompting, often used to enhance reasoning in large language models (LLMs), fails to improve performance in clinical text understanding. The research assessed 95 advanced LLMs across 87 real-world clinical tasks and found that 86.3% of models experienced performance degradation under CoT prompting, particularly on electronic health records that are lengthy and fragmented (a minimal sketch of the direct versus CoT prompting conditions follows this summary).
  • This finding is significant because it raises concerns about the reliability of LLMs in clinical settings, where accurate and transparent reasoning is crucial for patient safety. The degradation suggests that current CoT-style prompting may not be suited to the complexity of clinical documentation, which could affect how AI is deployed in healthcare.
  • The challenges faced by LLMs in clinical contexts echo broader issues in AI, such as the inconsistencies in belief updating and action alignment, as well as the limitations of hierarchical instruction schemes. These recurring themes highlight the need for improved frameworks and methodologies to enhance the effectiveness of AI in specialized fields like healthcare.
— via World Pulse Now AI Editorial System
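
To make the comparison concrete, here is a minimal sketch of the two prompting conditions the study contrasts. The clinical note, the question, and the prompt templates below are illustrative assumptions, not the paper's actual protocol.

```python
# Minimal sketch: direct vs. chain-of-thought (CoT) prompting for a clinical
# question-answering task. Everything here (note, question, templates) is an
# illustrative assumption, not the study's actual setup.

NOTE = (
    "Discharge summary: Patient admitted with chest pain. Troponin elevated. "
    "Started on aspirin and atorvastatin. Echo showed EF 35%."
)
QUESTION = "Does the note document a reduced ejection fraction? Answer Yes or No."


def direct_prompt(note: str, question: str) -> str:
    """Ask for the answer directly, with no intermediate reasoning."""
    return f"{note}\n\n{question}\nAnswer:"


def cot_prompt(note: str, question: str) -> str:
    """Ask the model to reason step by step before answering (the CoT setting)."""
    return (
        f"{note}\n\n{question}\n"
        "Let's think step by step, then give a final Yes/No answer."
    )


if __name__ == "__main__":
    for name, build in [("direct", direct_prompt), ("CoT", cot_prompt)]:
        print(f"--- {name} ---")
        print(build(NOTE, QUESTION))
        print()
```
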

Continue Reading
QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models
Positive · Artificial Intelligence
QSTN has been introduced as an open-source Python framework designed to generate responses from questionnaire-style prompts, facilitating in-silico surveys and annotation tasks with large language models (LLMs). The framework allows for robust evaluation of questionnaire presentation and response generation methods, based on an extensive analysis of over 40 million survey responses.
Balanced Accuracy: The Right Metric for Evaluating LLM Judges - Explained through Youden's J statistic
Neutral · Artificial Intelligence
The evaluation of large language models (LLMs) is increasingly reliant on classifiers, either LLMs or human annotators, to assess desirable or undesirable behaviors. A recent study highlights that traditional metrics like Accuracy and F1 can be misleading due to class imbalances, advocating for the use of Youden's J statistic and Balanced Accuracy as more reliable alternatives for selecting evaluators.
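
Both quantities reduce to sensitivity and specificity, which is why class imbalance does not distort them the way raw Accuracy can be. The sketch below is a toy illustration under assumed labels, using scikit-learn only for the confusion matrix; the arithmetic is the whole point.

```python
# Balanced Accuracy and Youden's J from a binary confusion matrix.
# The toy labels below are illustrative placeholders for a judge's outputs.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]   # imbalanced: few positives
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]   # the evaluator's predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)              # true positive rate
specificity = tn / (tn + fp)              # true negative rate

balanced_accuracy = (sensitivity + specificity) / 2
youdens_j = sensitivity + specificity - 1  # equals 2 * balanced_accuracy - 1

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
print(f"balanced accuracy={balanced_accuracy:.2f}  Youden's J={youdens_j:.2f}")
```
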
Representational Stability of Truth in Large Language Models
Neutral · Artificial Intelligence
Large language models (LLMs) are increasingly utilized for factual inquiries, yet their internal representations of truth remain inadequately understood. A recent study introduces the concept of representational stability, assessing how robustly LLMs differentiate between true, false, and ambiguous statements through controlled experiments involving linear probes and model activations.
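
As a rough illustration of that probing setup, the sketch below fits a linear classifier over activation vectors labeled true, false, or ambiguous. The activations are random placeholders (an assumption made to keep the example self-contained); in the actual experiments they would be hidden states extracted from the LLM for each statement.

```python
# Minimal sketch of a linear probe over model activations for a three-way
# true / false / ambiguous distinction. Random vectors stand in for the
# hidden states an LLM would produce for each statement.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_statements, hidden_dim = 300, 64

activations = rng.normal(size=(n_statements, hidden_dim))  # stand-in hidden states
labels = rng.integers(0, 3, size=n_statements)             # 0=true, 1=false, 2=ambiguous

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.3, random_state=0
)

probe = LogisticRegression(max_iter=1000)   # a linear probe: no hidden layers
probe.fit(X_train, y_train)
print(f"probe accuracy on held-out statements: {probe.score(X_test, y_test):.2f}")
```
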
When Many-Shot Prompting Fails: An Empirical Study of LLM Code Translation
Neutral · Artificial Intelligence
A recent empirical study on Large Language Models (LLMs) has revealed that the effectiveness of many-shot prompting for code translation may be overstated. Analyzing over 90,000 translations, researchers found that while more examples can improve static similarity metrics, functional correctness peaks with fewer examples, indicating a 'many-shot paradox'.
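
For readers unfamiliar with the setup, the sketch below shows how k-shot prompts for code translation are typically assembled as k varies. The language pair, example pairs, and snippet are assumptions for illustration, and the study sweeps far larger shot counts than shown here.

```python
# Sketch of building k-shot prompts for code translation with varying k.
# The Python-to-JavaScript pairs and the source snippet are placeholders.
EXAMPLES = [
    ("print('hi')", 'console.log("hi");'),
    ("x = [i*i for i in range(5)]",
     "const x = Array.from({length: 5}, (_, i) => i * i);"),
    ("def add(a, b):\n    return a + b",
     "function add(a, b) {\n  return a + b;\n}"),
]


def k_shot_prompt(source: str, k: int) -> str:
    """Assemble a translation prompt with k in-context examples."""
    shots = "\n\n".join(
        f"Python:\n{py}\nJavaScript:\n{js}" for py, js in EXAMPLES[:k]
    )
    return f"{shots}\n\nPython:\n{source}\nJavaScript:\n"


if __name__ == "__main__":
    snippet = "y = sorted(names, key=len)"
    for k in (0, 1, 3):   # real studies use much larger k; this shows the mechanics
        print(f"===== k={k} =====")
        print(k_shot_prompt(snippet, k))
```
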
Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in LLMs
Neutral · Artificial Intelligence
Large language models (LLMs) express values through two mechanisms: intrinsic expression, grounded in values acquired during training, and prompted expression, driven by explicit instructions. The study analyzes both at the level of internal model components, finding that the two mechanisms share some components while relying on others that are distinct.
A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs
Positive · Artificial Intelligence
A recent study has introduced a systematic evaluation framework for aligning large language models (LLMs) with diverse human preferences in federated learning environments. This framework assesses the trade-off between alignment quality and fairness using various aggregation strategies for human preferences, including a novel adaptive scheme that adjusts preference weights based on historical performance.
Can AI Truly Represent Your Voice in Deliberations? A Comprehensive Study of Large-Scale Opinion Aggregation with LLMs
Neutral · Artificial Intelligence
A comprehensive study has been conducted on the use of large language models (LLMs) for synthesizing public deliberations into neutral summaries. The research highlights the potential of LLMs to generate such summaries while also examining concerns about how well they represent minority perspectives and about biases tied to the order of inputs. The study introduces DeliberationBank, a dataset built from contributions by 3,000 participants, for evaluating LLM performance on these summarization tasks.
LLMs Can't Handle Peer Pressure: Crumbling under Multi-Agent Social Interactions
Neutral · Artificial Intelligence
Large language models (LLMs) are increasingly being integrated into multi-agent systems (MAS), where peer interactions significantly influence decision-making. A recent study introduces KAIROS, a benchmark designed to simulate collaborative quiz-style interactions among peer agents, allowing for a detailed analysis of how rapport and peer behaviors affect LLMs' decision-making processes.