The Alignment Paradox of Medical Large Language Models in Infertility Care: Decoupling Algorithmic Improvement from Clinical Decision-making Quality

arXiv — cs.LG · Tuesday, November 25, 2025 at 5:00:00 AM
  • A recent study evaluated the alignment of large language models (LLMs) in infertility care, assessing four strategies: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), Group Relative Policy Optimization (GRPO), and In-Context Learning (ICL); a sketch of one such objective appears after this summary. The findings revealed that GRPO achieved the highest algorithmic accuracy, while clinicians preferred SFT for its clearer reasoning and therapeutic feasibility.
  • This development is significant as it highlights the ongoing challenge of integrating advanced AI models into clinical decision-making, particularly in sensitive areas like infertility care. The preference for SFT by clinicians underscores the importance of interpretability and practical applicability in medical AI.
  • The findings reflect broader discussions in the AI field regarding the balance between algorithmic performance and human-centered design. Issues such as hallucination mitigation, bias in model outputs, and the need for diverse reasoning capabilities are critical as LLMs are increasingly utilized in healthcare settings.
— via World Pulse Now AI Editorial System
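
For readers unfamiliar with the alignment strategies compared above, the following is a minimal sketch of the Direct Preference Optimization (DPO) objective on a single preference pair. It uses the standard published DPO loss rather than the study's actual training setup; the log-probability values and the beta setting are placeholders.

```python
# Minimal DPO loss on one preference pair (standard published objective,
# not the study's training code). Inputs are summed log-probabilities of a
# clinician-preferred ("chosen") and a dispreferred ("rejected") response
# under the policy and a frozen reference model; the numbers are placeholders.
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * margin), i.e. softplus(-beta * margin)."""
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-beta * margin))

print(dpo_loss(-12.3, -15.8, -13.0, -14.9, beta=0.1))  # ~0.62
```

The loss shrinks as the policy widens the log-probability margin of the preferred response over the dispreferred one, relative to a frozen reference model.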


Continue Reading
Using tournaments to calculate AUROC for zero-shot classification with LLMs
Positive · Artificial Intelligence
A recent study has introduced a novel method for evaluating large language models (LLMs) in zero-shot classification tasks by transforming binary classifications into pairwise comparisons. This approach utilizes the Elo rating system to rank instances, thereby enhancing classification performance and providing more informative results than traditional methods.
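
As a concrete illustration of the tournament idea, the sketch below ranks instances by Elo ratings derived from pairwise comparisons and then scores the ranking with AUROC. The pairwise judge, the K-factor of 32, and the number of rounds are illustrative assumptions standing in for the LLM judge and whatever schedule the paper actually uses.

```python
# Elo-based tournament ranking for zero-shot binary classification (sketch).
# A synthetic "judge" stands in for an LLM asked which of two instances is
# more likely to be positive; the standard Elo update then turns its pairwise
# verdicts into per-instance scores that can be evaluated with AUROC.
from itertools import combinations

def expected(r_a, r_b):
    """Expected score of player a against player b under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def run_tournament(n, judge, k=32, rounds=3):
    ratings = [1000.0] * n
    for _ in range(rounds):
        for i, j in combinations(range(n), 2):
            s = judge(i, j)  # 1.0 if i is judged more positive, 0.5 for a tie, else 0.0
            e = expected(ratings[i], ratings[j])
            ratings[i] += k * (s - e)
            ratings[j] += k * ((1.0 - s) - (1.0 - e))
    return ratings

def auroc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 0, 1, 0, 1, 0]  # synthetic ground truth

def judge(i, j):  # a perfect stand-in judge, so the printed AUROC is 1.0
    return 1.0 if labels[i] > labels[j] else (0.5 if labels[i] == labels[j] else 0.0)

print(auroc(run_tournament(len(labels), judge), labels))
```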
$A^3$: Attention-Aware Accurate KV Cache Fusion for Fast Large Language Model Serving
Positive · Artificial Intelligence
A new study introduces $A^3$, an attention-aware method designed to enhance the efficiency of large language models (LLMs) by improving key-value (KV) cache fusion. This advancement aims to reduce decoding latency and memory overhead, addressing significant challenges faced in real-world applications of LLMs, particularly in processing long textual inputs.
Fairness in Multi-modal Medical Diagnosis with Demonstration Selection
Positive · Artificial Intelligence
Recent advances in multimodal large language models (MLLMs) have sharpened concerns about fairness in medical image reasoning, motivating the introduction of Fairness-Aware Demonstration Selection (FADS). The method aims to mitigate demographic imbalances by using clustering-based sampling to build balanced and relevant demonstration sets.
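
The summary does not spell out the FADS procedure, so the sketch below only illustrates the general pattern it names: cluster demonstration embeddings, then draw demonstrations evenly across demographic groups within each cluster. The embeddings, the group attribute, and the KMeans settings are all illustrative stand-ins rather than the paper's method.

```python
# Clustering-based, demographically balanced demonstration selection (sketch).
# Synthetic embeddings and a synthetic binary group attribute stand in for
# real demonstration features and patient demographics.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 16))      # stand-in demonstration embeddings
groups = rng.choice(["A", "B"], size=200)    # stand-in demographic attribute

def balanced_demo_selection(embeddings, groups, n_clusters=4, per_group=1):
    """Pick up to `per_group` demonstrations per demographic group per cluster."""
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10,
                         random_state=0).fit_predict(embeddings)
    selected = []
    for c in range(n_clusters):
        for g in np.unique(groups):
            idx = np.where((cluster_ids == c) & (groups == g))[0]
            selected.extend(idx[:per_group].tolist())
    return selected

print(balanced_demo_selection(embeddings, groups))
```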
Systematic Reward Gap Optimization for Mitigating VLM Hallucinations
Positive · Artificial Intelligence
A novel framework called Topic-level Preference Rewriting (TPR) has been introduced to systematically optimize reward gaps in Vision Language Models (VLMs), addressing the challenges of hallucinations during data curation. This method focuses on selectively replacing semantic topics within VLM responses to enhance the accuracy of generated outputs.
Don't Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models
Neutral · Artificial Intelligence
Recent evaluations of large language models (LLMs) have highlighted their vulnerability to flawed premises, which can lead to inefficient reasoning and unreliable outputs. The introduction of the Premise Critique Bench (PCBench) aims to assess the Premise Critique Ability of LLMs, focusing on their capacity to identify and articulate errors in input premises across various difficulty levels.
Point of Order: Action-Aware LLM Persona Modeling for Realistic Civic Simulation
Positive · Artificial Intelligence
A new study introduces a reproducible pipeline for transforming public Zoom recordings into speaker-attributed transcripts, enhancing the realism of civic simulations using large language models (LLMs). This approach includes metadata such as persona profiles and pragmatic action tags, which significantly improve the models' performance in simulating multi-party deliberation.
Representational Stability of Truth in Large Language Models
Neutral · Artificial Intelligence
Recent research has introduced the concept of representational stability in large language models (LLMs), focusing on how these models encode distinctions between true, false, and neither-true-nor-false content. The study assesses this stability by training a linear probe on LLM activations to differentiate true from not-true statements and measuring shifts in decision boundaries under label changes.
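
A minimal version of that probing setup might look like the sketch below: fit a logistic-regression probe on activation vectors to separate true from not-true statements, then refit after flipping a fraction of labels and measure how far the decision boundary moves. The activations and labels are synthetic, and the angle-based shift metric is an illustrative choice rather than the paper's exact measure of representational stability.

```python
# Linear probe on (synthetic) LLM activations, plus a simple boundary-shift
# measurement under label changes. Everything here is a stand-in: real use
# would take hidden states for labeled true / not-true statements.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
acts = rng.normal(size=(500, 64))  # stand-in hidden-state activations
labels = (acts[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)  # 1 = "true"

def probe_direction(X, y):
    """Unit normal of the linear probe's decision boundary."""
    w = LogisticRegression(max_iter=1000).fit(X, y).coef_[0]
    return w / np.linalg.norm(w)

w_orig = probe_direction(acts, labels)

# Flip 10% of the labels and see how much the probe direction rotates.
flipped = labels.copy()
flip_idx = rng.choice(len(labels), size=50, replace=False)
flipped[flip_idx] = 1 - flipped[flip_idx]
w_flip = probe_direction(acts, flipped)

angle = np.degrees(np.arccos(np.clip(w_orig @ w_flip, -1.0, 1.0)))
print(f"boundary shift under label flips: {angle:.1f} degrees")
```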
Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation
Neutral · Artificial Intelligence
A recent study has highlighted the issue of over-refusal in large language models (LLMs), which occurs when these models excessively decline to generate outputs due to safety concerns. The research proposes a new approach called MOSR, which aims to balance safety and usability by addressing the representation of safety in LLMs.