MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology

arXiv — cs.LG · Wednesday, November 26, 2025 at 5:00:00 AM
  • MTBBench has been introduced as a benchmark that simulates decision-making in Molecular Tumor Boards (MTBs), addressing the limitations of existing evaluations that focus on unimodal question-answering. It comprises multimodal and longitudinal oncology questions, validated by clinicians through a co-developed application (a minimal sketch of what such a benchmark item could look like appears after this summary).
  • The development of MTBBench is significant because it aims to improve the reliability of Multimodal Large Language Models (MLLMs) in clinical settings, particularly in oncology, where integrating diverse data and expert insights is crucial for accurate diagnostics and prognostics.
  • This initiative reflects a growing recognition of the need for more sophisticated evaluation frameworks in AI, particularly for applications in healthcare. As the field of multimodal AI evolves, benchmarks like MTBBench are essential for addressing complex real-world scenarios, ensuring that LLMs can effectively support clinical decision-making processes.
— via World Pulse Now AI Editorial System
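To make the kind of item such a benchmark contains concrete, here is a minimal Python sketch of a multimodal, longitudinal tumor-board case and an exact-match evaluation loop. The `MTBItem` schema, its field names, and the scoring rule are illustrative assumptions; the summary above does not specify MTBBench's actual format or metrics.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class MTBItem:
    """Hypothetical schema for one longitudinal tumor-board case."""
    case_id: str
    timepoints: List[dict] = field(default_factory=list)  # per-visit imaging, pathology, genomics
    question: str = ""
    answer: str = ""

def evaluate(model: Callable[[List[dict], str], str], items: List[MTBItem]) -> float:
    """Score a model by exact-match accuracy over a list of cases."""
    correct = 0
    for item in items:
        # The model sees the full visit history before answering,
        # mirroring the longitudinal setting described above.
        prediction = model(item.timepoints, item.question)
        correct += int(prediction.strip() == item.answer.strip())
    return correct / max(len(items), 1)
```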

Continue Reading
VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning
Positive · Artificial Intelligence
The introduction of VideoChat-M1 represents a significant advancement in video understanding through a novel multi-agent system that employs Collaborative Policy Planning (CPP). This system allows multiple agents to generate, execute, and communicate unique tool invocation policies tailored to user queries, enhancing the exploration of complex video content.
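As a rough illustration of collaborative policy planning, the sketch below has several agents each execute a distinct tool-invocation policy on the same clip and pool the evidence. The `TOOLS` registry, the policies, and the pooling step are invented for illustration; the paper's learned policies and inter-agent communication are not reproduced here.

```python
from typing import List

# Hypothetical tool registry; stand-ins for real video tools.
TOOLS = {
    "caption": lambda clip: f"caption({clip})",
    "detect": lambda clip: f"detect({clip})",
    "ocr": lambda clip: f"ocr({clip})",
}

def run_agent(policy: List[str], clip: str) -> List[str]:
    """Execute one agent's tool-invocation policy on a video clip."""
    return [TOOLS[name](clip) for name in policy if name in TOOLS]

def collaborative_plan(clip: str, policies: List[List[str]]) -> List[str]:
    """Pool evidence from agents that each follow their own policy."""
    evidence: List[str] = []
    for policy in policies:
        evidence.extend(run_agent(policy, clip))
    return evidence

# Two agents exploring the same clip with different policies.
print(collaborative_plan("clip.mp4", [["caption"], ["detect", "ocr"]]))
```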
Vision-Language Models for Automated 3D PET/CT Report Generation
Positive · Artificial Intelligence
A new framework named PETRG-3D has been proposed for automated 3D PET/CT report generation, addressing the growing need for efficient reporting in oncology due to a shortage of trained specialists. This model utilizes a dual-branch architecture to separately encode PET and CT volumes while incorporating style-adaptive prompts to standardize reporting across different hospitals.
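A dual-branch design of this kind can be sketched in a few lines of PyTorch: separate 3D stems for PET and CT, plus a learned per-hospital "style prompt" added to the fused features. All layer sizes and the prompt mechanism here are assumptions for illustration, not PETRG-3D's actual architecture.

```python
import torch
import torch.nn as nn

class DualBranchEncoder(nn.Module):
    """Sketch: separate 3D conv stems for PET and CT, fused by
    concatenation, with a per-hospital style prompt (assumed design)."""
    def __init__(self, dim: int = 64, num_hospitals: int = 8):
        super().__init__()
        self.pet_stem = nn.Conv3d(1, dim, kernel_size=3, padding=1)
        self.ct_stem = nn.Conv3d(1, dim, kernel_size=3, padding=1)
        # One learned "style prompt" per hospital to standardize reports.
        self.style_prompts = nn.Embedding(num_hospitals, 2 * dim)

    def forward(self, pet, ct, hospital_id):
        feats = torch.cat([self.pet_stem(pet), self.ct_stem(ct)], dim=1)
        pooled = feats.mean(dim=(2, 3, 4))  # (batch, 2 * dim)
        return pooled + self.style_prompts(hospital_id)

# Toy usage: batch of 1, 16^3-voxel PET and CT volumes.
enc = DualBranchEncoder()
out = enc(torch.randn(1, 1, 16, 16, 16), torch.randn(1, 1, 16, 16, 16), torch.tensor([0]))
print(out.shape)  # torch.Size([1, 128])
```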
Multi-speaker Attention Alignment for Multimodal Social Interaction
Positive · Artificial Intelligence
A new method for enhancing social interaction understanding in videos has been proposed, focusing on the alignment of verbal and non-verbal cues in multi-speaker scenarios. This approach addresses the limitations observed in existing Multimodal Large Language Models (MLLMs), which struggle with cross-modal attention consistency in such contexts.
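One simple way to encourage cross-modal attention consistency, in the spirit of this work, is to penalize divergence between the model's textual and visual attention distributions over the speakers in a scene. The KL formulation below is an assumed stand-in, not the paper's actual objective.

```python
import torch
import torch.nn.functional as F

def attention_alignment_loss(text_attn: torch.Tensor, visual_attn: torch.Tensor) -> torch.Tensor:
    """KL divergence pushing textual speaker attention toward visual
    speaker attention; both inputs are (batch, num_speakers) distributions."""
    return F.kl_div(text_attn.log(), visual_attn, reduction="batchmean")

# Toy usage: 3 utterances, 4 speakers in the scene.
text_attn = torch.softmax(torch.randn(3, 4), dim=-1)
visual_attn = torch.softmax(torch.randn(3, 4), dim=-1)
print(attention_alignment_loss(text_attn, visual_attn))
```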
Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning
Positive · Artificial Intelligence
A new algorithm named MM-Det++ has been proposed to enhance the detection of videos generated by diffusion models, addressing the growing concerns over synthetic media and information security. This algorithm integrates a Spatio-Temporal branch utilizing a Frame-Centric Vision Transformer and a Multimodal branch for improved detection capabilities.
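The high-level two-branch layout can be sketched as late fusion of branch logits. The linear heads below are stand-ins for the Frame-Centric Vision Transformer and the multimodal encoder, whose details the summary does not give; the feature dimensions and the averaging rule are assumptions.

```python
import torch
import torch.nn as nn

class TwoBranchDetector(nn.Module):
    """Sketch of late fusion between a spatio-temporal branch and a
    multimodal branch; both encoders are stubbed with linear heads."""
    def __init__(self, st_dim: int = 256, mm_dim: int = 256):
        super().__init__()
        self.st_head = nn.Linear(st_dim, 1)  # stands in for the ST branch
        self.mm_head = nn.Linear(mm_dim, 1)  # stands in for the MM branch

    def forward(self, st_feat, mm_feat):
        # Average the branch logits to score "diffusion-generated".
        return torch.sigmoid(0.5 * (self.st_head(st_feat) + self.mm_head(mm_feat)))

# Toy usage: a batch of 4 pre-extracted feature vectors per branch.
det = TwoBranchDetector()
prob = det(torch.randn(4, 256), torch.randn(4, 256))
print(prob.shape)  # torch.Size([4, 1])
```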
ChineseVideoBench: Benchmarking Multi-modal Large Models for Chinese Video Question Answering
Positive · Artificial Intelligence
The introduction of ChineseVideoBench marks a significant advancement in the evaluation of Multimodal Large Language Models (MLLMs) specifically for Chinese Video Question Answering. This benchmark provides a comprehensive dataset and tailored metrics, addressing the need for culturally-aware evaluation frameworks in video analysis.
Multimodal Continual Learning with MLLMs from Multi-scenario Perspectives
Positive · Artificial Intelligence
A new study has introduced a multimodal visual understanding dataset (MSVQA) aimed at addressing catastrophic forgetting in Multimodal Large Language Models (MLLMs) by adapting to various scenarios such as high altitude, underwater, low altitude, and indoor settings. The proposed method, UNIFIER, seeks to enhance visual learning by decoupling visual information into distinct branches within each vision block.
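The decoupling idea can be illustrated with a vision block that routes features through a shared path plus a scenario-specific branch. The four scenario names come from the summary; the routing rule, branch type, and dimensions are assumptions rather than UNIFIER's actual design.

```python
import torch
import torch.nn as nn

class DecoupledVisionBlock(nn.Module):
    """Sketch: a shared pathway keeps general features, while a
    scenario-specific branch isolates per-scenario features to
    limit catastrophic forgetting (assumed mechanism)."""
    SCENARIOS = ["high_altitude", "underwater", "low_altitude", "indoor"]

    def __init__(self, dim: int = 64):
        super().__init__()
        self.branches = nn.ModuleDict(
            {name: nn.Linear(dim, dim) for name in self.SCENARIOS}
        )
        self.shared = nn.Linear(dim, dim)

    def forward(self, x, scenario: str):
        return self.shared(x) + self.branches[scenario](x)

# Toy usage: route a batch of 2 feature vectors through the underwater branch.
block = DecoupledVisionBlock()
y = block(torch.randn(2, 64), "underwater")
print(y.shape)  # torch.Size([2, 64])
```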
DocPTBench: Benchmarking End-to-End Photographed Document Parsing and Translation
Neutral · Artificial Intelligence
The introduction of DocPTBench marks a significant advancement in the benchmarking of end-to-end photographed document parsing and translation, addressing the limitations of existing benchmarks that primarily focus on pristine scanned documents. This new benchmark includes over 1,300 high-resolution photographed documents and eight translation scenarios, with human-verified annotations for improved accuracy.
Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents
Positive · Artificial Intelligence
A new benchmark called CFG-Bench has been introduced to evaluate fine-grained action intelligence in Multimodal Large Language Models (MLLMs) for embodied agents. This benchmark includes 1,368 curated videos and 19,562 question-answer pairs, focusing on cognitive abilities such as physical interaction and evaluative judgment.