PET2Rep: Towards Vision-Language Model-Driven Automated Radiology Report Generation for Positron Emission Tomography

arXiv — cs.CV · Thursday, November 13, 2025 at 5:00:00 AM
The introduction of PET2Rep marks a pivotal advancement in the automation of radiology report generation for positron emission tomography (PET), a vital imaging technique in oncology and neurology. Traditional report creation is labor-intensive and time-consuming, which can hinder clinical decision-making. Recent developments in vision-language models (VLMs) have shown promise in medical applications, yet their use in PET imaging has been limited. PET2Rep addresses this gap by providing a large-scale benchmark dataset that uniquely captures whole-body image-report pairs with metabolic information. This dataset not only facilitates the evaluation of VLMs in generating accurate and informative reports but also introduces new clinical efficacy metrics to assess the quality of radiotracer uptake descriptions in key organs. By bridging the existing gaps in PET imaging resources, PET2Rep is set to enhance the efficiency and effectiveness of radiology practices.
— via World Pulse Now AI Editorial System
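To make the new clinical efficacy metrics more concrete, here is a minimal sketch of how an organ-level radiotracer-uptake score might be computed by comparing generated and reference reports. The organ list, keyword matching, and F1 aggregation below are illustrative assumptions, not PET2Rep's actual metric.

```python
# Hedged sketch of an organ-level clinical efficacy metric in the spirit of
# PET2Rep's uptake-description evaluation. Organ list, keyword matching, and
# F1 aggregation are illustrative assumptions, not the benchmark's metric.

KEY_ORGANS = ["brain", "liver", "spleen", "kidney", "bladder", "heart"]

def organs_with_uptake(report: str) -> set[str]:
    """Naively detect which key organs a report describes uptake for."""
    text = report.lower()
    return {o for o in KEY_ORGANS if o in text and "uptake" in text}

def organ_f1(generated: str, reference: str) -> float:
    """F1 between the organ sets mentioned by generated vs. reference reports."""
    gen, ref = organs_with_uptake(generated), organs_with_uptake(reference)
    if not gen or not ref:
        return 0.0
    tp = len(gen & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(gen), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

print(organ_f1(
    "Mild FDG uptake in the liver and spleen.",
    "Physiologic uptake noted in liver, spleen, and bladder.",
))  # 0.8
```
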


Recommended Readings
Doubly Debiased Test-Time Prompt Tuning for Vision-Language Models
Positive · Artificial Intelligence
The paper discusses the challenges of test-time prompt tuning for vision-language models, highlighting the issue of prompt optimization bias that can lead to suboptimal performance. It identifies two main causes: the model's focus on entropy minimization, which may overlook prediction accuracy, and data misalignment between visual and textual modalities. To address these issues, the authors propose a new method called Doubly Debiased Test-Time Prompt Tuning, aimed at improving model performance in downstream tasks.
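The entropy-minimization objective the paper critiques is the standard test-time prompt tuning baseline. The following minimal PyTorch sketch shows that baseline: a learnable prompt shift is tuned to minimize the entropy of predictions averaged over augmented views. The encoders and prompt parameterization are stand-ins, and the paper's debiasing terms are not shown.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the entropy-minimization baseline that test-time prompt
# tuning builds on. The features here are random stand-ins; a real setup
# would use a frozen CLIP image/text encoder.

torch.manual_seed(0)
num_classes, dim = 10, 64
image_feats = F.normalize(torch.randn(8, dim), dim=-1)        # 8 augmented views
class_embed = F.normalize(torch.randn(num_classes, dim), dim=-1)
prompt = torch.zeros(dim, requires_grad=True)                 # learnable prompt shift
opt = torch.optim.Adam([prompt], lr=1e-2)

for _ in range(10):
    text_feats = F.normalize(class_embed + prompt, dim=-1)    # prompt-conditioned classes
    logits = 100.0 * image_feats @ text_feats.t()
    probs = logits.softmax(dim=-1).mean(dim=0)                # average over views
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()   # objective to minimize
    opt.zero_grad()
    entropy.backward()
    opt.step()
```

Note that nothing in this objective rewards correct predictions, which is exactly the bias the paper highlights: the prompt can become confidently wrong.
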
Evaluating Open-Weight Large Language Models for Structured Data Extraction from Narrative Medical Reports Across Multiple Use Cases and Languages
Positive · Artificial Intelligence
Large language models (LLMs) are increasingly used to extract structured information from clinical records. A recent evaluation, conducted in the Netherlands, the UK, and the Czech Republic, tested 15 open-weight LLMs on pathology and radiology reports across six use cases, including colorectal liver metastases and neurodegenerative diseases. The study compared various prompting strategies and found that the top models achieved macro-average scores close to inter-rater agreement, indicating their effectiveness for structured data extraction.
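As a rough illustration of the task, here is a sketch of a structured-extraction prompt and response parser for a narrative report. The field schema, prompt wording, and parsing approach are hypothetical, not the study's actual templates.

```python
import json

# Illustrative structured-extraction prompt for a radiology report. The
# schema and field names are hypothetical examples, not the study's protocol.

SCHEMA = {"num_liver_metastases": "integer or null", "largest_lesion_mm": "number or null"}

def build_prompt(report: str) -> str:
    return (
        "Extract the following fields from the report and answer with JSON only.\n"
        f"Fields: {json.dumps(SCHEMA)}\n"
        f"Report:\n{report}\n"
    )

def parse_response(raw: str) -> dict:
    """Parse the model's JSON answer, tolerating surrounding text."""
    start, end = raw.find("{"), raw.rfind("}")
    return json.loads(raw[start : end + 1])

prompt = build_prompt("CT: two liver metastases, the largest measuring 23 mm.")
print(parse_response('{"num_liver_metastases": 2, "largest_lesion_mm": 23}'))
```
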
Bridging Hidden States in Vision-Language Models
Positive · Artificial Intelligence
Vision-Language Models (VLMs) integrate visual content with natural language. Current methods typically fuse the two modalities either early in the encoding process or late through pooled embeddings. This paper introduces a lightweight fusion module that uses cross-only, bidirectional attention layers to align hidden states from both modalities while keeping the encoders non-causal, aiming to improve VLM performance by leveraging the inherent structure of visual and textual data.
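A minimal sketch of such a cross-only, bidirectional fusion layer follows: each modality attends to the other with no self-attention. The dimensions, head count, and residual layout are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

# Minimal sketch of a cross-only, bidirectional fusion layer: visual tokens
# attend to textual hidden states and vice versa (no self-attention),
# aligning the two streams. Dimensions and residual layout are assumptions.

class BidirectionalCrossFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.img_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v, self.norm_t = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor):
        # Visual tokens query textual hidden states, and vice versa.
        v_out, _ = self.img_to_txt(query=vis, key=txt, value=txt)
        t_out, _ = self.txt_to_img(query=txt, key=vis, value=vis)
        return self.norm_v(vis + v_out), self.norm_t(txt + t_out)

vis, txt = torch.randn(2, 49, 256), torch.randn(2, 12, 256)
fused_vis, fused_txt = BidirectionalCrossFusion()(vis, txt)
print(fused_vis.shape, fused_txt.shape)  # [2, 49, 256] and [2, 12, 256]
```
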
FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning
Positive · Artificial Intelligence
FastDriveVLA is a novel framework designed for efficient end-to-end autonomous driving through a reconstruction-based visual token pruning method. This approach addresses the high computational costs associated with long visual tokens in Vision-Language-Action (VLA) models. By focusing on retaining visual tokens that contain essential foreground information, FastDriveVLA aims to enhance decision-making in driving scenarios, marking a significant advancement in the application of VLA models in autonomous systems.
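To give a flavor of reconstruction-based pruning, here is a hedged sketch in which tokens that a cheap decoder reconstructs poorly from a global summary are treated as information-rich (e.g., foreground) and kept. The scoring rule, keep ratio, and untrained decoder are illustrative assumptions; FastDriveVLA's actual pruner is trained rather than improvised per call.

```python
import torch
import torch.nn as nn

# Hedged sketch of reconstruction-based visual token pruning: tokens with
# high reconstruction error from a global summary are kept as likely
# foreground. Scoring rule and keep ratio are illustrative assumptions.

def prune_tokens(tokens: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    b, n, d = tokens.shape
    summary = tokens.mean(dim=1, keepdim=True)           # global context vector
    decoder = nn.Linear(d, d)                            # stand-in reconstruction head
    recon = decoder(summary).expand(b, n, d)
    error = (tokens - recon).pow(2).mean(dim=-1)         # per-token reconstruction error
    keep = max(1, int(n * keep_ratio))
    idx = error.topk(keep, dim=1).indices                # keep hardest-to-reconstruct tokens
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, d))

pruned = prune_tokens(torch.randn(2, 196, 768))
print(pruned.shape)  # torch.Size([2, 49, 768])
```
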
Concept-as-Tree: A Controllable Synthetic Data Framework Makes Stronger Personalized VLMs
Positive · Artificial Intelligence
The paper titled 'Concept-as-Tree: A Controllable Synthetic Data Framework Makes Stronger Personalized VLMs' discusses the advancements in Vision-Language Models (VLMs) aimed at enhancing personalization. It highlights the challenges posed by the lack of user-provided positive samples and the poor quality of negative samples. To address these issues, the authors introduce the Concept-as-Tree (CaT) framework, which generates diverse positive and negative samples, thus improving VLM performance in personalization tasks.
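As a loose illustration of the tree idea, the sketch below expands a user concept into attribute branches whose leaves become text-to-image prompts for positive samples (negatives could swap in a different concept). The tree contents and prompt template are assumptions, not CaT's actual construction.

```python
# Illustrative sketch of tree-style synthetic prompt generation. The concept,
# attribute branches, and prompt template are hypothetical examples.

concept_tree = {
    "my_dog": {
        "pose": ["sitting", "running"],
        "scene": ["at the beach", "in the snow"],
    }
}

def leaf_prompts(tree: dict) -> list[str]:
    prompts = []
    for concept, branches in tree.items():
        for values in branches.values():
            prompts += [f"a photo of {concept}, {value}" for value in values]
    return prompts

print(leaf_prompts(concept_tree))
# ['a photo of my_dog, sitting', 'a photo of my_dog, running',
#  'a photo of my_dog, at the beach', 'a photo of my_dog, in the snow']
```
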
NeuS-QA: Grounding Long-Form Video Understanding in Temporal Logic and Neuro-Symbolic Reasoning
Neutral · Artificial Intelligence
NeuS-QA is a new neuro-symbolic pipeline designed to enhance Long Video Question Answering (LVQA) by addressing the limitations of traditional vision-language models (VLMs). While VLMs perform well with single images and short videos, they struggle with LVQA due to the need for complex temporal reasoning. NeuS-QA offers a training-free, plug-and-play solution that improves interpretability by ensuring only logic-verified segments are processed by the VLM, thus enhancing the model's ability to understand long-form video content.
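A toy example of logic-verified segment selection follows: per-segment event labels are checked against a simple "A then B" temporal pattern, and only the segments covering a verified match would be passed to the VLM. The event names and pattern are illustrative assumptions, not NeuS-QA's actual logic formalism.

```python
# Minimal sketch of logic-verified segment selection for long videos. Event
# labels and the "first then second" pattern are illustrative assumptions.

def segments_matching_then(events: list[str], first: str, then: str) -> list[int]:
    """Return segment indices covering an occurrence of `first` before `then`."""
    for i, event in enumerate(events):
        if event == first:
            for j in range(i + 1, len(events)):
                if events[j] == then:
                    return list(range(i, j + 1))
    return []

events = ["walk", "open_door", "walk", "sit_down"]   # one label per video segment
print(segments_matching_then(events, "open_door", "sit_down"))  # [1, 2, 3]
```
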
Human-Corrected Labels Learning: Enhancing Labels Quality via Human Correction of VLMs Discrepancies
Positive · Artificial Intelligence
The article discusses the introduction of Human-Corrected Labels (HCLs) to improve the quality of labels generated by Vision-Language Models (VLMs). It highlights the issues of low-quality labels and the lack of error correction in VLM outputs. The proposed method involves human intervention to correct discrepancies in VLM-generated labels, leading to enhanced annotation quality and reduced labor costs, supported by extensive experimental results.
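The routing logic behind this kind of targeted correction can be sketched simply: only samples where VLM annotators disagree, or agree with low confidence, are sent to a human. The threshold and annotator interface below are assumptions, not the paper's pipeline.

```python
# Illustrative sketch of human-corrected labelling: confident VLM agreement
# is kept, discrepancies go to a human. Threshold and interfaces are assumed.

def corrected_labels(samples, vlm_a, vlm_b, ask_human, conf_threshold=0.8):
    labels = []
    for x in samples:
        label_a, conf_a = vlm_a(x)
        label_b, conf_b = vlm_b(x)
        if label_a == label_b and min(conf_a, conf_b) >= conf_threshold:
            labels.append(label_a)           # confident agreement: keep VLM label
        else:
            labels.append(ask_human(x))      # discrepancy: human correction
    return labels

# Toy usage with stub annotators.
print(corrected_labels(
    ["img1", "img2"],
    vlm_a=lambda x: ("cat", 0.9),
    vlm_b=lambda x: ("cat", 0.85) if x == "img1" else ("dog", 0.7),
    ask_human=lambda x: "dog",
))  # ['cat', 'dog']
```
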
Out-of-Distribution Detection with Positive and Negative Prompt Supervision Using Large Language Models
Positive · Artificial Intelligence
The paper discusses advancements in out-of-distribution (OOD) detection, focusing on the integration of visual and textual modalities through large language models (LLMs). It introduces a method called Positive and Negative Prompt Supervision, which aims to improve OOD detection by using class-specific prompts that capture inter-class features. This approach addresses the limitations of negative prompts that may include non-ID features, potentially leading to suboptimal outcomes.
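To ground the idea, here is a minimal sketch of CLIP-style OOD scoring with positive and negative prompt embeddings: an image counts as in-distribution when it matches some class's positive prompt more strongly than any negative prompt. The scoring rule is an illustrative assumption, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of OOD scoring with positive and negative prompt embeddings.
# The prompt features are random stand-ins; a real setup would encode
# class-specific prompts with a frozen text encoder.

torch.manual_seed(0)
num_classes, dim = 5, 64
pos_prompts = F.normalize(torch.randn(num_classes, dim), dim=-1)
neg_prompts = F.normalize(torch.randn(num_classes, dim), dim=-1)

def id_score(image_feat: torch.Tensor) -> float:
    image_feat = F.normalize(image_feat, dim=-1)
    pos_sim = (image_feat @ pos_prompts.t()).max()   # best positive-prompt match
    neg_sim = (image_feat @ neg_prompts.t()).max()   # best negative-prompt match
    return (pos_sim - neg_sim).item()                # low score suggests OOD

print(id_score(torch.randn(dim)))
```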