IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models

arXiv — cs.CL · Thursday, November 13, 2025 at 5:00:00 AM
The introduction of the IFEval-Audio dataset marks a significant step in evaluating instruction-following capabilities in audio-based large language models (LLMs). While large language models have shown proficiency in following instructions for text-based tasks, their performance often declines when integrated with non-text modalities like audio. This dataset, consisting of 280 audio-instruction-answer triples across six diverse dimensions—Content, Capitalization, Symbol, List Structure, Length, and Format—aims to benchmark state-of-the-art audio LLMs in this area. The public release of IFEval-Audio is crucial as it fills a notable gap in research, providing a foundation for future studies and advancements in the instruction-following performance of audio LLMs.
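Benchmarks in the IFEval family typically score model outputs with rule-based verifiers, one per instruction type. As a minimal sketch of what checkers for dimensions such as Capitalization, Length, and List Structure could look like (the function names and constraint encodings below are illustrative assumptions, not IFEval-Audio's actual code):

```python
# Hypothetical rule-based instruction-following checks, in the spirit of
# IFEval-style verifiers. Illustrative only; not IFEval-Audio's released code.

def check_capitalization(response: str) -> bool:
    """Instruction: answer entirely in uppercase."""
    return response == response.upper() and any(c.isalpha() for c in response)

def check_length(response: str, max_words: int) -> bool:
    """Instruction: answer in at most `max_words` words."""
    return len(response.split()) <= max_words

def check_list_structure(response: str, n_items: int) -> bool:
    """Instruction: answer as a numbered list with `n_items` entries."""
    lines = [ln for ln in response.splitlines() if ln.strip()]
    return len(lines) == n_items and all(
        ln.strip().startswith(f"{i + 1}.") for i, ln in enumerate(lines)
    )

response = "1. DOG\n2. CAT\n3. BIRD"
print(check_capitalization(response))        # True
print(check_length(response, max_words=10))  # True
print(check_list_structure(response, 3))     # True
```

Checks like these make instruction compliance objectively scorable regardless of whether the model's answer is semantically correct, which is what separates the Content dimension from the formatting-oriented ones.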
— via World Pulse Now AI Editorial System


Recommended Readings
Let Language Constrain Geometry: Vision-Language Models as Semantic and Spatial Critics for 3D Generation
Positive · Artificial Intelligence
The paper introduces VLM3D, a novel framework that utilizes vision-language models (VLMs) to enhance text-to-3D generation. It addresses two major limitations in current models: the lack of fine-grained semantic alignment and inadequate 3D spatial understanding. VLM3D employs a dual-query critic signal to evaluate both semantic fidelity and geometric coherence, significantly improving the generation process. The framework demonstrates its effectiveness across different paradigms, marking a step forward in 3D generation technology.
GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification
Positive · Artificial Intelligence
The article presents a new framework called GMAT, which enhances Multiple Instance Learning (MIL) for whole slide image (WSI) classification. By integrating vision-language models (VLMs), GMAT aims to improve the generation of clinical descriptions that are more expressive and medically specific. This addresses limitations in existing methods that rely on large language models (LLMs) for generating descriptions, which often lack domain grounding and detailed medical specificity, thus improving alignment with visual features.
Doubly Debiased Test-Time Prompt Tuning for Vision-Language Models
Positive · Artificial Intelligence
The paper discusses the challenges of test-time prompt tuning for vision-language models, highlighting the issue of prompt optimization bias that can lead to suboptimal performance in downstream tasks. It identifies two main causes: the model's focus on entropy minimization, which may overlook prediction accuracy, and data misalignment between visual and textual modalities. To address these issues, the authors propose a new method called Doubly Debiased Test-Time Prompt Tuning, aimed at improving model performance in zero-shot settings.
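Entropy minimization, the objective the paper critiques, adapts the prompt so that predictions on a test sample become confident. A minimal NumPy sketch of that base objective (not the paper's debiased method, whose exact form is not given here) shows why confidence and accuracy can diverge:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def prediction_entropy(logits: np.ndarray) -> float:
    """Mean entropy of per-sample class distributions.

    Test-time prompt tuning typically minimizes this quantity over
    augmented views of one test image. Low entropy means confident
    predictions, but, as the paper argues, not necessarily accurate ones:
    the model can be confidently wrong.
    """
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

confident = np.array([[5.0, 0.0, 0.0]])  # peaked distribution: low entropy
uniform = np.array([[1.0, 1.0, 1.0]])    # uniform distribution: max entropy
print(prediction_entropy(confident))
print(prediction_entropy(uniform))       # log(3) ≈ 1.0986
```

Minimizing this entropy moves predictions toward the peaked case regardless of which class is actually correct, which is the optimization bias the proposed method is designed to counteract.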
Understanding InfoNCE: Transition Probability Matrix Induced Feature Clustering
Positive · Artificial Intelligence
The article discusses InfoNCE, a key objective in contrastive learning, which is pivotal for unsupervised representation learning across various domains. Despite its success, the theoretical foundations of InfoNCE are not well established. This work introduces a feature space to model augmented views and a transition probability matrix to capture data augmentation dynamics. The authors propose SC-InfoNCE, a new loss function that allows flexible control over feature similarity alignment, enhancing the training process.
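For context, the standard InfoNCE objective contrasts each positive pair against in-batch negatives. A minimal NumPy sketch of that base loss (independent of the paper's SC-InfoNCE variant, whose exact form is not reproduced here) is:

```python
import numpy as np

def info_nce(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.1) -> float:
    """Standard InfoNCE loss over a batch of paired embeddings.

    z1[i] and z2[i] are two augmented views of the same example;
    all other rows of z2 serve as in-batch negatives for z1[i].
    """
    # L2-normalize so dot products are cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature             # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positive pairs lie on the diagonal
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
print(info_nce(z, z))                            # identical views: low loss
print(info_nce(z, rng.normal(size=(8, 16))))     # random pairing: higher loss
```

Minimizing this loss pulls matched views together and pushes mismatched ones apart in feature space, which is the clustering behavior the article's transition-probability analysis aims to explain.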
Concept-as-Tree: A Controllable Synthetic Data Framework Makes Stronger Personalized VLMs
Positive · Artificial Intelligence
The paper titled 'Concept-as-Tree: A Controllable Synthetic Data Framework Makes Stronger Personalized VLMs' discusses the advancements in Vision-Language Models (VLMs) aimed at enhancing personalization. It highlights the challenges posed by the lack of user-provided positive samples and the poor quality of negative samples. To address these issues, the authors introduce the Concept-as-Tree (CaT) framework, which generates diverse positive and negative samples, thus improving VLM performance in personalization tasks.
NeuS-QA: Grounding Long-Form Video Understanding in Temporal Logic and Neuro-Symbolic Reasoning
Neutral · Artificial Intelligence
NeuS-QA is a new neuro-symbolic pipeline designed to enhance Long Video Question Answering (LVQA) by addressing the limitations of traditional vision-language models (VLMs). While VLMs perform well with single images and short videos, they struggle with LVQA due to the need for complex temporal reasoning. NeuS-QA offers a training-free, plug-and-play solution that improves interpretability by ensuring only logic-verified segments are processed by the VLM, thus enhancing the model's ability to understand long-form video content.
Zero-Shot Temporal Interaction Localization for Egocentric Videos
Positive · Artificial Intelligence
The paper titled 'Zero-Shot Temporal Interaction Localization for Egocentric Videos' presents a novel approach called EgoLoc, aimed at improving the localization of human-object interactions in egocentric videos. Traditional methods rely heavily on annotated action and object categories, leading to domain bias and inefficiencies. EgoLoc introduces a self-adaptive sampling strategy to enhance visual prompts for vision-language model reasoning, ultimately achieving better temporal interaction localization.
Human-Corrected Labels Learning: Enhancing Labels Quality via Human Correction of VLMs Discrepancies
Positive · Artificial Intelligence
The article introduces Human-Corrected Labels (HCLs) to improve the quality of labels generated by Vision-Language Models (VLMs), addressing both low-quality labels and the absence of error correction in raw VLM outputs. The proposed method brings in human intervention specifically where VLM-generated labels disagree, correcting those discrepancies to improve annotation quality while keeping labor costs low, a trade-off supported by extensive experimental results.