Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents

arXiv — cs.CV · Tuesday, November 25, 2025, 5:00 AM
  • A new benchmark called CFG-Bench has been introduced to evaluate fine-grained action intelligence in Multimodal Large Language Models (MLLMs) for embodied agents. This benchmark includes 1,368 curated videos and 19,562 question-answer pairs, focusing on cognitive abilities such as physical interaction and evaluative judgment.
  • The development of CFG-Bench is significant because it addresses a critical gap in existing benchmarks, which often overlook the nuanced decision-making required for physical interaction in complex environments, and thereby offers a more rigorous way to evaluate MLLMs.
  • This advancement reflects a broader trend in AI research toward improving reasoning and interaction in MLLMs. Related initiatives targeting spatial reasoning, social-interaction understanding, and multimodal retrieval point to a growing recognition that comprehensive evaluation frameworks are needed.
— via World Pulse Now AI Editorial System


Continue Reading
SPINE: Token-Selective Test-Time Reinforcement Learning with Entropy-Band Regularization
PositiveArtificial Intelligence
The SPINE framework introduces a token-selective approach to test-time reinforcement learning, addressing the challenges faced by large language models (LLMs) and multimodal LLMs (MLLMs) during distribution shifts at test-time. By focusing on high-entropy tokens and applying an entropy-band regularizer, SPINE aims to enhance model performance and maintain exploration during reinforcement learning processes.
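The token-selection idea summarized above can be sketched in a few lines: compute the entropy of each token's predictive distribution and keep only tokens whose entropy falls inside a chosen band. The function names and band thresholds below are illustrative assumptions, not SPINE's actual implementation.

```python
import numpy as np

def token_entropy(probs):
    """Shannon entropy (in nats) of each token's predictive distribution."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=-1)

def select_entropy_band(probs, low=0.5, high=2.0):
    """Boolean mask over tokens whose entropy lies in [low, high].

    Hypothetical thresholds: low filters out near-deterministic tokens
    that carry little learning signal; high excludes maximally uncertain
    tokens that would destabilize a test-time update.
    """
    h = token_entropy(probs)
    return (h >= low) & (h <= high)

# Toy example: three tokens over a 4-word vocabulary.
probs = np.array([
    [0.97, 0.01, 0.01, 0.01],   # near-deterministic -> low entropy, excluded
    [0.40, 0.30, 0.20, 0.10],   # moderately uncertain -> selected
    [0.25, 0.25, 0.25, 0.25],   # uniform -> entropy ln(4) ~= 1.386, selected
])
mask = select_entropy_band(probs)
print(mask.tolist())  # [False, True, True]
```

In a test-time RL loop, such a mask would restrict the policy-gradient update to the selected tokens, which is one plausible reading of "token-selective" here.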
Vision-Motion-Reference Alignment for Referring Multi-Object Tracking via Multi-Modal Large Language Models
PositiveArtificial Intelligence
A new framework named Vision-Motion-Reference aligned Referring Multi-Object Tracking (VMRMOT) has been proposed to enhance the performance of referring multi-object tracking (RMOT) by integrating motion dynamics with visual and language references using multi-modal large language models (MLLMs). This addresses the limitations of conventional RMOT, which struggles to account for dynamic changes in object motion.
ReEXplore: Improving MLLMs for Embodied Exploration with Contextualized Retrospective Experience Replay
PositiveArtificial Intelligence
The introduction of ReEXplore marks a significant advancement in embodied exploration by utilizing a training-free framework that enhances the decision-making capabilities of multimodal large language models (MLLMs) through retrospective experience replay and hierarchical frontier selection. This approach addresses the limitations of existing MLLMs, which struggle with outdated knowledge and complex action spaces.
ReMatch: Boosting Representation through Matching for Multimodal Retrieval
PositiveArtificial Intelligence
ReMatch has been introduced as a framework that utilizes the generative capabilities of Multimodal Large Language Models (MLLMs) for enhanced multimodal retrieval. This approach trains the embedding MLLM end-to-end, incorporating a chat-style generative matching stage that assesses relevance from diverse inputs, thereby improving the quality of multimodal embeddings.
PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection
PositiveArtificial Intelligence
PRISM-Bench has been introduced as a new benchmark for evaluating multimodal large language models (MLLMs) through puzzle-based visual tasks that assess both problem-solving capabilities and reasoning processes. This benchmark specifically requires models to identify errors in a step-by-step chain of thought, enhancing the evaluation of logical consistency and visual reasoning.
Multi-speaker Attention Alignment for Multimodal Social Interaction
PositiveArtificial Intelligence
A new method for enhancing social interaction understanding in videos has been proposed, focusing on the alignment of verbal and non-verbal cues in multi-speaker scenarios. This approach addresses the limitations observed in existing Multimodal Large Language Models (MLLMs), which struggle with cross-modal attention consistency in such contexts.
Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning
PositiveArtificial Intelligence
A new algorithm named MM-Det++ has been proposed to enhance the detection of videos generated by diffusion models, addressing the growing concerns over synthetic media and information security. This algorithm integrates a Spatio-Temporal branch utilizing a Frame-Centric Vision Transformer and a Multimodal branch for improved detection capabilities.
RoadBench: Benchmarking MLLMs on Fine-Grained Spatial Understanding and Reasoning under Urban Road Scenarios
NeutralArtificial Intelligence
A new benchmark called RoadBench has been introduced to evaluate the fine-grained spatial understanding and reasoning capabilities of multimodal large language models (MLLMs) in urban road scenarios, focusing on road markings as a critical element. This benchmark includes six tasks with 9,121 manually verified test cases, utilizing BEV and FPV image inputs to assess MLLMs' performance.