EventBench: Towards Comprehensive Benchmarking of Event-based MLLMs

arXiv — cs.CV · Tuesday, November 25, 2025 at 5:00:00 AM
  • A new benchmark called EventBench has been introduced to evaluate multimodal large language models (MLLMs) on event-based vision. It pairs eight diverse task metrics with a large-scale event stream dataset to give a comprehensive assessment of MLLM performance across tasks spanning understanding, recognition, and spatial reasoning.
  • The introduction of EventBench is significant because it addresses the lack of comprehensive evaluation frameworks for MLLMs in event-based vision, helping researchers and developers understand and improve these models. By providing open access to raw event streams and task instructions, it also promotes transparency and collaboration in the AI research community; a rough sketch of what working with such data can look like follows this summary.
  • This development reflects a broader trend in AI research towards creating more robust and scalable evaluation frameworks. As MLLMs continue to evolve, the need for diverse and comprehensive benchmarks becomes increasingly critical. The integration of spatial reasoning tasks and large-scale datasets in EventBench aligns with ongoing efforts to improve the performance of AI models in complex, real-world scenarios, highlighting the importance of interdisciplinary approaches in advancing AI technologies.
— via World Pulse Now AI Editorial System
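For readers new to event cameras, below is a minimal sketch of what "raw event streams and task instructions" can look like in practice. The (x, y, timestamp, polarity) tuple format is the standard output of event cameras; everything else here (the .npy/JSON file layout, the model.answer call, field names such as event_file and reference, and the exact-match scoring) is an illustrative assumption, not EventBench's actual interface.

```python
# Sketch only: the event tuple format is standard for event cameras,
# but the file layout, task schema, and model API below are hypothetical.
import json
from pathlib import Path

import numpy as np


def load_event_stream(path: Path) -> np.ndarray:
    """Load a raw event stream as an (N, 4) array of x, y, timestamp, polarity."""
    return np.load(path)  # assumes events were exported as a .npy array


def events_to_frame(events: np.ndarray, height: int, width: int) -> np.ndarray:
    """Accumulate events into a 2-channel count image (one channel per polarity),
    one common way to hand event data to an image-based model."""
    frame = np.zeros((2, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    pol = (events[:, 3] > 0).astype(int)  # map polarity (-1/+1 or 0/1) to channel 0/1
    np.add.at(frame, (pol, y, x), 1.0)
    return frame


def evaluate(model, tasks_file: Path, data_dir: Path) -> float:
    """Hypothetical evaluation loop: each task pairs an event stream with an
    instruction (e.g. a recognition or spatial-reasoning question) and a reference answer."""
    tasks = json.loads(tasks_file.read_text())
    correct = 0
    for task in tasks:
        events = load_event_stream(data_dir / task["event_file"])
        frame = events_to_frame(events, height=480, width=640)
        answer = model.answer(frame, task["instruction"])  # placeholder MLLM call
        correct += int(answer.strip().lower() == task["reference"].strip().lower())
    return correct / len(tasks)
```

Accumulating events into a fixed-size count image is only one common way to feed event data to an image-based MLLM; benchmarks and models may instead use voxel grids, time surfaces, or the raw stream directly.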

Continue Reading
Can A.I. Generate New Ideas?
Neutral · Artificial Intelligence
OpenAI has launched GPT-5.2, its latest AI model, which is designed to enhance productivity and has shown mixed results in tests compared to its predecessor, GPT-5.1. This development comes amid increasing competition from Google's Gemini 3, which has rapidly gained a significant user base.
Measuring Iterative Temporal Reasoning with Time Puzzles
Neutral · Artificial Intelligence
The introduction of Time Puzzles marks a significant advancement in evaluating iterative temporal reasoning in large language models (LLMs). This task combines factual temporal anchors with cross-cultural calendar relations, generating puzzles that challenge LLMs' reasoning capabilities. Despite the simplicity of the dataset, models like GPT-5 achieved only 49.3% accuracy, highlighting the difficulty of the task.
From Rows to Reasoning: A Retrieval-Augmented Multimodal Framework for Spreadsheet Understanding
Positive · Artificial Intelligence
A new framework called From Rows to Reasoning (FRTR) has been introduced to enhance the reasoning capabilities of Large Language Models (LLMs) when dealing with complex spreadsheets. This framework includes FRTR-Bench, a benchmark featuring 30 enterprise-grade Excel workbooks, which aims to improve the understanding of multimodal data by breaking down spreadsheets into granular components.
KidVis: Do Multimodal Large Language Models Possess the Visual Perceptual Capabilities of a 6-Year-Old?
Neutral · Artificial Intelligence
A new benchmark called KidVis has been introduced to evaluate the visual perceptual capabilities of Multimodal Large Language Models (MLLMs), specifically assessing their performance against that of 6- to 7-year-old children across six atomic visual capabilities. The results reveal a significant performance gap, with human children scoring an average of 95.32 compared to GPT-5's score of 67.33.
VideoHEDGE: Entropy-Based Hallucination Detection for Video-VLMs via Semantic Clustering and Spatiotemporal Perturbations
Neutral · Artificial Intelligence
A new framework named VideoHEDGE has been introduced to detect hallucinations in video-capable vision-language models (Video-VLMs), addressing the frequent inaccuracies in video question answering. This system employs entropy-based reliability estimation and semantic clustering to evaluate the correctness of generated answers against video-question pairs.
