ExAct: A Video-Language Benchmark for Expert Action Analysis
- ExAct has been introduced as a new video-language benchmark designed to evaluate expert-level understanding of skilled physical activities. It features 3,521 curated video question-answer pairs spanning 11 activities across six domains, including sports and cooking. The benchmark demands nuanced comprehension: the best-performing model, GPT-4o, achieves only 44.70% accuracy, compared to 82.02% for human experts.
- This result is significant because it quantifies the performance gap between state-of-the-art vision-language models and human expertise, underscoring the need for models that can accurately interpret complex human skills across diverse domains.
- The introduction of ExAct aligns with ongoing discussions about the reliability and effectiveness of vision-language models, particularly their ability to handle nuanced tasks. As the field evolves, there is growing emphasis on enhancing model capabilities through frameworks such as STAR and LAST, which aim to improve reasoning over spatial contexts and long video sequences.
— via World Pulse Now AI Editorial System

