Object Counting with GPT-4o and GPT-5: A Comparative Study

arXiv — cs.CV · Thursday, December 4, 2025 at 5:00:00 AM
  • A comparative study has been conducted on the object counting capabilities of two multi-modal large language models, GPT-4o and GPT-5, focusing on their performance in zero-shot scenarios using only textual prompts. The evaluation was carried out on the FSC-147 and CARPK datasets, revealing that both models achieved results comparable to state-of-the-art methods, with some instances exceeding them.
  • This development highlights the potential of leveraging advanced language models for complex tasks like object counting without the need for extensive annotated data or visual examples, marking a significant step forward in AI capabilities.
  • The findings resonate with ongoing discussions in the AI community regarding the efficacy of large language models in various applications, including their role in enhancing vision-language synergy and addressing challenges in object recognition and counting, which are critical for advancing AI's practical applications.
— via World Pulse Now AI Editorial System


Continue Reading
Hierarchical Process Reward Models are Symbolic Vision Learners
Positive · Artificial Intelligence
A novel self-supervised symbolic auto-encoder has been introduced, enabling symbolic computer vision to interpret diagrams through structured representations and logical rules. This approach contrasts with traditional pixel-based visual models by parsing diagrams into geometric primitives, enhancing machine vision's interpretability.
Look, Recite, Then Answer: Enhancing VLM Performance via Self-Generated Knowledge Hints
Positive · Artificial Intelligence
A new framework called 'Look, Recite, Then Answer' has been proposed to enhance the performance of Vision-Language Models (VLMs) by having the models produce their own knowledge hints before answering. This approach aims to address the limitations of VLMs in specialized fields like precision agriculture, where reasoning-driven hallucination can hinder accurate visual perception.
DIQ-H: Evaluating Hallucination Persistence in VLMs Under Temporal Visual Degradation
Neutral · Artificial Intelligence
The introduction of DIQ-H marks a significant advancement in evaluating the robustness of Vision-Language Models (VLMs) under conditions of temporal visual degradation, addressing critical failure modes such as hallucination persistence. This benchmark applies various physics-based corruptions to assess how VLMs recover from errors across multiple frames in dynamic environments.
Language-Driven Object-Oriented Two-Stage Method for Scene Graph Anticipation
Positive · Artificial Intelligence
A new method for Scene Graph Anticipation (SGA) has been introduced, termed Linguistic Scene Graph Anticipation (LSGA), which utilizes a language-driven framework to enhance the prediction of future scene graphs from video clips. This approach aims to improve the understanding of dynamic scenes by integrating semantic dynamics and commonsense temporal regularities, which are often difficult to extract from visual features alone.
A Definition of AGI
Neutral · Artificial Intelligence
A recent paper has introduced a quantifiable framework for defining Artificial General Intelligence (AGI), proposing that AGI should match the cognitive versatility of a well-educated adult. This framework is based on the Cattell-Horn-Carroll theory and evaluates AI systems across ten cognitive domains, revealing significant gaps in current AI models, particularly in long-term memory storage.
Anthropic vs. OpenAI red teaming methods reveal different security priorities for enterprise AI
Neutral · Artificial Intelligence
Anthropic and OpenAI have recently showcased their respective AI models, Claude Opus 4.5 and GPT-5, highlighting their distinct approaches to security validation through system cards and red-team exercises. Anthropic's extensive 153-page system card contrasts with OpenAI's 60-page version, revealing differing methodologies in assessing AI robustness and security metrics.
SpatialReasoner: Active Perception for Large-Scale 3D Scene Understanding
Positive · Artificial Intelligence
The introduction of SpatialReasoner marks a significant advancement in spatial reasoning for large-scale 3D environments, addressing challenges faced by existing vision-language models that are limited to smaller, room-scale scenarios. This framework utilizes the H²U3D dataset, which encompasses multi-floor environments and generates diverse question-answer pairs to enhance 3D scene understanding.
ViRectify: A Challenging Benchmark for Video Reasoning Correction with Multimodal Large Language Models
Positive · Artificial Intelligence
The introduction of ViRectify marks a significant advancement in the evaluation of multimodal large language models (MLLMs) by providing a comprehensive benchmark for correcting video reasoning errors. This benchmark includes a dataset of over 30,000 instances across various domains, challenging MLLMs to identify errors and generate rationales grounded in video evidence.