MOCHA: Multi-modal Objects-aware Cross-arcHitecture Alignment

arXiv — cs.LG · Monday, November 24, 2025 at 5:00:00 AM
  • MOCHA, a new distillation framework, enhances personalized object detection by transferring multimodal knowledge from a frozen vision-language model (VLM) teacher to a lightweight vision-only detector. This enables recognition of user-specific instances from only a few examples, without modifying the teacher model at inference (see the alignment sketch after this summary).
  • The significance of MOCHA lies in its ability to bridge the gap between the computational demands of large VLMs and the efficiency required for real-time applications, making personalized object detection more accessible and practical for various use cases.
  • This development reflects a broader trend in artificial intelligence where researchers are increasingly focused on optimizing the performance of VLMs and other models for specific tasks, such as video classification and clinical description generation, while addressing challenges related to efficiency and real-time processing.
— via World Pulse Now AI Editorial System
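To make the cross-architecture idea concrete, here is a minimal sketch of feature-alignment distillation from a frozen VLM teacher to a vision-only detector. The projection head, the cosine-alignment loss, and the tensor shapes are illustrative assumptions, not the exact MOCHA objective described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossArchitectureDistiller(nn.Module):
    """Illustrative sketch: align a student detector's region features
    with embeddings from a frozen VLM teacher. The projection head and
    cosine loss are assumptions, not MOCHA's published objective."""

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Lightweight projection from the detector's feature space
        # into the teacher's multimodal embedding space.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_feats: torch.Tensor,
                teacher_embeds: torch.Tensor) -> torch.Tensor:
        # student_feats: (num_regions, student_dim) from the vision-only detector
        # teacher_embeds: (num_regions, teacher_dim) from the frozen VLM (no grad)
        z_s = F.normalize(self.proj(student_feats), dim=-1)
        z_t = F.normalize(teacher_embeds.detach(), dim=-1)
        # Cosine-alignment distillation loss: 1 - cosine similarity per region.
        return (1.0 - (z_s * z_t).sum(dim=-1)).mean()

# Usage: the teacher stays frozen; only the detector and projection train,
# so nothing from the teacher is needed at inference time.
distiller = CrossArchitectureDistiller(student_dim=256, teacher_dim=768)
loss = distiller(torch.randn(8, 256), torch.randn(8, 768))
loss.backward()
```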

Continue Reading
MultiPriv: Benchmarking Individual-Level Privacy Reasoning in Vision-Language Models
Neutral · Artificial Intelligence
The introduction of MultiPriv marks a significant advancement in the evaluation of individual-level privacy reasoning within Vision-Language Models (VLMs). This benchmark addresses the inadequacies of current privacy assessments, which primarily focus on privacy perception rather than the ability of VLMs to link distributed information and construct individual profiles. The framework includes a novel bilingual multimodal dataset that features synthetic individual profiles linked to sensitive attributes.
PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly
Neutral · Artificial Intelligence
PhyBlock has been introduced as a progressive benchmark aimed at evaluating vision-language models (VLMs) on their physical understanding and planning capabilities through robotic 3D block assembly tasks. This benchmark features a four-level cognitive hierarchy assembly task and includes 2,600 tasks to assess models on spatial reasoning and physical comprehension.
Do Vision-Language Models Understand Visual Persuasiveness?
Neutral · Artificial Intelligence
Recent research has examined whether Vision-Language Models (VLMs) comprehend visual persuasion, which influences human attitudes and decisions. A new dataset was created for binary persuasiveness judgment, introducing a taxonomy of Visual Persuasive Factors (VPFs) that includes various levels of visual cues. The analysis indicates that VLMs tend to overestimate high persuasiveness and struggle with low/mid-level features, while high-level semantic alignment is a strong predictor of human judgment.
Lost in Translation and Noise: A Deep Dive into the Failure Modes of VLMs on Real-World Tables
Neutral · Artificial Intelligence
The introduction of MirageTVQA, a new benchmark for evaluating Vision-Language Models (VLMs), exposes performance gaps that existing datasets miss because they focus primarily on monolingual, visually clean tables. The benchmark includes nearly 60,000 QA pairs across 24 languages and incorporates realistic noise to better reflect real-world scenarios.
VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference
Positive · Artificial Intelligence
VLA-Pruner has been introduced as a novel method for token pruning in Vision-Language-Action (VLA) models, addressing the inefficiencies of existing approaches that focus solely on semantic salience. This method aims to enhance real-time deployment of VLA models by retaining critical information necessary for action generation while discarding redundant visual tokens.
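As a rough illustration of the kind of dual-level pruning described, the following sketch scores visual tokens by mixing a semantic-salience term with a temporal-change term and keeps the top-k. The scoring mix, the `keep_ratio` and `alpha` parameters, and the function name are hypothetical assumptions, not VLA-Pruner's actual criterion.

```python
import torch

def prune_visual_tokens(tokens: torch.Tensor,
                        prev_tokens: torch.Tensor,
                        salience: torch.Tensor,
                        keep_ratio: float = 0.5,
                        alpha: float = 0.5) -> torch.Tensor:
    """Hypothetical sketch: combine a semantic-salience score (e.g. from
    text-to-image attention) with a temporal-change score against the
    previous frame, then retain the highest-scoring tokens."""
    # tokens, prev_tokens: (num_tokens, dim); salience: (num_tokens,)
    temporal = (tokens - prev_tokens).norm(dim=-1)       # change vs. previous frame
    score = alpha * salience + (1.0 - alpha) * temporal  # dual-level score
    k = max(1, int(keep_ratio * tokens.shape[0]))
    keep = score.topk(k).indices.sort().values           # keep original token order
    return tokens[keep]

# Usage: keep 25% of 196 visual tokens for a single timestep.
pruned = prune_visual_tokens(torch.randn(196, 768), torch.randn(196, 768),
                             torch.rand(196), keep_ratio=0.25)
```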