arXiv:2511.10983v1 Announce Type: new 
Abstract: We propose a training-free, binary verification workflow for zero-shot vision with off-the-shelf VLMs. It comprises two steps: (i) quantization, which turns the open-ended query into a multiple-choice question (MCQ) with a small, explicit list of unambiguous candidates; and (ii) binarization, which asks one True/False question per candidate and resolves deterministically: if exactly one is True, select it; otherwise, revert to an MCQ over the remaining plausible candidates. We evaluate the workflow on referring expression grounding (REC), spatial reasoning (Spatial-Map, Spatial-Grid, Spatial-Maze), and BLINK-Jigsaw. Relative to answering open-ended queries directly, quantization to MCQ yields large gains, and True/False binarization provides a consistent additional boost. Across all tasks, the same workflow produces significant improvements, indicating generality. Our theory formalizes how open-ended vision queries can be quantized to MCQs and further binarized into True/False verifications, establishing a hardness ladder. A simple analysis explains why Boolean resolution boosts accuracy. Together, these components yield a simple and unified workflow that emphasizes inference-time design over task-specific training. It offers a practical, drop-in path to stronger zero-shot vision with today's VLMs.

A new training-free binary verification workflow for zero-shot vision has been proposed, utilizing off-the-shelf Vision Language Models (VLMs). The workflow consists of two main steps: quantization, which converts open-ended queries into multiple-choice questions (MCQs), and binarization, which evaluates candidates with True/False questions. This method has been evaluated across various tasks, including referring expression grounding and spatial reasoning, showing significant improvements in performance compared to traditional open-ended query methods.

Binary Verification for Zero-Shot Vision
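
The resolution rule described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: `binary_verify`, `ask_true_false`, and `ask_mcq` are hypothetical names, and the two callables stand in for whatever VLM prompts a real system would issue. The abstract does not say which candidates count as "remaining plausible" when no candidate is judged True, so the sketch assumes the fallback MCQ then runs over the full list.

```python
from typing import Callable, Sequence


def binary_verify(
    candidates: Sequence[str],
    ask_true_false: Callable[[str], bool],
    ask_mcq: Callable[[Sequence[str]], str],
) -> str:
    """Resolve an MCQ by asking one True/False question per candidate.

    ask_true_false and ask_mcq are hypothetical stand-ins for VLM calls.
    """
    if not candidates:
        raise ValueError("need at least one candidate")

    # Binarization: one independent True/False check per candidate.
    accepted = [c for c in candidates if ask_true_false(c)]

    # Deterministic resolution: exactly one True wins outright.
    if len(accepted) == 1:
        return accepted[0]

    # Otherwise fall back to an MCQ over the plausible candidates.
    # Assumption: if nothing was accepted, every candidate stays plausible.
    plausible = accepted if accepted else list(candidates)
    return ask_mcq(plausible)
```

With stubbed callables, `binary_verify(["cat", "dog", "bird"], lambda c: c == "dog", lambda cs: cs[0])` selects `"dog"` without ever issuing the fallback MCQ.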

The service is customized for teachers' needs and includes added security and privacy, a collaborative workspace, and more.

OpenAI expands free educational offerings - here's what ChatGPT for Teachers can do

GPT-5.1-Codex-Max is ready to take on your next massive coding job. Here's what's new.

OpenAI's Codex Max solves one of my biggest AI coding annoyances - and adds dramatically faster performance

The agent offers one-click buying for all your holiday needs and will be free for all US-based users.

Perplexity's AI shopping tool is free for all now, just in time for Black Friday - how to use it

If aesthetics and efficiency top your list of needs, there are several Linux distributions that are right up your alley. Both Ubuntu Budgie and Pop!_OS should top that list.

Ubuntu Budgie vs. Pop!_OS: I've used both Linux distros - here's how to choose

The Nomad Stratos Band might just be my favorite Apple Watch band ever. Here's what makes it special.

My search for the ultimate Apple Watch band is over: This one checks all the boxes for me

With blazing performance and liquid cooling, Redmagic's 11 Pro boasts the best mobile gaming I've experienced.

I'm a diehard Pixel user, but this liquid-cooled Android phone has my attention

arXiv:2511.10946v1 Announce Type: new 
Abstract: Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents. We attribute this to a modality gap between the 3D tasks and the 2D training of VLMs, which leads to inefficient retrieval of 3D information from 2D input. To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Specifically, we design a 3D Sandbox reconstruction and perception pipeline comprising four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence, achieving, for instance, an 8.3% gain on SAT Real compared with baseline methods. These results demonstrate that equipping VLMs with a 3D abstraction substantially enhances their 3D reasoning ability without additional training, suggesting new possibilities for general-purpose embodied intelligence.

Vision-language models (VLMs) face challenges in 3D tasks such as spatial cognition and physical understanding, which are essential for applications in robotics and embodied agents. This difficulty arises from a modality gap between 3D tasks and the 2D training of VLMs, leading to inefficient retrieval of 3D information. To address this, the SandboxVLM framework is introduced. It uses abstract bounding boxes to encode geometric structure and physical kinematics, improving spatial intelligence without additional training and yielding an 8.3% gain on the SAT Real benchmark.

Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

arXiv:2511.14109v1 Announce Type: new 
Abstract: Visual Place Recognition (VPR) aims to match query images against a database using visual cues. State-of-the-art methods aggregate features from deep backbones to form global descriptors. Optimal transport-based aggregation methods reformulate feature-to-cluster assignment as a transport problem, but the standard Sinkhorn algorithm symmetrically treats source and target marginals, limiting effectiveness when image features and cluster centers exhibit substantially different distributions. We propose an asymmetric aggregation VPR method with geometric constraints for locally aggregated descriptors, called $A^2$GC-VPR. Our method employs row-column normalization averaging with separate marginal calibration, enabling asymmetric matching that adapts to distributional discrepancies in visual place recognition. Geometric constraints are incorporated through learnable coordinate embeddings, computing compatibility scores fused with feature similarities, thereby promoting spatially proximal features to the same cluster and enhancing spatial awareness. Experimental results on MSLS, NordLand, and Pittsburgh datasets demonstrate superior performance, validating the effectiveness of our approach in improving matching accuracy and robustness.

$A^2$GC-VPR is a new method for Visual Place Recognition (VPR) that addresses a limitation of optimal transport-based aggregation when matching query images to a database: the standard Sinkhorn algorithm treats source and target marginals symmetrically, which is less effective when image features and cluster centers follow substantially different distributions. The method uses row-column normalization averaging with separate marginal calibration to allow asymmetric matching. It also adds geometric constraints through learnable coordinate embeddings, whose compatibility scores are fused with feature similarities so that spatially proximal features are assigned to the same cluster.

$A^2$GC: $A$symmetric $A$ggregation with Geometric Constraints for Locally Aggregated Descriptors
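
For context, the symmetric baseline that the abstract contrasts against can be sketched as follows. This is the textbook Sinkhorn iteration with uniform marginals, not the $A^2$GC method itself; the function name `sinkhorn`, the uniform marginals, and the iteration count are assumptions chosen for illustration. Rows play the role of image features and columns the role of cluster centers.

```python
import math


def sinkhorn(scores, n_iters=50):
    """Standard Sinkhorn normalization of exp(scores) with uniform marginals.

    Illustrative baseline only; the asymmetric variant in the paper differs.
    """
    plan = [[math.exp(s) for s in row] for row in scores]
    n_rows, n_cols = len(plan), len(plan[0])

    for _ in range(n_iters):
        # Row step: rescale each row so it sums to 1 / n_rows.
        for i in range(n_rows):
            total = sum(plan[i])
            plan[i] = [v / (total * n_rows) for v in plan[i]]

        # Column step: rescale each column so it sums to 1 / n_cols.
        for j in range(n_cols):
            total = sum(plan[i][j] for i in range(n_rows))
            for i in range(n_rows):
                plan[i][j] /= total * n_cols

    return plan
```

Both marginals are handled by the same alternating rescaling here, which is the symmetry the abstract identifies as limiting when feature and cluster distributions differ.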

arXiv:2511.14247v1 Announce Type: new 
Abstract: Multi-agent systems rely on accurate poses to share and align observations, enabling collaborative perception of the environment. However, traditional GNSS-based localization often fails in GNSS-denied environments, making consistent feature alignment difficult during collaboration. To tackle this challenge, we propose a robust GNSS-free collaborative perception framework based on LiDAR localization. Specifically, we propose a lightweight Pose Generator with Confidence (PGC) to estimate compact pose and confidence representations. To alleviate the effects of localization errors, we further develop the Pose-Aware Spatio-Temporal Alignment Transformer (PASTAT), which performs confidence-aware spatial alignment while capturing essential temporal context. Additionally, we present a new simulation dataset, V2VLoc, which can be adapted for both LiDAR localization and collaborative detection tasks. V2VLoc comprises three subsets: Town1Loc, Town4Loc, and V2VDet. Town1Loc and Town4Loc offer multi-traversal sequences for training in localization tasks, whereas V2VDet is specifically intended for the collaborative detection task. Extensive experiments conducted on the V2VLoc dataset demonstrate that our approach achieves state-of-the-art performance under GNSS-denied conditions. We further conduct extended experiments on the real-world V2V4Real dataset to validate the effectiveness and generalizability of PASTAT.

The article presents a new framework for GNSS-free collaborative perception using LiDAR localization, addressing the challenges faced in GNSS-denied environments. Traditional localization methods often struggle in these settings, hindering effective collaboration among multi-agent systems. The proposed solution includes a lightweight Pose Generator with Confidence (PGC) for estimating poses and confidence, alongside the Pose-Aware Spatio-Temporal Alignment Transformer (PASTAT) for spatial alignment. A new simulation dataset, V2VLoc, is introduced, which supports LiDAR localization and collabor…

V2VLoc: Robust GNSS-Free Collaborative Perception via LiDAR Localization

arXiv:2511.14210v1 Announce Type: cross 
Abstract: We introduce Orion, a visual agent framework that can take in any modality and generate any modality. Using an agentic framework with multiple tool-calling capabilities, Orion is designed for visual AI tasks and achieves state-of-the-art results. Unlike traditional vision-language models that produce descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, Optical Character Recognition, and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance on MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic vision-language models to production-grade visual intelligence. By combining neural perception with symbolic execution, Orion enables autonomous visual reasoning, marking a transition from passive visual understanding to active, tool-driven visual intelligence.

Orion is a newly introduced visual agent framework capable of processing and generating various modalities. It employs an agentic framework with multiple tool-calling capabilities, achieving state-of-the-art results in visual AI tasks. Unlike traditional vision-language models, Orion utilizes specialized computer vision tools for complex visual workflows, achieving competitive performance on benchmarks like MMMU, MMBench, DocVQA, and MMLongBench. This system marks a shift towards autonomous visual reasoning, enhancing visual intelligence.
