"It's trained by non-disabled people": Evaluating How Image Quality Affects Product Captioning with VLMs

arXiv — cs.CV · Thursday, November 13, 2025 at 5:00:00 AM
A recent study of Vision-Language Models (VLMs) offers significant insights into how effectively they serve blind and low-vision (BLV) individuals. Surveying 86 BLV participants, the researchers found that while VLMs can achieve 98% accuracy when recognizing products from high-quality images, accuracy drops to 75% when images suffer from common quality issues such as blur or misframing. This decline underscores the need for VLM evaluations that center the experiences of disabled users. As VLMs become more prevalent in assisting BLV individuals with everyday tasks, user-centered evaluations are essential to ensuring their reliability. The study offers concrete recommendations for researchers in human-computer interaction (HCI) and machine learning (ML) to improve these models so they better serve the information needs of BLV people.
— via World Pulse Now AI Editorial System
