arXiv:2511.11407v1 Announce Type: new 
Abstract: Multimodal Large Language Models are increasingly applied to biomedical imaging, yet scientific reasoning for microscopy remains limited by the scarcity of large-scale, high-quality training data. We introduce MicroVQA++, a three-stage, large-scale and high-quality microscopy VQA corpus derived from the BIOMEDICA archive. Stage one bootstraps supervision from expert-validated figure-caption pairs sourced from peer-reviewed articles. Stage two applies HiCQA-Graph, a novel heterogeneous graph over images, captions, and QAs that fuses NLI-based textual entailment, CLIP-based vision-language alignment, and agent signals to identify and filter inconsistent samples. Stage three uses a MultiModal Large Language Model (MLLM) agent to generate multiple-choice questions (MCQ) followed by human screening. The resulting release comprises a large training split and a human-checked test split whose Bloom's level hard-sample distribution exceeds the MicroVQA benchmark. Our work delivers (i) a quality-controlled dataset that couples expert literature with graph-based filtering and human refinement; (ii) HiCQA-Graph, the first graph that jointly models (image, caption, QA) for cross-modal consistency filtering; (iii) evidence that careful data construction enables 4B-scale MLLMs to reach competitive microscopy reasoning performance (e.g., GPT-5) and achieve state-of-the-art performance among open-source MLLMs. Code and dataset will be released after the review process concludes.

MicroVQA++ هو مجموعة بيانات جديدة عالية الجودة للسببية في علم المجهر مصممة لنماذج اللغة متعددة الوسائط (MLLM). تم اشتقاقها من أرشيف BIOMEDICA وتتكون من عملية من ثلاث مراحل تشمل أزواج الشكل-التعليق المعتمدة من قبل الخبراء، ورسم بياني غير متجانس مبتكر لتصفية العينات غير المتسقة، وأسئلة متعددة الخيارات تم التحقق منها بواسطة البشر. تهدف هذه المجموعة إلى تحسين التفكير العلمي في التصوير الطبي الحيوي، مع معالجة القيود الحالية بسبب نقص بيانات التدريب على نطاق واسع.

MicroVQA++ es un nuevo conjunto de datos de razonamiento en microscopía de alta calidad diseñado para modelos de lenguaje multimodales (MLLM). Se deriva del archivo BIOMEDICA y consta de un proceso en tres etapas que incluye pares figura-leyenda validados por expertos, un novedoso grafo heterogéneo para filtrar muestras inconsistentes y preguntas de opción múltiple revisadas por humanos. Este conjunto de datos busca mejorar el razonamiento científico en imágenes biomédicas, abordando las limitaciones actuales debido a la escasez de datos de entrenamiento a gran escala.

MicroVQA++ est un nouvel ensemble de données de raisonnement en microscopie de haute qualité conçu pour les modèles de langage multimodaux (MLLM). Il est dérivé de l'archive BIOMEDICA et comprend un processus en trois étapes qui inclut des paires figure-légende validées par des experts, un graphe hétérogène novateur pour filtrer les échantillons incohérents et des questions à choix multiples vérifiées par des humains. Cet ensemble de données vise à améliorer le raisonnement scientifique en imagerie biomédicale, répondant aux limitations actuelles dues au manque de données d'entraînement à gran…

MicroVQA++ is a newly introduced high-quality microscopy reasoning dataset designed for multimodal large language models (MLLMs). It is derived from the BIOMEDICA archive and consists of a three-stage process that includes expert-validated figure-caption pairs, a novel heterogeneous graph for filtering inconsistent samples, and human-checked multiple-choice questions. This dataset aims to enhance scientific reasoning in biomedical imaging, addressing the current limitations due to the lack of large-scale training data.

MicroVQA++: High-Quality Microscopy Reasoning Dataset with Weakly Supervised Graphs for Multimodal Large Language Model

Carry your cash and cards along with your phone thanks to MagSafe wallets from brands like Moft, ESR, and Nomad.

Best MagSafe wallets 2025: I tested the best options to store your cash and cards

With the latest Grit X2, Polar continues to optimize experiences for training, coaching, and performance improvement.

Not enough people are talking about this Garmin competitor that wins in unique ways

What do you get for the person who has everything? There's a good chance that something on this list will appeal to them.

The 12 best gizmos to gift the person who has everything, according to a gadget expert

Great deals are popping up everywhere this holiday shopping season, and kids' tech is no exception.

I'm always looking for deals on kids' tech - here's what I'm buying this holiday season

We tested and reviewed the best 40-inch TVs you can buy from brands like Samsung, Sony, and TCL to help you find the right fit for your budget and home theater.

The best 40-inch TVs of 2025: Expert tested for smaller spaces

We review dozens of laptops each year at ZDNET. These are the most popular ones among our readers for 2025.

The top 10 laptops our readers bought this year (no. 1 surprised us)

arXiv:2511.08195v2 Announce Type: replace 
Abstract: User interface (UI) programming is a core yet highly complex part of modern software development. Recent advances in visual language models (VLMs) highlight the potential of automatic UI coding, but current approaches face two key limitations: multimodal coding capabilities remain underdeveloped, and single-turn paradigms make little use of iterative visual feedback. We address these challenges with an interactive UI-to-code paradigm that better reflects real-world workflows and raises the upper bound of achievable performance. Under this paradigm, we present UI2Code$^\text{N}$, a visual language model trained through staged pretraining, fine-tuning, and reinforcement learning to achieve foundational improvements in multimodal coding. The model unifies three key capabilities: UI-to-code generation, UI editing, and UI polishing. We further explore test-time scaling for interactive generation, enabling systematic use of multi-turn feedback. Experiments on UI-to-code and UI polishing benchmarks show that UI2Code$^\text{N}$ establishes a new state of the art among open-source models and achieves performance comparable to leading closed-source models such as Claude-4-Sonnet and GPT-5. Our code and models are available at https://github.com/zai-org/UI2Code_N.

برمجة واجهة المستخدم (UI) هي جزء معقد من تطوير البرمجيات الحديثة. تظهر التطورات الأخيرة في نماذج اللغة المرئية (VLMs) إمكانية الترميز التلقائي لواجهة المستخدم، ولكن الأساليب الحالية تواجه قيودًا رئيسية: قدرات الترميز متعددة الوسائط لا تزال غير متطورة، ونماذج التفاعل الفردية لا تستفيد كثيرًا من التغذية الراجعة البصرية التكرارية. نحن نتناول هذه التحديات من خلال نموذج UI2Code^N، وهو نموذج لغة مرئية تم تدريبه من خلال مراحل ما قبل التدريب، والتعديل الدقيق، والتعلم المعزز لتحقيق تحسينات أساسية في الترميز متعدد الوسائط.

La programación de interfaces de usuario (UI) es un aspecto complejo del desarrollo de software. Los avances recientes en modelos de lenguaje visual (VLM) muestran un potencial para la codificación automática de UI, pero los métodos actuales enfrentan limitaciones en capacidades multimodales y retroalimentación iterativa. El modelo UI2Code^N aborda estos problemas mediante un enfoque interactivo de UI a código, mejorando el rendimiento al integrar la generación, edición y pulido de UI. Este modelo se entrena a través de preentrenamiento por etapas, ajuste fino y aprendizaje por refuerzo, con e…

La programmation d'interface utilisateur (UI) est un aspect complexe du développement logiciel. Les avancées récentes dans les modèles de langage visuel (VLM) montrent un potentiel pour le codage automatique d'UI, mais les méthodes existantes rencontrent des limitations en matière de capacités multimodales et de rétroaction itérative. Le modèle UI2Code^N aborde ces problèmes grâce à une approche interactive de l'UI vers le code, améliorant les performances en intégrant la génération, l'édition et le polissage de l'UI. Ce modèle est formé par préentraînement par étapes, ajustement fin et appren…

UI programming is a complex aspect of software development. Recent advancements in visual language models (VLMs) show promise for automatic UI coding, yet existing methods face limitations in multimodal capabilities and iterative feedback. The UI2Code^N model addresses these issues through an interactive UI-to-code approach, enhancing performance by integrating UI generation, editing, and polishing. This model is trained using staged pretraining, fine-tuning, and reinforcement learning, aiming to improve multimodal coding significantly.

UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation

arXiv:2511.11470v1 Announce Type: new 
Abstract: Recent advances in generative modeling have substantially enhanced 3D urban generation, enabling applications in digital twins, virtual cities, and large-scale simulations. However, existing methods face two key challenges: (1) the need for large-scale 3D city assets for supervised training, which are difficult and costly to obtain, and (2) reliance on semantic or height maps, which are used exclusively for generating buildings in virtual worlds and lack connection to real-world appearance, limiting the realism and generalizability of generated cities. To address these limitations, we propose Sat2RealCity, a geometry-aware and appearance-controllable framework for 3D urban generation from real-world satellite imagery. Unlike previous city-level generation methods, Sat2RealCity builds generation upon individual building entities, enabling the use of rich priors and pretrained knowledge from 3D object generation while substantially reducing dependence on large-scale 3D city assets. Specifically, (1) we introduce the OSM-based spatial priors strategy to achieve interpretable geometric generation from spatial topology to building instances; (2) we design an appearance-guided controllable modeling mechanism for fine-grained appearance realism and style control; and (3) we construct an MLLM-powered semantic-guided generation pipeline, bridging semantic interpretation and geometric reconstruction. Extensive quantitative and qualitative experiments demonstrate that Sat2RealCity significantly surpasses existing baselines in structural consistency and appearance realism, establishing a strong foundation for real-world aligned 3D urban content creation. The code will be released soon.

تقدم الورقة المعنونة 'Sat2RealCity: Geometry-Aware and Appearance-Controllable 3D Urban Generation from Satellite Imagery' إطارًا مبتكرًا لتوليد البيئات الحضرية ثلاثية الأبعاد باستخدام صور الأقمار الصناعية من العالم الحقيقي. تتناول هذه الطريقة تحديات كبيرة في الأساليب الحالية، مثل الحاجة إلى أصول مدنية ثلاثية الأبعاد واسعة النطاق والاعتماد على الخرائط الدلالية أو خرائط الارتفاع. من خلال التركيز على كيانات المباني الفردية، يعزز Sat2RealCity الواقعية وقابلية التعميم في نمذجة المدن.

El artículo titulado 'Sat2RealCity: Geometry-Aware and Appearance-Controllable 3D Urban Generation from Satellite Imagery' presenta un marco novedoso para la generación de entornos urbanos en 3D utilizando imágenes satelitales del mundo real. Este enfoque aborda desafíos significativos en los métodos existentes, como la necesidad de activos urbanos 3D extensos y las limitaciones de los mapas semánticos o de altura. Al centrarse en entidades individuales de edificios, Sat2RealCity mejora el realismo y la generalizabilidad en la modelización urbana.

L'article intitulé 'Sat2RealCity: Geometry-Aware and Appearance-Controllable 3D Urban Generation from Satellite Imagery' présente un cadre novateur pour la génération d'environnements urbains 3D à partir d'images satellites du monde réel. Cette approche répond à des défis importants dans les méthodes existantes, tels que le besoin d'actifs urbains 3D étendus et les limitations des cartes sémantiques ou de hauteur. En se concentrant sur des entités de bâtiments individuelles, Sat2RealCity améliore le réalisme et la généralisabilité de la modélisation urbaine.

The paper titled 'Sat2RealCity: Geometry-Aware and Appearance-Controllable 3D Urban Generation from Satellite Imagery' presents a novel framework for generating 3D urban environments using real-world satellite images. This approach addresses significant challenges in existing methods, such as the need for extensive 3D city assets and the limitations of semantic or height maps. By focusing on individual building entities, Sat2RealCity enhances realism and generalizability in urban modeling.

Sat2RealCity: Geometry-Aware and Appearance-Controllable 3D Urban Generation from Satellite Imagery

arXiv:2511.14109v1 Announce Type: new 
Abstract: Visual Place Recognition (VPR) aims to match query images against a database using visual cues. State-of-the-art methods aggregate features from deep backbones to form global descriptors. Optimal transport-based aggregation methods reformulate feature-to-cluster assignment as a transport problem, but the standard Sinkhorn algorithm symmetrically treats source and target marginals, limiting effectiveness when image features and cluster centers exhibit substantially different distributions. We propose an asymmetric aggregation VPR method with geometric constraints for locally aggregated descriptors, called $A^2$GC-VPR. Our method employs row-column normalization averaging with separate marginal calibration, enabling asymmetric matching that adapts to distributional discrepancies in visual place recognition. Geometric constraints are incorporated through learnable coordinate embeddings, computing compatibility scores fused with feature similarities, thereby promoting spatially proximal features to the same cluster and enhancing spatial awareness. Experimental results on MSLS, NordLand, and Pittsburgh datasets demonstrate superior performance, validating the effectiveness of our approach in improving matching accuracy and robustness.

$A^2$GC-VPR هو أسلوب جديد للتعرف على الأماكن البصرية (VPR) يتناول قيود أساليب التجميع التقليدية في مطابقة صور الاستعلام مع قاعدة بيانات. من خلال اعتماد نهج تجميع غير متماثل مع قيود هندسية، يعزز هذا الأسلوب فعالية مطابقة الميزات، خاصة عند التعامل مع توزيعات متباينة لميزات الصورة ومراكز التجمع. تستخدم التقنية متوسطات تطبيع الصفوف والأعمدة مع تضمينات إحداثيات قابلة للتعلم لتحسين درجات التوافق لوصفيات التجميع المحلي.

$A^2$GC-VPR es un nuevo método para el Reconocimiento Visual de Lugares (VPR) que aborda las limitaciones de los métodos de agregación tradicionales al emparejar imágenes de consulta con una base de datos. Al emplear un enfoque de agregación asimétrica con restricciones geométricas, este método mejora la efectividad del emparejamiento de características, especialmente cuando se enfrentan a distribuciones variables de características de imagen y centros de clúster. La técnica utiliza promedios de normalización fila-columna y embeddings de coordenadas aprendibles para mejorar las puntuaciones de…

$A^2$GC-VPR est une nouvelle méthode pour la reconnaissance de lieux visuels (VPR) qui s'attaque aux limites des méthodes d'agrégation traditionnelles dans l'appariement d'images de requête à une base de données. En adoptant une approche d'agrégation asymétrique avec des contraintes géométriques, cette méthode améliore l'efficacité de l'appariement des caractéristiques, en particulier lorsqu'il s'agit de distributions variées des caractéristiques d'image et des centres de clusters. La technique utilise une moyenne de normalisation ligne-colonne et des embeddings de coordonnées apprenables pour…

$A^2$GC-VPR is a new method for Visual Place Recognition (VPR) that addresses the limitations of traditional aggregation methods in matching query images to a database. By employing an asymmetric aggregation approach with geometric constraints, this method enhances the effectiveness of feature matching, particularly when dealing with varying distributions of image features and cluster centers. The technique utilizes row-column normalization averaging and learnable coordinate embeddings to improve compatibility scores for locally aggregated descriptors.

$A^2$GC: $A$symmetric $A$ggregation with Geometric Constraints for Locally Aggregated Descriptors

arXiv:2511.14247v1 Announce Type: new 
Abstract: Multi-agents rely on accurate poses to share and align observations, enabling a collaborative perception of the environment. However, traditional GNSS-based localization often fails in GNSS-denied environments, making consistent feature alignment difficult in collaboration. To tackle this challenge, we propose a robust GNSS-free collaborative perception framework based on LiDAR localization. Specifically, we propose a lightweight Pose Generator with Confidence (PGC) to estimate compact pose and confidence representations. To alleviate the effects of localization errors, we further develop the Pose-Aware Spatio-Temporal Alignment Transformer (PASTAT), which performs confidence-aware spatial alignment while capturing essential temporal context. Additionally, we present a new simulation dataset, V2VLoc, which can be adapted for both LiDAR localization and collaborative detection tasks. V2VLoc comprises three subsets: Town1Loc, Town4Loc, and V2VDet. Town1Loc and Town4Loc offer multi-traversal sequences for training in localization tasks, whereas V2VDet is specifically intended for the collaborative detection task. Extensive experiments conducted on the V2VLoc dataset demonstrate that our approach achieves state-of-the-art performance under GNSS-denied conditions. We further conduct extended experiments on the real-world V2V4Real dataset to validate the effectiveness and generalizability of PASTAT.

يقدم المقال إطارًا جديدًا للإدراك التعاوني بدون GNSS باستخدام تحديد المواقع بواسطة LiDAR، حيث يتناول التحديات التي تواجهها البيئات التي تفتقر إلى GNSS. غالبًا ما تواجه طرق تحديد المواقع التقليدية صعوبات في هذه البيئات، مما يعيق التعاون الفعال بين أنظمة الوكلاء المتعددة. تتضمن الحلول المقترحة مولد وضع خفيف الوزن مع ثقة (PGC) لتقدير الأوضاع وتمثيلات الثقة، بالإضافة إلى محول التوافق الزماني المكاني الواعي بالوضع (PASTAT) الذي يقوم بأداء التوافق المكاني مع مراعاة الثقة. كما تم تقديم مجموعة بيانات محاكاة جديدة، V2VLoc، التي يمكن تكييفها لمهام تحديد المواقع بواسطة LiDAR والاكتشاف التعاوني.

El artículo presenta un nuevo marco para la percepción colaborativa sin GNSS utilizando la localización por LiDAR, abordando los desafíos que se enfrentan en entornos sin GNSS. Los métodos de localización tradicionales a menudo tienen dificultades en estos entornos, lo que dificulta la colaboración efectiva entre sistemas multiagente. La solución propuesta incluye un Generador de Pose con Confianza (PGC) para estimar poses y confianza, junto con el Transformador de Alineación Espacio-Temporal Consciente de la Pose (PASTAT) para el alineamiento espacial. Se introduce un nuevo conjunto de datos …

L'article présente un nouveau cadre pour la perception collaborative sans GNSS utilisant la localisation par LiDAR, abordant les défis rencontrés dans les environnements privés de GNSS. Les méthodes de localisation traditionnelles peinent souvent dans ces contextes, entravant la collaboration efficace entre systèmes multi-agents. La solution proposée comprend un générateur de pose léger avec confiance (PGC) pour estimer les poses et la confiance, ainsi qu'un transformateur d'alignement spatio-temporel conscient de la pose (PASTAT) pour l'alignement spatial. Un nouveau jeu de données de simulat…

The article presents a new framework for GNSS-free collaborative perception using LiDAR localization, addressing the challenges faced in GNSS-denied environments. Traditional localization methods often struggle in these settings, hindering effective collaboration among multi-agent systems. The proposed solution includes a lightweight Pose Generator with Confidence (PGC) for estimating poses and confidence, alongside the Pose-Aware Spatio-Temporal Alignment Transformer (PASTAT) for spatial alignment. A new simulation dataset, V2VLoc, is introduced, which supports LiDAR localization and collabor…

V2VLoc: Robust GNSS-Free Collaborative Perception via LiDAR Localization

arXiv:2511.14210v1 Announce Type: cross 
Abstract: We introduce Orion, a visual agent framework that can take in any modality and generate any modality. Using an agentic framework with multiple tool-calling capabilities, Orion is designed for visual AI tasks and achieves state-of-the-art results. Unlike traditional vision-language models that produce descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, Optical Character Recognition, and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance on MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic vision-language models to production-grade visual intelligence. By combining neural perception with symbolic execution, Orion enables autonomous visual reasoning, marking a transition from passive visual understanding to active, tool-driven visual intelligence.

أورايون هو إطار جديد لوكيل بصري قادر على معالجة وتوليد أنماط متعددة. يستخدم إطارًا وكيلًا مع قدرات متعددة لاستدعاء الأدوات، محققًا نتائج رائدة في مهام الذكاء الاصطناعي البصري. على عكس نماذج الرؤية-اللغة التقليدية، يستخدم أورايون أدوات رؤية حاسوبية متخصصة لتنفيذ سير عمل بصري معقد، محققًا أداءً تنافسيًا في معايير مثل MMMU وMMBench وDocVQA وMMLongBench. يمثل هذا النظام تحولًا نحو الاستدلال البصري المستقل، مما يعزز الذكاء البصري.

Orion es un nuevo marco de agente visual capaz de procesar y generar diversas modalidades. Utiliza un marco agentivo con múltiples capacidades de llamada a herramientas, logrando resultados de vanguardia en tareas de IA visual. A diferencia de los modelos tradicionales de visión-lenguaje, Orion emplea herramientas especializadas de visión por computadora para flujos de trabajo visuales complejos, alcanzando un rendimiento competitivo en benchmarks como MMMU, MMBench, DocVQA y MMLongBench. Este sistema marca una transición hacia el razonamiento visual autónomo, mejorando la inteligencia visual.

Orion est un nouveau cadre d'agent visuel capable de traiter et de générer diverses modalités. Il utilise un cadre agentique avec plusieurs capacités d'appel d'outils, atteignant des résultats de pointe dans les tâches d'IA visuelle. Contrairement aux modèles traditionnels de vision-langage, Orion utilise des outils de vision par ordinateur spécialisés pour des flux de travail visuels complexes, obtenant des performances compétitives sur des benchmarks tels que MMMU, MMBench, DocVQA et MMLongBench. Ce système marque un tournant vers le raisonnement visuel autonome, améliorant l'intelligence vi…

Orion is a newly introduced visual agent framework capable of processing and generating various modalities. It employs an agentic framework with multiple tool-calling capabilities, achieving state-of-the-art results in visual AI tasks. Unlike traditional vision-language models, Orion utilizes specialized computer vision tools for complex visual workflows, achieving competitive performance on benchmarks like MMMU, MMBench, DocVQA, and MMLongBench. This system marks a shift towards autonomous visual reasoning, enhancing visual intelligence.

MicroVQA++: High-Quality Microscopy Reasoning Dataset with Weakly Supervised Graphs for Multimodal Large Language Model

Was this article worth reading? Share it