arXiv:2512.02423v1 Announce Type: new 
Abstract: With the rapid development of Large Vision Language Models, the focus of Graphical User Interface (GUI) agent tasks shifts from single-screen tasks to complex screen navigation challenges. However, real-world GUI environments, such as PC software and mobile Apps, are often complex and proprietary, making it difficult to obtain the comprehensive environment information needed for agent training and evaluation. This limitation hinders systematic investigation and benchmarking of agent navigation capabilities. To address this limitation, we introduce GUI Exploration Lab, a simulation environment engine for GUI agent navigation research that enables flexible definition and composition of screens, icons, and navigation graphs, while providing full access to environment information for comprehensive agent training and evaluation. Through extensive experiments, we find that supervised fine-tuning enables effective memorization of fundamental knowledge, serving as a crucial foundation for subsequent training. Building on this, single-turn reinforcement learning further enhances generalization to unseen scenarios. Finally, multi-turn reinforcement learning encourages the development of exploration strategies through interactive trial and error, leading to further improvements in screen navigation performance. We validate our methods on both static and interactive benchmarks, demonstrating that our findings generalize effectively to real-world scenarios. These findings demonstrate the advantages of reinforcement learning approaches in GUI navigation and offer practical guidance for building more capable and generalizable GUI agents.

GUI Exploration Lab is a simulation environment engine for GUI agent navigation research. It lets researchers define and compose screens, icons, and navigation graphs while exposing full environment information, which real-world GUI environments such as PC software and mobile apps rarely provide because they are complex and proprietary. The authors use it to study supervised fine-tuning, single-turn reinforcement learning, and multi-turn reinforcement learning for screen navigation.
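
The abstract describes environments built from screens, icons, and navigation graphs with full access to environment information. As a rough illustration only, the Python sketch below models such a graph as a dictionary and uses breadth-first search to recover a shortest icon sequence between two screens. All screen names, icon names, and function names are hypothetical, and the sparse reward is an assumption made for the example; none of this is taken from the GUI Exploration Lab code.

```python
from collections import deque

# Hypothetical toy navigation graph: screen -> {icon: destination screen}.
# Screen and icon names are illustrative only, not taken from the paper.
NAV_GRAPH = {
    "home": {"settings_icon": "settings", "mail_icon": "inbox"},
    "settings": {"back": "home", "wifi_icon": "wifi"},
    "inbox": {"back": "home", "compose_icon": "compose"},
    "wifi": {"back": "settings"},
    "compose": {"back": "inbox"},
}


def step(screen, icon):
    """Return the screen reached by tapping `icon`; unknown icons leave the screen unchanged."""
    return NAV_GRAPH[screen].get(icon, screen)


def shortest_path(start, goal):
    """Breadth-first search over the graph; returns the list of icons to tap, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        screen, icons = queue.popleft()
        if screen == goal:
            return icons
        for icon, nxt in NAV_GRAPH[screen].items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, icons + [icon]))
    return None


def path_reward(start, goal, icons):
    """Assumed sparse reward: 1.0 if following `icons` from `start` ends on `goal`, else 0.0."""
    screen = start
    for icon in icons:
        screen = step(screen, icon)
    return 1.0 if screen == goal else 0.0
```

Because the whole graph is known, a reference path such as `shortest_path("wifi", "compose")` can be computed directly and compared against an agent's actions, which is the kind of check that is hard to run on a proprietary application.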

GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning

arXiv:2512.07141v1 Announce Type: new 
Abstract: As multimodal reasoning improves the overall capabilities of Large Vision Language Models (LVLMs), recent studies have begun to explore safety-oriented reasoning, aiming to enhance safety awareness by analyzing potential safety risks during the reasoning process before generating the final response. Although such approaches improve safety awareness and interpretability, this single-pass think-then-answer paradigm remains vulnerable to contextual or visual jailbreak attacks. This reveals a critical flaw: single-pass reasoning may overlook explicit harmful content in its own output. Our key insight is to exploit this wasted signal through reflection, which can effectively leverage the malicious content revealed in the first-pass reasoning to enable genuine self-correction and prevent unsafe generations. Motivated by this, we propose Think-Reflect-Revise (TRR), a three-stage training framework designed to enhance the safety alignment of LVLMs through policy-guided self-reflection. We first build a Reflective Safety Reasoning (ReSafe) dataset with 5,000 examples that follow a think-reflect-revise process. We then fine-tune the target model using the ReSafe dataset to initialize reflective behavior, and finally reinforce policy-guided reflection through reinforcement learning. Experimental results show that TRR substantially improves the safety performance of LVLMs across both safety-awareness benchmarks and jailbreak attack evaluations, increasing the overall safe response rate from 42.8% to 87.7% on Qwen2.5-VL-7B, while preserving stable performance on general benchmarks such as MMMU and MMStar. The project page is available at https://think-reflect-revise.github.io/.

A new framework called Think-Reflect-Revise (TRR) has been proposed to enhance the safety alignment of Large Vision Language Models (LVLMs) by incorporating a three-stage training process that allows for self-correction during reasoning. This approach addresses vulnerabilities in single-pass reasoning that may overlook harmful content in outputs.
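
To make the think-reflect-revise idea concrete, the Python sketch below shows one possible control flow: produce a draft, inspect it, and rewrite it only if the inspection raises a concern. This is an assumption-laden illustration, not the authors' method. The paper trains reflective behavior into the model through fine-tuning and reinforcement learning, whereas here the three stages are hypothetical callables and the toy stand-ins use a simple string check in place of a model.

```python
from typing import Callable, Optional


def think_reflect_revise(
    prompt: str,
    think: Callable[[str], str],
    reflect: Callable[[str, str], Optional[str]],
    revise: Callable[[str, str, str], str],
) -> str:
    """Run one think -> reflect -> revise pass and return the final response."""
    draft = think(prompt)              # first-pass reasoning and answer
    concern = reflect(prompt, draft)   # inspect the draft; None means no issue found
    if concern is None:
        return draft
    return revise(prompt, draft, concern)  # rewrite the draft in light of the concern


# Toy stand-ins so the sketch runs; a real system would call a model here.
def toy_think(prompt: str) -> str:
    return f"Draft answer to: {prompt}"


def toy_reflect(prompt: str, draft: str) -> Optional[str]:
    # Hypothetical check: flag drafts that contain the marker string "FLAGGED".
    return "draft repeats a flagged request" if "FLAGGED" in draft else None


def toy_revise(prompt: str, draft: str, concern: str) -> str:
    return f"I can't help with that ({concern})."
```

The point of the structure is that the reflection step sees the first-pass output, so harmful content that surfaces in the draft becomes a signal for revision rather than being emitted as-is.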
