arXiv:2511.09127v1 Announce Type: cross 
Abstract: Advances in Multimodal Large Language Models have significantly enhanced Graphical User Interface (GUI) automation. Equipping GUI agents with reliable episodic reasoning capabilities is essential for bridging the gap between users' concise task descriptions and the complexities of real-world execution. Current methods integrate Reinforcement Learning (RL) with System-2 Chain-of-Thought, yielding notable gains in reasoning. For long-horizon GUI tasks, historical interactions connect each screen to the goal-oriented episode chain, and effectively leveraging these clues is crucial for the current decision. However, existing native GUI agents exhibit weak short-term memory in their explicit reasoning: they interpret the chained interactions as discrete screen understanding, i.e., they are unaware of the historical interactions within the episode. This history-agnostic reasoning limits their performance in GUI automation. To alleviate this weakness, we propose a History-Aware Reasoning (HAR) framework, which encourages an agent to reflect on its own errors and acquire episodic reasoning knowledge from them via tailored strategies that enhance short-term memory in long-horizon interaction. The framework mainly comprises three components: constructing a reflective learning scenario, synthesizing tailored correction guidelines, and designing a hybrid RL reward function. Using the HAR framework, we develop a native end-to-end model, HAR-GUI-3B, which shifts the inherent reasoning mode from history-agnostic to history-aware, equipping the GUI agent with stable short-term memory and reliable perception of screen details. Comprehensive evaluations across a range of GUI-related benchmarks demonstrate the effectiveness and generalization of our method.
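To make the distinction between history-agnostic and history-aware decision-making concrete, here is a minimal Python sketch of how an episode's past interactions might be exposed to an agent at each step. It is not HAR-GUI-3B's implementation, and the abstract does not describe the model's input format. The names `Step`, `Episode`, `build_prompt`, and `max_history`, as well as the prompt layout, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One past interaction in the episode (hypothetical schema)."""
    screen_summary: str  # short description of what the screen showed
    action: str          # the action the agent took on that screen


@dataclass
class Episode:
    """A task goal plus the chain of interactions taken so far."""
    goal: str
    history: List[Step] = field(default_factory=list)


def build_prompt(episode: Episode, current_screen: str, max_history: int = 5) -> str:
    """Compose a text prompt that includes the most recent steps.

    A history-agnostic prompt would contain only the goal and the
    current screen; here the recent steps are listed as well.
    """
    # Guard against max_history == 0, since history[-0:] is the whole list.
    recent = episode.history[-max_history:] if max_history > 0 else []
    offset = len(episode.history) - len(recent)

    lines = [f"Goal: {episode.goal}"]
    if recent:
        lines.append("Previous steps:")
        for i, step in enumerate(recent, start=offset + 1):
            lines.append(f"  {i}. saw: {step.screen_summary} | did: {step.action}")
    else:
        lines.append("Previous steps: none")
    lines.append(f"Current screen: {current_screen}")
    lines.append("Next action:")
    return "\n".join(lines)


if __name__ == "__main__":
    episode = Episode(goal="Turn on dark mode")
    episode.history.append(Step("home screen", "tap Settings"))
    episode.history.append(Step("settings list", "tap Display"))
    print(build_prompt(episode, "display options"))
```

Setting `max_history=0` reduces this to the history-agnostic case, which makes it easy to compare the two prompt styles side by side.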

Recent advancements in Multimodal Large Language Models have improved Graphical User Interface (GUI) automation. However, existing GUI agents struggle with short-term memory in reasoning, leading to ineffective task execution. To address this, a new History-Aware Reasoning (HAR) framework has been proposed, which enhances agents' episodic reasoning by reflecting on past errors. This development is crucial for bridging the gap between user instructions and real-world task complexities.

History-Aware Reasoning for GUI Agents
