GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding

arXiv — cs.LG · Wednesday, November 12, 2025 at 5:00:00 AM
The GUI-AIMA framework targets graphical user interface (GUI) grounding, the task that lets computer-use agents translate natural-language instructions into actionable screen regions. Conventional approaches built on multimodal large language models (MLLMs) struggle with the computational cost of generating precise coordinates directly from visual input. GUI-AIMA instead aligns the intrinsic multimodal attention of the MLLM with a context anchor, first selecting the relevant visual patches and only then resolving a precise click location. Beyond improving the mapping from instructions to screen regions, the approach is notably data-efficient: trained on just 85,000 screenshots, it reaches an average accuracy of 59.6% on the ScreenSpot-Pro benchmark, placing it at the state of the art among 3B-parameter models. The imp…
— via World Pulse Now AI Editorial System
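To make the patch-selection idea concrete, here is a minimal sketch of how anchor-guided attention could be turned into a click coordinate: attention weights from a context-anchor token to the visual patch tokens are averaged over heads and used to weight the patch centres. This is an illustration under stated assumptions, not the paper's implementation; the function name, grid dimensions, and the `anchor_to_patch_attn` input are all hypothetical.

```python
# Minimal sketch (not the authors' code) of attention-based patch selection
# followed by click-point estimation. Assumes the MLLM exposes per-head
# attention weights from a designated "context anchor" token to the visual
# patch tokens; all names and shapes here are illustrative assumptions.
import numpy as np

def select_click_point(anchor_to_patch_attn: np.ndarray,
                       grid_h: int, grid_w: int,
                       screen_h: int, screen_w: int) -> tuple[int, int]:
    """Turn anchor-to-patch attention into a screen click coordinate.

    anchor_to_patch_attn: shape (num_heads, grid_h * grid_w), attention
        weights from the anchor token to each visual patch token.
    """
    # Average over heads to get one relevance score per visual patch.
    patch_scores = anchor_to_patch_attn.mean(axis=0)        # (grid_h * grid_w,)

    # Normalise and take the attention-weighted centroid of patch centres,
    # rather than committing to a single argmax patch.
    probs = patch_scores / patch_scores.sum()
    probs_2d = probs.reshape(grid_h, grid_w)

    ys = (np.arange(grid_h) + 0.5) / grid_h * screen_h       # patch-centre rows
    xs = (np.arange(grid_w) + 0.5) / grid_w * screen_w       # patch-centre cols

    click_y = float((probs_2d.sum(axis=1) * ys).sum())
    click_x = float((probs_2d.sum(axis=0) * xs).sum())
    return int(round(click_x)), int(round(click_y))

# Example: a 24x24 patch grid over a 1920x1080 screenshot, with random
# weights standing in for real MLLM attention.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    attn = rng.random((16, 24 * 24))
    print(select_click_point(attn, grid_h=24, grid_w=24,
                             screen_h=1080, screen_w=1920))
```

The centroid-over-patch-centres step is just one plausible way to go from selected patches to a single click point; the paper's actual mechanism for resolving the precise location may differ.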
