GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding
The GUI-AIMA framework marks a notable advance in graphical user interface (GUI) grounding, the task of translating natural-language instructions into actionable screen regions so that computer-use agents can act on them. Conventional approaches built on multimodal large language models (MLLMs) often struggle with the computational demands of generating precise coordinates directly from visual inputs. GUI-AIMA instead aligns the intrinsic multimodal attention of the MLLM with a context anchor, first selecting the visual patches most relevant to the instruction and only then resolving a precise click location. The framework is also highly data-efficient: trained on just 85,000 screenshots, it achieved an average accuracy of 59.6% on the ScreenSpot-Pro benchmark, a state-of-the-art result among 3B models. The imp…
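To make the anchor-based selection concrete, the following is a minimal sketch in Python/PyTorch of the general idea described above: aggregate the attention that a designated anchor token pays to the image-patch tokens, pick the most-attended patch, and map its grid index back to pixel coordinates. This is not GUI-AIMA's actual implementation; the function name, argument layout, and the single-patch argmax are illustrative assumptions.

```python
# Illustrative sketch only (not GUI-AIMA's code): use the attention an "anchor"
# token pays to image-patch tokens to pick a coarse region, then convert the
# chosen patch index into a pixel coordinate via the patch-grid geometry.
import torch

def select_patch_from_anchor_attention(
    attn: torch.Tensor,        # (layers, heads, seq_len, seq_len) attention weights from the MLLM
    anchor_idx: int,           # position of the context-anchor token in the sequence
    patch_token_slice: slice,  # positions of the image-patch tokens in the sequence
    grid_hw: tuple[int, int],  # (rows, cols) of the vision patch grid
    image_hw: tuple[int, int], # (height, width) of the screenshot in pixels
) -> tuple[int, int]:
    """Return an (x, y) pixel coordinate at the center of the most-attended patch."""
    # Average the anchor token's attention over layers and heads, keeping only patch tokens.
    anchor_to_patches = attn[:, :, anchor_idx, patch_token_slice].mean(dim=(0, 1))

    # Pick the single most-attended patch (a real system would likely keep a top-k region
    # and refine the click point within it).
    best = int(torch.argmax(anchor_to_patches))
    rows, cols = grid_hw
    r, c = divmod(best, cols)

    # Map the patch index back to pixel space (center of the patch cell).
    h, w = image_hw
    y = int((r + 0.5) * h / rows)
    x = int((c + 0.5) * w / cols)
    return x, y

# Toy usage with random weights standing in for real MLLM attention outputs.
if __name__ == "__main__":
    layers, heads, seq = 4, 8, 80
    attn = torch.rand(layers, heads, seq, seq).softmax(dim=-1)
    xy = select_patch_from_anchor_attention(
        attn, anchor_idx=70, patch_token_slice=slice(0, 64),
        grid_hw=(8, 8), image_hw=(1080, 1920),
    )
    print("predicted click:", xy)
```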
— via World Pulse Now AI Editorial System