Four research papers published on arXiv on March 30, 2026, present different approaches to improving GUI grounding, the ability of AI agents to map natural-language instructions to actionable regions on computer screens.
According to arxiv.org, GUI-AIMA introduces an attention-based framework that aligns intrinsic multimodal attention in large language models with patch-wise grounding signals. The GUI-AIMA-3B model was trained with only 509k samples (around 101k screenshots) and achieved 61.5% accuracy on ScreenSpot-Pro, 92.1% on ScreenSpot-v2, 68.1% on OSWorld-G, 79.1% on MMBench-GUI-L2, and 60.0% on UI-Vision.
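The paper is summarized here only at a high level, but the core idea of turning patch-wise attention into a grounding decision can be illustrated with a minimal sketch. The following Python snippet is an assumption-labeled illustration, not GUI-AIMA's actual pipeline: it takes a vector of attention weights over image patches (for example, attention from instruction tokens to vision tokens) and converts the most-attended patch into a normalized click point.

```python
import numpy as np

def attention_to_click_point(attn, grid_h, grid_w):
    """Map a patch-wise attention vector to a normalized (x, y) click point.

    attn: 1-D array of length grid_h * grid_w, one weight per image patch.
    Returns (x, y) in [0, 1], the centre of the highest-weighted patch.
    This is a simplified illustration of attention-based grounding, not the
    method described in the GUI-AIMA paper.
    """
    attn = np.asarray(attn, dtype=np.float64).reshape(grid_h, grid_w)
    row, col = np.unravel_index(np.argmax(attn), attn.shape)
    x = (col + 0.5) / grid_w  # patch-centre column, normalized to screen width
    y = (row + 0.5) / grid_h  # patch-centre row, normalized to screen height
    return x, y

# Toy usage: a 3x4 patch grid where patch (row 1, col 2) is most attended.
weights = np.zeros(12)
weights[1 * 4 + 2] = 1.0
print(attention_to_click_point(weights, grid_h=3, grid_w=4))  # -> (0.625, 0.5)
```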
A separate paper on arxiv.org presents Visual Re-Examination (VRE), a framework that enables multimodal models to perform visual introspection during reasoning without additional visual inputs. According to the researchers, models progressively drift from image evidence in long-form generation, falling back on textual priors.
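The drift the researchers describe can be made concrete with a simple diagnostic. The sketch below is hypothetical and not taken from the VRE paper: it computes, for each generated token, the fraction of attention mass that lands on image tokens, so a downward trend over the sequence would correspond to the reported fallback onto textual priors.

```python
import numpy as np

def visual_attention_share(attn_rows, image_token_mask):
    """Fraction of each generated token's attention that lands on image tokens.

    attn_rows: (num_generated_tokens, context_len) attention weights, one row
               per decoding step (e.g., averaged over heads and layers).
    image_token_mask: boolean array of length context_len marking vision tokens.
    A declining result over decoding steps would indicate drift away from
    image evidence; this is an illustrative metric, not the VRE method itself.
    """
    attn_rows = np.asarray(attn_rows, dtype=np.float64)
    mask = np.asarray(image_token_mask, dtype=bool)
    on_image = attn_rows[:, mask].sum(axis=1)
    total = attn_rows.sum(axis=1)
    return on_image / np.clip(total, 1e-12, None)

# Toy example: 4 decode steps, 6 context positions, first 3 are image tokens.
attn = np.array([
    [0.30, 0.30, 0.20, 0.10, 0.05, 0.05],  # early step: mostly on the image
    [0.20, 0.20, 0.10, 0.20, 0.20, 0.10],
    [0.10, 0.10, 0.10, 0.30, 0.20, 0.20],
    [0.05, 0.05, 0.05, 0.30, 0.30, 0.25],  # late step: mostly on text
])
print(visual_attention_share(attn, [True, True, True, False, False, False]))
```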
According to arxiv.org, GUIDE addresses domain bias in GUI agents through a training-free framework that acquires domain-specific expertise from web tutorial videos. Testing on OSWorld showed improvements of over 5% without modifying any model parameters.
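The general pattern behind a training-free approach like this is to retrieve domain knowledge at inference time and inject it into the prompt rather than into the weights. The sketch below is an assumption, not GUIDE's actual pipeline: it ranks tutorial-derived step descriptions by word overlap with the task and prepends the best matches to the agent prompt.

```python
def retrieve_tutorial_steps(task, tutorial_steps, top_k=3):
    """Rank tutorial-derived step descriptions by word overlap with the task.

    task: the user instruction, e.g. "export the chart as an image".
    tutorial_steps: short step strings extracted from tutorial videos
                    (e.g., transcript sentences or captioned keyframes).
    Returns the top_k most relevant steps, to be prepended to the agent prompt
    so domain knowledge is used without updating any model weights.
    """
    task_words = set(task.lower().split())

    def overlap(step):
        return len(task_words & set(step.lower().split()))

    return sorted(tutorial_steps, key=overlap, reverse=True)[:top_k]

steps = [
    "click the Insert menu and choose Chart",
    "select the data range before inserting the chart",
    "use File > Download to export the sheet as an image",
    "open the View menu to toggle gridlines",
]
hints = retrieve_tutorial_steps("export the chart as an image", steps, top_k=2)
prompt = "Relevant tutorial steps:\n- " + "\n- ".join(hints) + "\n\nTask: export the chart as an image"
print(prompt)
```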
Finally, arxiv.org reports that researchers adapted discrete diffusion vision-language models for GUI grounding, proposing a hybrid masking schedule that improved Step Success Rate by up to 6.1 points. The paper was accepted to CVPR 2026.
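The summary does not specify how the hybrid masking schedule is constructed, so the following Python sketch is a labeled assumption of one plausible reading: part of the masking budget at each diffusion timestep is spent uniformly across the sequence, and the rest is concentrated on the coordinate tokens that serve as grounding targets.

```python
import random

def hybrid_mask(tokens, coord_positions, t, p_uniform=0.5, seed=None):
    """Apply one hybrid masking step for discrete-diffusion-style training.

    tokens:          token ids for one training sequence.
    coord_positions: indices of coordinate tokens (the grounding targets).
    t:               diffusion timestep in [0, 1]; higher t masks more tokens.
    p_uniform:       share of the masking budget spent uniformly at random;
                     the remainder is concentrated on coordinate tokens.
    Hypothetical illustration of a hybrid schedule, not the paper's recipe.
    """
    MASK_ID = -1
    rng = random.Random(seed)
    n_mask = max(1, int(round(t * len(tokens))))
    n_uniform = int(round(p_uniform * n_mask))
    n_coord = n_mask - n_uniform

    # First spend the coordinate-focused budget on grounding-target positions.
    coord = list(coord_positions)
    rng.shuffle(coord)
    picked = set(coord[:min(n_coord, len(coord))])

    # Spend the rest of the budget uniformly over the remaining positions.
    remaining = [i for i in range(len(tokens)) if i not in picked]
    rng.shuffle(remaining)
    picked.update(remaining[:n_mask - len(picked)])

    return [MASK_ID if i in picked else tok for i, tok in enumerate(tokens)]

# Toy sequence of 10 tokens where positions 6-9 hold the (x, y) coordinates.
seq = list(range(100, 110))
print(hybrid_mask(seq, coord_positions=[6, 7, 8, 9], t=0.4, seed=0))
```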