
arXiv:2604.14262v2
Abstract: GUI grounding models report over 85% accuracy on standard benchmarks, yet drop 27-56 percentage points when instructions require spatial reasoning rather than direct element naming. Current benchmarks miss this because they evaluate each screenshot once with a single fixed instruction. We introduce GUI-Perturbed, a controlled perturbation framework that independently varies visual scenes and instructions to measure grounding robustness. Evaluating three 7B models from the same architecture lineage, we find that relational instructions cause s
This research arrives as AI models, particularly large language models, are increasingly applied to interface understanding and automation, and it highlights critical limitations in their current capabilities.
Strategic readers should understand how brittle current GUI grounding models are, because that brittleness affects the reliability and trustworthiness of AI systems built for human-computer interaction and automation.
The findings challenge prevailing assumptions about AI model robustness in GUI interaction: high benchmark scores do not equate to real-world reliability, especially on tasks that require spatial reasoning.
- Companies developing more robust, spatially aware AI architectures
- Developers focused on comprehensive, perturbation-resistant AI evaluation
- Researchers exploring novel grounding techniques
- Companies deploying brittle GUI-focused AI models prematurely
- Automation platforms reliant on simple element naming rather than complex spatia
- Benchmarks that do not test for diverse scenarios and adversarial perturbations
System developers will need to adopt more rigorous testing and evaluation methodologies for GUI-interacting AI.
That shift will drive increased investment in multimodal AI research focused on advanced spatial and relational reasoning.
It could also split AI applications into two tiers: those requiring high robustness (e.g., enterprise automation) would adopt more advanced, potentially slower models, while less critical applications continue with current architectures.
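As a rough illustration of the more rigorous evaluation mentioned above, the sketch below scores one grounding model under several instruction phrasings for the same target and reports the accuracy drop in percentage points. It is not the GUI-Perturbed implementation: the `Sample` schema, the condition names `direct` and `relational`, the `predict` callable, and the toy data are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


@dataclass
class Sample:
    """One target element with several instruction phrasings (hypothetical schema)."""
    screenshot_id: str
    target_box: Box
    instructions: Dict[str, str]  # condition name -> instruction text


def hit(point: Point, box: Box) -> bool:
    """Count a prediction as correct if the click point lies inside the target box."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def accuracy_by_condition(
    samples: List[Sample],
    predict: Callable[[str, str], Point],
) -> Dict[str, float]:
    """Score the same targets once per instruction condition."""
    hits: Dict[str, int] = {}
    totals: Dict[str, int] = {}
    for sample in samples:
        for condition, instruction in sample.instructions.items():
            totals[condition] = totals.get(condition, 0) + 1
            if hit(predict(sample.screenshot_id, instruction), sample.target_box):
                hits[condition] = hits.get(condition, 0) + 1
    return {c: hits.get(c, 0) / totals[c] for c in totals}


def drop_in_points(acc: Dict[str, float], baseline: str, perturbed: str) -> float:
    """Accuracy lost when moving from the baseline to the perturbed condition."""
    return 100.0 * (acc[baseline] - acc[perturbed])


# Toy data and a toy predictor (assumptions, standing in for a real model).
samples = [
    Sample("s1", (0, 0, 10, 10),
           {"direct": "Click Save", "relational": "Click the button left of Cancel"}),
    Sample("s2", (20, 20, 30, 30),
           {"direct": "Click Search", "relational": "Click the icon left of the menu"}),
]


def toy_predict(screenshot_id: str, instruction: str) -> Point:
    centers = {"s1": (5.0, 5.0), "s2": (25.0, 25.0)}
    # Simulate a model that misses one relational instruction.
    if screenshot_id == "s2" and "left of" in instruction:
        return (0.0, 0.0)
    return centers[screenshot_id]


acc = accuracy_by_condition(samples, toy_predict)
drop = drop_in_points(acc, "direct", "relational")
print(acc, drop)
```

Keeping the target fixed while only the instruction changes means any accuracy difference can be attributed to the phrasing rather than to a different screenshot or element. A fuller harness would presumably vary the visual scene in the same way.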
Read at arXiv cs.LG