Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark

arXiv:2605.29400v1. Abstract: We benchmark three supervised fine-tuned models against frontier zero-shot baselines on a 661-row held-out slice of PiSAR (Persona, intent, Screen, Action, Rationale), a 12,929-tuple corpus of screen-anchored behavioural rationales curated from public app-store reviews, Pew American Trends Panel demographics, and the OPeRA shopper traces. Every model, frontier or fine-tuned, is evaluated on the same 661-row slice with the same scoring pipeline. Two findings. First, frontier zero-shot baselines (Claude Opus 4.7 and GPT-5.5) reach sem_sim 0.459 and
As AI model capabilities advance and demand grows for robust benchmarks of agentic systems, fine-tuned models are under constant scrutiny and development.
This benchmark highlights the role of architecture-sensitive supervised fine-tuning in screen-conditioned action prediction, a task central to building AI agents that operate digital interfaces.
The research suggests that fine-tuned models, even when trained on smaller datasets, can outperform frontier zero-shot baselines on specific, complex tasks, which shifts attention toward targeted model optimization.
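The evaluation protocol described in the abstract, where every model is scored on the same held-out slice by the same pipeline, can be sketched roughly as below. This is a minimal illustration and not the paper's pipeline: the definition of sem_sim is not given in the excerpt, so the sketch assumes it is a cosine similarity between text vectors and uses a toy bag-of-words vector in place of a real sentence encoder. The row format, the `evaluate` helper, and the two example models are all hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical row format: (prompt describing screen and intent, reference text).
Row = Tuple[str, str]


def bow_vector(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real sentence encoder."""
    return Counter(text.lower().split())


def sem_sim(prediction: str, reference: str) -> float:
    """Cosine similarity between toy vectors (ASSUMED form of sem_sim)."""
    a, b = bow_vector(prediction), bow_vector(reference)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def evaluate(
    models: Dict[str, Callable[[str], str]], held_out: List[Row]
) -> Dict[str, float]:
    """Score every model on the same rows with the same metric."""
    if not held_out:
        raise ValueError("held-out slice is empty")
    scores = {}
    for name, predict in models.items():
        per_row = [sem_sim(predict(prompt), ref) for prompt, ref in held_out]
        scores[name] = sum(per_row) / len(per_row)  # mean over the slice
    return scores


# Two made-up rows standing in for the 661-row held-out slice.
held_out = [
    ("checkout screen, wants to pay", "tap the pay button"),
    ("search screen, wants shoes", "type shoes in search"),
]

# Hypothetical models: any callable from prompt to predicted text.
models = {
    "constant_guess": lambda prompt: "tap the pay button",
    "empty": lambda prompt: "",
}

results = evaluate(models, held_out)
for name, score in results.items():
    print(f"{name}: mean sem_sim = {score:.3f}")
```

Keeping the slice and the scoring function fixed while only the `predict` callable varies is what makes the per-model means comparable; swapping `bow_vector` for an embedding model would change the numbers but not the structure of the comparison.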
- AI model developers
- Enterprise software
- Generative AI platforms
- AI researchers
- General-purpose zero-shot AI models (for specific tasks)
- Companies relying solely on large, untuned models
Improved performance of AI agents capable of understanding and interacting with digital interfaces.
Accelerated development of more sophisticated AI applications that precisely interpret user intent from screen environments.
Enhanced automation of complex digital workflows, potentially disrupting traditional software and service industries.
Read at arXiv cs.AI