Comparing Large Language Models on Scrum Certification-Style Questions: Accuracy, Stability, and Error Patterns

arXiv:2607.00048v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly used in exam- and certification-style question answering tasks, where their ability to retrieve, interpret, and apply domain-specific knowledge can be systematically assessed. In Software Engineering, such settings are particularly relevant when questions depend on strict adherence to normative definitions, roles, artifacts, and rules. This paper evaluates the performance of three contemporary LLMs, GPT-5 mini, Gemini 3 Flash, and DeepSeek Chat 3.2, in answering 993 Scrum
As advanced LLMs proliferate, their domain-specific capabilities need systematic evaluation, particularly in certification-driven and regulated fields.
Strong LLM performance on precise, normative domain questions would indicate readiness for professional applications, with direct consequences for white-collar workflows.
Benchmarking LLMs explicitly against professional certification standards gives clearer guidance on deploying them in fields that require strict adherence to defined practices.
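To make the terms in the paper's title concrete, here is a minimal sketch of how accuracy, stability, and one simple error pattern could be scored for multiple-choice answers collected over repeated runs. It is an illustration only: the metric definitions, function names, question ids, and toy answers below are assumptions, not the paper's actual method or data.

```python
# Hypothetical scoring sketch; definitions here are assumptions,
# not taken from the paper.

def accuracy(answers, key):
    """Fraction of questions in the key that were answered correctly."""
    return sum(answers[q] == key[q] for q in key) / len(key)


def stability(runs):
    """Fraction of questions that received the same answer in every run."""
    questions = list(runs[0])
    stable = sum(len({run[q] for run in runs}) == 1 for q in questions)
    return stable / len(questions)


def stable_but_wrong(runs, key):
    """Questions answered identically in every run, yet incorrectly."""
    return [
        q for q in key
        if len({run[q] for run in runs}) == 1 and runs[0][q] != key[q]
    ]


# Toy answer key and three repeated runs of one model (made-up data).
key = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
runs = [
    {"q1": "A", "q2": "C", "q3": "B", "q4": "A"},
    {"q1": "A", "q2": "C", "q3": "D", "q4": "A"},
    {"q1": "A", "q2": "B", "q3": "B", "q4": "A"},
]

per_run_accuracy = [accuracy(run, key) for run in runs]
print(per_run_accuracy)             # [0.75, 0.5, 0.5]
print(stability(runs))              # 0.5
print(stable_but_wrong(runs, key))  # ['q4']
```

Separating the two measures matters because a model can be perfectly consistent on a question and still wrong every time, as with `q4` in the toy data. Such cases point to a systematic misreading of a rule rather than random variation.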
- AI developers
- Certification bodies adopting AI tools
- Software engineering consultancies
- Traditional human-only certification processes
- Educational institutions focused solely on rote learning
- Outdated assessment methodologies
LLMs demonstrate increasingly sophisticated understanding of complex domain-specific knowledge required for professional certifications.
This performance could drive faster adoption of AI tools within professional services and education, bringing efficiency gains as well as workforce displacement.
The definition of 'certified professional' evolves to include AI-assisted roles, potentially lowering barriers to entry in some fields but creating new demands for AI literacy.
Read at arXiv cs.AI