
arXiv:2606.13111v1 Announce Type: new Abstract: We present MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren), a holistic benchmark for evaluating large language models (LLMs) in the context of the German public sector. While LLMs are increasingly adopted in public administration, model selection remains largely ad hoc, and existing benchmarks offer limited guidance: they are predominantly English-centric, US-centric in content, and focus exclusively on task performance. MÖVE addresses these gaps by evaluating 39 models across two complementary dimensions. Performance criteria co
The increasing adoption of LLMs in public administration necessitates specialized benchmarks to ensure appropriate model selection and performance for specific governmental contexts.
This benchmark addresses the critical need for localized, culturally relevant, and domain-specific evaluation of AI models, preventing reliance on generic, often Anglo-centric, benchmarks.
Model selection for public sector AI deployments in Germany could become more systematic, favoring models that perform well on German-language, legal, and administrative tasks.
- German public sector
- LLM developers specializing in German language and localized data
- Consultancies supporting AI deployment in Germany
- Generic, English-centric LLMs
- US-centric AI benchmark developers
Increased adoption of LLMs tailored for the German public sector, improving administrative efficiency and service delivery.
Other nations or blocs will likely follow suit, developing their own localized AI benchmarks, further diversifying the AI development landscape.
This could lead to a fragmentation of the global AI market, with distinct regional ecosystems of models, data, and evaluation standards emerging for sensitive sectors like government.
Read at arXiv cs.CL