LLM-as-a-Reviewer: Benchmarking Their Ability, Divergence, and Prompt Injection Resistance as Paper Reviewers

arXiv:2605.25415v1 (announce type: new)

Abstract: Large language models (LLMs) are increasingly used in academic peer review, yet their reliability, alignment with human judgment, and robustness to adversarial attacks remain poorly understood. We present a systematic benchmark of LLM-as-a-Reviewer on 898 papers stratified from NeurIPS and ICLR, evaluating 12 LLMs along three axes: rating calibration, divergence from human reviewers, and resistance to prompt injection embedded via an invisible font-mapping attack. We find that LLMs systematically overrate weaker submissions and diverge from human
As LLMs are integrated into academic workflows, including peer review, their efficacy and vulnerabilities need rigorous evaluation.
This study provides data on the reliability and limitations of LLMs as reviewers, which bears directly on the integrity and efficiency of academic publishing and on future AI agent design.
Its findings can inform the responsible development and deployment of LLM-based tools in critical human-supervised tasks, and they highlight where trustworthy AI still needs improvement.
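To make the abstract's first two axes concrete, here is a minimal sketch of how rating calibration and divergence from human reviewers could be measured. It is an illustration under stated assumptions, not the paper's actual metrics: the function names (`signed_bias`, `mean_abs_divergence`, `bias_by_tier`), the `threshold` parameter, and the scores are all hypothetical.

```python
from statistics import mean

def signed_bias(llm_scores, human_scores):
    """Mean of (LLM - human). A positive value means the LLM rates higher."""
    return mean([l - h for l, h in zip(llm_scores, human_scores)])

def mean_abs_divergence(llm_scores, human_scores):
    """Mean absolute gap between LLM and human scores, ignoring direction."""
    return mean([abs(l - h) for l, h in zip(llm_scores, human_scores)])

def bias_by_tier(llm_scores, human_scores, threshold):
    """Signed bias computed separately for papers humans scored below
    `threshold` ("weaker") and at or above it ("stronger").
    The threshold is an assumed cutoff, not one taken from the paper."""
    pairs = list(zip(llm_scores, human_scores))
    weaker = [l - h for l, h in pairs if h < threshold]
    stronger = [l - h for l, h in pairs if h >= threshold]
    return {
        "weaker": mean(weaker) if weaker else None,
        "stronger": mean(stronger) if stronger else None,
    }

# Made-up scores on an assumed 1-10 scale, for illustration only.
human = [3, 4, 5, 6, 7, 8]
llm = [6, 6, 6, 7, 7, 8]

print(signed_bias(llm, human))
print(mean_abs_divergence(llm, human))
print(bias_by_tier(llm, human, threshold=6))
```

With this toy data, the bias on the weaker tier is larger than on the stronger tier, which is the pattern the abstract describes as overrating weaker submissions. Splitting by tier matters because a single average can hide a bias that only affects one end of the scale.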
- AI ethics researchers
- Academic publishers leveraging nuanced AI
- Developers of robust LLM evaluation systems
- Developers of uncurated LLM review tools
- Academic integrity if findings are ignored
Increased scrutiny and demand for more robust, less biased LLM review systems or augmented human review processes.
Development of specialized LLMs trained explicitly for peer review, incorporating ethical guidelines and human-aligned judgment benchmarks.
A broader shift in how academia defines and maintains quality control, with human-AI collaboration becoming a standard, ethically codified process.
Read at arXiv cs.CL