FinVerBench: Benchmark Validity and Calibration in Large Language Model Financial Statement Verification

arXiv:2605.29586v1. Abstract: We introduce FinVerBench, a benchmark and validity study for financial statement verification: determining whether a set of corporate financial statements is numerically consistent from the information shown to the model. FinVerBench is built from SEC 10-K XBRL filings for 43 S&P 500 companies and defines a four-category error taxonomy covering arithmetic, cross-statement linkage, year-over-year, and magnitude perturbations. We attempt fifteen contemporary LLM evaluations and report fourteen complete runs; a Gemini 2.5 Pro run is excluded from th
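To make the idea of numerical consistency concrete, the sketch below checks two of the four error categories named in the abstract (arithmetic and cross-statement linkage) on a toy set of statements, then injects one perturbation of each kind. It is an illustration only, not FinVerBench's implementation: the `Statements` fields, the `find_inconsistencies` function, the tolerance, and all figures are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Statements:
    # Hypothetical, heavily simplified line items; real XBRL filings carry many more.
    total_assets: float
    total_liabilities: float
    total_equity: float
    net_income_income_stmt: float  # net income as reported on the income statement
    net_income_cash_flow: float    # net income as the starting line of the cash flow statement


def find_inconsistencies(s: Statements, tol: float = 0.5) -> List[str]:
    """Return the error categories detected in s (an empty list means consistent)."""
    issues = []
    # Arithmetic check: assets must equal liabilities plus equity.
    if abs(s.total_assets - (s.total_liabilities + s.total_equity)) > tol:
        issues.append("arithmetic")
    # Cross-statement linkage check: net income must agree across statements.
    if abs(s.net_income_income_stmt - s.net_income_cash_flow) > tol:
        issues.append("cross_statement_linkage")
    return issues


# A consistent toy filing (made-up numbers).
clean = Statements(1000.0, 600.0, 400.0, 50.0, 50.0)

# Inject one perturbation per category; dataclasses.replace returns a modified copy.
arithmetic_error = replace(clean, total_assets=1010.0)
linkage_error = replace(clean, net_income_cash_flow=45.0)

for name, stmts in [("clean", clean),
                    ("arithmetic", arithmetic_error),
                    ("linkage", linkage_error)]:
    print(name, find_inconsistencies(stmts))
```

A rule-based checker like this only works when the relevant identities are known in advance; the benchmark instead asks whether an LLM can judge consistency from the statements it is shown.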
As large language models spread, applying them to high-stakes financial verification is a natural next step, and that step requires robust benchmarks before the tools can be trusted or adopted in practice.
The benchmark helps quantify how reliable LLMs are on critical financial tasks, which matters for audit, compliance, and automated financial operations.
FinVerBench enables standardized evaluation of LLM performance on financial statement verification, allowing objective comparison across models and guiding the development of more accurate ones.
- AI developers
- Financial auditing firms
- Compliance software providers
- Companies relying on unverified LLM financial tools
- Traditional manual verification processes
Financial institutions begin integrating LLMs more broadly for automated verification, reducing human effort.
Improved accuracy and reliability of LLM verification tools lead to greater investor confidence and potentially faster financial reporting cycles.
The benchmark could become a de facto industry standard, accelerating the development of specialized financial LLMs and further automating regulatory compliance checks.
Read at arXiv cs.AI