IsoSci: A Benchmark of Isomorphic Cross-Domain Science Problems for Evaluating Reasoning versus Knowledge Retrieval in LLMs

arXiv:2607.01431v1. Abstract: We introduce ISOSCI, a benchmark of isomorphic cross-domain science problem pairs that separates reasoning ability from domain knowledge retrieval in LLM evaluation. Each pair shares identical logical structure but requires different domain-specific knowledge, enabling controlled attribution of reasoning-mode gains. Across five model pairs spanning four model families, we find that 91.3% of reasoning-mode gains are knowledge-dependent rather than structure-invariant (63/69 gains; Wilson 95% CI [82.3%, 96.0%]), directly challenging the assumption
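The interval quoted in the abstract can be checked from the two counts it reports (63 knowledge-dependent gains out of 69). The sketch below is not the authors' code; it applies the standard Wilson score formula for a binomial proportion, and the function name `wilson_interval` is a hypothetical helper introduced here for illustration.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, trials: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (hypothetical helper)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    # Two-sided critical value of the standard normal, about 1.96 for 95%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    # The Wilson interval is centered on a value shrunk toward 0.5, not on p_hat.
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# Counts taken from the abstract: 63 of 69 reasoning-mode gains.
low, high = wilson_interval(63, 69)
print(f"{63 / 69:.1%} [{low:.1%}, {high:.1%}]")  # prints: 91.3% [82.3%, 96.0%]
```

With only 69 gains in the sample, the Wilson interval is a reasonable choice over the simpler normal-approximation interval, which behaves poorly when the proportion is close to 0 or 1.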
The proliferation of advanced LLMs necessitates more precise evaluation methods to understand their true capabilities beyond superficial performance metrics.
This benchmark helps dissect whether LLMs are truly 'reasoning' or merely retrieving information, which is critical for their development and deployment in complex tasks.
LLM evaluation protocols will need to evolve to test reasoning ability specifically, potentially shifting research focus toward architectures that improve logical processing itself.
- LLM developers focusing on reasoning architecture
- Evaluation framework developers
- AI safety researchers
- LLMs optimized primarily for knowledge retrieval
- Companies overselling 'reasoning' capabilities without rigorous proof
Further research and development will likely focus on improving the reasoning capabilities of large language models.
New architectural breakthroughs might emerge that genuinely separate reasoning from knowledge, leading to more robust and less 'hallucinatory' AI.
The application of LLMs in highly sensitive domains requiring verifiable reasoning, such as legal or medical diagnosis, could become more trustworthy.
Read at arXiv cs.CL