Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

arXiv:2605.01284v2 Announce Type: replace-cross Abstract: Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi-hop questions by progressively retrieving and reasoning over external documents. However, current systems predominantly operate on parsed text, which creates two critical bottlenecks: (1) coarse-grained attribution, where users are burdened with manually locating evidence within lengthy documents based on vague text-level citations; and (2) visual semantic loss, where the conversion of visually rich documents
The increasing complexity of multimodal AI and RAG systems necessitates more precise attribution methods to improve user trust and system accuracy.
This development addresses critical limitations in how AI systems interpret and present visual information, enhancing the reliability and utility of conversational AI.
The paper targets pixel-level attribution for visual evidence, aiming to move beyond coarse text-level citations and to reduce "visual semantic loss".
- AI developers
- Users of RAG systems
- Generative AI platforms
- Systems with coarse-grained attribution
- Text-only RAG approaches
Improved accuracy and trustworthiness of information retrieved and generated by AI systems, especially those dealing with visual data.
Accelerated development of AI multimodal understanding and the integration of visual reasoning into complex problem-solving.
Potential for new applications in fields requiring precise visual evidence, such as medical diagnostics, legal discovery, and engineering design.
Read at arXiv cs.CL