From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

arXiv:2604.04948v2. Abstract: Retrieval-Augmented Generation (RAG) systems depend critically on the quality of document preprocessing, yet no prior study has evaluated PDF processing frameworks by their impact on downstream question-answering accuracy. We address this gap through a systematic comparison of four open-source PDF-to-Markdown conversion frameworks, Docling, MinerU, Marker, and DeepSeek OCR, across 21 pipeline configurations, varying the conversion tool, cleaning transformations, splitting strategy, and metadata enrichment. Evaluation was performed using
The proliferation of Retrieval-Augmented Generation (RAG) systems highlights the acute need for effective data preparation, especially from unstructured documents like PDFs, making this evaluation timely.
The performance of RAG systems is fundamentally bottlenecked by the quality of input data; advancements in PDF-to-RAG conversion directly improve AI system reliability and expand deployable use cases.
This research provides a framework for evaluating and selecting optimal document conversion methods, enabling better decision-making for RAG system development and deployment.
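The abstract names four dimensions that vary between pipeline configurations: conversion tool, cleaning transformations, splitting strategy, and metadata enrichment. A minimal Python sketch of how such a configuration space and two splitting strategies might be represented follows. Only the four converter names come from the abstract; the option values, `PipelineConfig`, `all_configs`, `split_by_heading`, and `split_fixed_size` are hypothetical illustrations, and the full grid enumerated here is not the paper's set of 21 configurations.

```python
from dataclasses import dataclass
from itertools import product

# Converter names come from the abstract; every other option value is a
# hypothetical placeholder, not a setting reported by the paper.
CONVERTERS = ["Docling", "MinerU", "Marker", "DeepSeek OCR"]
CLEANING = [False, True]
SPLITTERS = ["fixed_size", "by_heading"]
METADATA = [False, True]


@dataclass(frozen=True)
class PipelineConfig:
    converter: str
    cleaning: bool
    splitter: str
    metadata: bool


def all_configs():
    """Enumerate the full grid of the hypothetical option values above."""
    return [PipelineConfig(*combo)
            for combo in product(CONVERTERS, CLEANING, SPLITTERS, METADATA)]


def split_by_heading(markdown: str) -> list[str]:
    """Start a new chunk at every line that begins with '#'."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


def split_fixed_size(markdown: str, size: int = 200) -> list[str]:
    """Cut the text into consecutive windows of at most `size` characters."""
    return [markdown[i:i + size] for i in range(0, len(markdown), size)]
```

The heading-based splitter treats any line starting with `#` as a section boundary, so it would also split on such lines inside fenced code; a real pipeline would need a proper Markdown parser. Comparing chunks from both splitters on the same converted document is one simple way to see how the splitting choice changes what a retriever can return.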
- AI developers
- Enterprises adopting RAG
- Open-source PDF conversion tools
- Data scientists
- Organizations with poor data handling practices
- Inefficient document processing vendors
Improved accuracy and efficiency of RAG-based question answering systems across various domains.
Accelerated adoption of RAG in industries traditionally reliant on large archives of PDF documents.
The emergence of new startups specializing in highly optimized, domain-specific document pre-processing for AI applications.
Source: arXiv cs.LG