CourseTimeQA: A Lecture-Video Benchmark and a Latency-Constrained Cross-Modal Fusion Method for Timestamped QA

arXiv:2512.00360v2 (announce type: replace)

Abstract: We study timestamped question answering over educational lecture videos under a single-GPU latency/memory budget. Given a natural-language query, the system retrieves relevant timestamped segments and synthesizes a grounded answer. We present CourseTimeQA (52.3 h, 902 queries across six courses) and a lightweight, latency-constrained cross-modal retriever (CrossFusion-RAG) that combines frozen encoders, a learned 512->768 vision projection, shallow query-agnostic cross-attention over ASR and frames with a temporal-consistency regularizer, and
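The fusion components named in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the abstract is cut off before it gives details, so everything below is an assumption made for illustration. In particular, the function names, the random weights standing in for a trained projection, the choice of ASR segments as attention queries over frames, the single attention head, the residual connection, and the squared-difference form of the temporal-consistency penalty are all hypothetical. "Query-agnostic" is read here as "independent of the user's question", which would let segments be fused offline.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_projection(d_in=512, d_out=768, seed=0):
    """Hypothetical vision projection; random weights stand in for learned ones."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_in)
    return [[rng.gauss(0.0, scale) for _ in range(d_out)] for _ in range(d_in)]


def cross_attend(asr, frames):
    """Single-head cross-attention: each ASR segment attends over all frames.

    Assumption: ASR embeddings act as attention queries and projected frame
    embeddings as keys and values. The user's question is not an input.
    """
    d = len(asr[0])
    fused = []
    for seg in asr:
        scores = [sum(q * k for q, k in zip(seg, f)) / math.sqrt(d) for f in frames]
        weights = softmax(scores)
        context = [sum(w * f[j] for w, f in zip(weights, frames)) for j in range(d)]
        fused.append([s + c for s, c in zip(seg, context)])  # residual connection
    return fused


def temporal_consistency(fused):
    """Assumed regularizer: mean squared difference between adjacent segments."""
    if len(fused) < 2:
        return 0.0
    d = len(fused[0])
    total = sum(
        (a - b) ** 2
        for prev, nxt in zip(fused, fused[1:])
        for a, b in zip(prev, nxt)
    )
    return total / ((len(fused) - 1) * d)


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy stand-ins for frozen-encoder outputs: 4 frames (512-d), 3 ASR segments (768-d).
    frames_512 = [[rng.gauss(0.0, 1.0) for _ in range(512)] for _ in range(4)]
    asr_768 = [[rng.gauss(0.0, 1.0) for _ in range(768)] for _ in range(3)]
    frames_768 = matmul(frames_512, init_projection())  # 512 -> 768
    fused = cross_attend(asr_768, frames_768)
    print(len(fused), len(fused[0]), temporal_consistency(fused))
```

Because nothing in this sketch depends on the question text, the fused segment embeddings could in principle be computed once per video and cached, leaving only retrieval to run at query time. Whether the paper exploits this for its latency budget is not stated in the truncated abstract.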
The proliferation of lecture videos and the demand for efficient information retrieval from multimedia content drive the need for timestamped QA systems.
This development improves access to educational content and increases its utility, potentially accelerating skill acquisition and knowledge transfer in both academic and corporate settings.
The ability to precisely extract and synthesize answers from video lectures under latency constraints makes video content more amenable to automated, query-based learning and research.
- Education technology platforms
- Students and lifelong learners
- AI researchers in multimedia QA
- Content creators using video
- Traditional manual video indexing services
Increased efficiency in information retrieval from educational video content becomes standard.
Development of more advanced AI agents capable of autonomous learning from diverse multimedia sources accelerates.
The democratization of access to specialized knowledge through AI-powered search and synthesis capabilities alters traditional educational structures and the competitive landscape for expertise.
Read at arXiv cs.CL