Anatomy-Guided Vision-Language Learning with Angular Prototype Separation for Multi-Label Video Capsule Endoscopy Classification Under Class Imbalance

arXiv:2603.17879v2. Abstract: This work presents a multi-label temporal event detection framework for video capsule endoscopy (VCE) that addresses the extreme class imbalance inherent in the Galar dataset by combining two principal contributions: an Angular Separation Loss on class prototypes and a Biological State Machine temporal decoder. The backbone remains BiomedCLIP, a biomedical vision-language foundation model. Three consecutive frames are fused through a Local Differencing Attention module that amplifies transient pathological signals by suppressing static
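For illustration only: the abstract names an Angular Separation Loss on class prototypes but does not give its formula. One plausible reading, assumed here and not confirmed by the source, is a hinge penalty on every pair of class prototypes whose angle falls below a minimum. The function names, the `min_angle_deg` parameter, and the plain-list vectors are hypothetical choices for this sketch.

```python
import math
from itertools import combinations


def l2_normalize(vector):
    """Scale a vector to unit length so dot products equal cosines."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("prototype must be non-zero")
    return [x / norm for x in vector]


def angular_separation_loss(prototypes, min_angle_deg=90.0):
    """Mean hinge penalty over prototype pairs closer than min_angle_deg.

    ASSUMPTION: this hinge-on-cosine form is a generic illustration,
    not the loss defined in the paper.
    """
    if len(prototypes) < 2:
        return 0.0
    unit = [l2_normalize(p) for p in prototypes]
    # Pairs with cosine above this threshold are "too close" in angle.
    max_cos = math.cos(math.radians(min_angle_deg))
    penalties = []
    for a, b in combinations(unit, 2):
        cos_ab = sum(x * y for x, y in zip(a, b))
        penalties.append(max(0.0, cos_ab - max_cos))
    return sum(penalties) / len(penalties)


if __name__ == "__main__":
    # Two orthogonal prototypes plus one lying between them.
    protos = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(angular_separation_loss(protos))
```

Under this reading, minimizing the penalty pushes prototypes of different classes apart on the unit sphere, which is one common way to keep rare-class prototypes from collapsing onto frequent ones; whether the paper does this is not stated in the excerpt.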
Continued progress in AI and vision-language models enables more sophisticated and robust diagnostic tools for medical applications.
Such progress can improve the accuracy and efficiency of disease detection in medical imaging, particularly in challenging settings such as video capsule endoscopy.
The ability to accurately classify complex medical conditions from video data, even with class imbalance, suggests a path toward more autonomous and reliable diagnostic systems.
- Med-tech companies
- Gastroenterologists
- AI healthcare researchers
- Patients with digestive disorders
- Traditional manual diagnostic processes
- Developers of less robust medical AI models
Improved early detection rates for various conditions identified via video capsule endoscopy.
Reduced healthcare costs through more efficient and accurate diagnostic workflows, potentially expanding access to screenings.
Integrating such AI systems could lead to broader automation of diagnostic tasks, shifting the role of human specialists toward oversight and complex case management.