PersianMedQA: Evaluating Large Language Models on a Persian-English Bilingual Medical Question Answering Benchmark

arXiv:2506.00250v4 Abstract: Large Language Models (LLMs) have achieved remarkable performance on a wide range of Natural Language Processing (NLP) benchmarks, often surpassing human-level accuracy. However, their reliability in high-stakes domains such as medicine, particularly in low-resource languages, remains underexplored. In this work, we introduce PersianMedQA, a large-scale dataset of 20,785 expert-validated multiple-choice Persian medical questions from 14 years of Iranian national medical exams, spanning 23 medical specialties and designed to evaluate LLMs in b
The proliferation of powerful LLMs and growing concerns about their performance in specific, high-stakes, and linguistically diverse contexts make this evaluation timely and crucial.
This benchmark highlights the critical need for regional and language-specific AI development to ensure reliable LLM deployment in vital sectors such as healthcare, especially for languages other than the dominant ones.
The availability of a robust, expert-validated, bilingual medical dataset for Persian will accelerate the development and refinement of LLMs for low-resource languages in critical applications.
- Iranian AI developers
- Persian-speaking medical professionals
- Patients in Iran
- LLM developers focused on multilingual capabilities
- LLMs with poor multilingual medical reasoning
- Monolingual AI solutions
The new dataset enables better evaluation and training of LLMs for medical use in Persian and other low-resource languages.
Improved clinical decision support systems and medical information access become possible for Persian speakers.
This could foster sovereign AI initiatives in Iran, reducing reliance on foreign-developed medical AI, and potentially inspiring similar efforts in other countries with low-resource languages.
Read at arXiv cs.CL