
arXiv:2606.13218v1. Abstract: Arabic and Hebrew, as closely related Semitic languages, share a substantial lexicon of true cognates, misleading false friends, and modern loanwords. This overlap poses a challenge for cross-lingual semantic understanding in large language models (LLMs). To evaluate this capability, we introduce SemCog Bench, a curated benchmark of 1,858 Arabic–Hebrew word pairs with sentence-level annotations for cognate identification and semantic disambiguation. We evaluate open-source and commercial LLMs across multiple input representations (raw, diacritiz
The proliferation of LLMs and their deployment in diverse linguistic contexts necessitates robust evaluation, particularly where languages pose unique semantic challenges.
This research highlights a limitation of current LLMs: weak cross-lingual semantic understanding between closely related but distinct languages, which reduces their usefulness in linguistically complex regions.
The introduction of SemCog Bench provides a new, specific evaluation tool for LLM semantic capabilities in Arabic-Hebrew, revealing weaknesses that require further model development.
- AI researchers in linguistics
- Developers focused on regional AI applications
- Users of specialized translation/NLP services
- Generic LLMs lacking nuanced cross-lingual understanding
- Companies relying on superficial multilingual AI capabilities
Identification of specific failure points in LLMs for Arabic-Hebrew language pairs.
Development of new LLM architectures or fine-tuning techniques to address these cross-lingual semantic challenges.
Enhanced, more reliable AI applications for communication and information processing in the Middle East.
Read at arXiv cs.CL