Beyond Clean Text: Evaluating Encoder and Decoder Robustness for Bangla Event Detection in Noisy Text

arXiv:2606.30914v1. Abstract: Event detection (ED) systems are typically evaluated on clean, curated text, leaving their robustness to real-world noise largely unexplored, particularly for low-resource languages such as Bangla. We introduce a generalized Bangla news event ontology and a benchmark comprising 9,979 annotated sentences across 40 event subtypes, spanning clean news text, real-world Automatic Speech Recognition (ASR) transcripts, and orthographically corrupted text. We systematically evaluate fine-tuned encoder-only models (BanglaBERT and XLM-R) alongside instruct
As AI systems are deployed in more languages and real-world settings, they need to be evaluated on noisy input, which previous benchmarks often omitted.
Evaluating models such as event detection systems on noisy, real-world data, particularly for low-resource languages, is crucial for building AI that works in practice across languages.
The availability of a new benchmark for Bangla event detection, including noisy text, allows for more realistic assessment and improvement of language models for non-English contexts.
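To make the idea of a noisy-text evaluation concrete, here is a minimal sketch of how one might inject character-level noise into a sentence and compare event-label scores on clean versus corrupted input. It is not the benchmark's actual corruption procedure, which the abstract does not describe; the function names `corrupt` and `micro_f1`, the three edit operations, and the `rate` parameter are all assumptions made for illustration.

```python
import random


def corrupt(text: str, rate: float, seed: int = 0) -> str:
    """Randomly delete, duplicate, or swap non-space characters.

    Hypothetical noise model for illustration only. It works on Unicode
    code points, so it can split Bangla grapheme clusters; a more careful
    version would segment the text into graphemes first.
    """
    rng = random.Random(seed)  # local RNG keeps the corruption reproducible
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if not c.isspace() and rng.random() < rate:
            op = rng.choice(("delete", "duplicate", "swap"))
            if op == "delete":
                i += 1
                continue
            if op == "duplicate":
                out.extend((c, c))
                i += 1
                continue
            # swap with the next character when it exists and is not a space
            if i + 1 < len(chars) and not chars[i + 1].isspace():
                out.extend((chars[i + 1], c))
                i += 2
                continue
        out.append(c)  # unchanged (also the fallback for an impossible swap)
        i += 1
    return "".join(out)


def micro_f1(gold: list[set[str]], pred: list[set[str]]) -> float:
    """Micro-averaged F1 over per-sentence sets of event labels."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In use, one would run the same model on each sentence and on `corrupt(sentence, rate)`, score both sets of predictions with `micro_f1`, and report the difference as a rough robustness gap.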
- Bangla NLP researchers
- AI developers in emerging markets
- Multilingual AI platforms
- Low-resource language communities
- Monolingual AI development
- AI models not robust to noise
- AI evaluation based solely on clean datasets
Improved performance of event detection systems for Bangla in real-world scenarios.
Accelerated development of robust AI models for other low-resource languages facing similar noisy data challenges.
Increased adoption and utility of AI applications in diverse linguistic environments, potentially reducing digital inequality.
Read at arXiv cs.CL