
arXiv:2601.05366v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are increasingly deployed as agents that invoke external tools through structured function calls. While recent work reports strong tool-calling performance under standard English-centric evaluations, the robustness of tool calling under multilingual user interactions remains underexplored. In this work, we introduce MLCL, a diagnostic benchmark, and conduct a systematic evaluation of multilingual tool calling across Chinese, Hindi, and the low-resource language Igbo. Through fine-grained error analysis, we s
The rapid deployment of LLMs as agents necessitates understanding their real-world performance beyond English, making multilingual robustness a critical area of current research.
This research highlights critical limitations in LLM agent capabilities in non-English contexts, limitations that affect global AI adoption and the reliability of AI-driven automation in diverse linguistic environments.
The findings challenge the assumption that LLM agents are uniformly robust across languages, and they argue for multilingual evaluation and development in practical, global applications.
- Developers focused on multilingual LLM safety and robustness
- Companies operating in non-English speaking markets
- Research institutions specializing in linguistic diversity for AI
- LLM providers with English-centric training and evaluation pipelines
- Businesses deploying LLM agents without multilingual testing
- Early adopters in non-English regions relying on untested solutions
In the near term, attention will likely shift to building and evaluating multilingual capabilities for LLM tool calling.
This will lead to the development of new benchmarks and datasets tailored to measure robustness across a wider array of languages, including low-resource ones.
Ultimately, the findings could drive a bifurcation in the AI market, with specialized multilingual LLM agents gaining prominence over generalist, English-first models for global enterprise applications.
Read at arXiv cs.LG