Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models

arXiv:2606.26079v1 (cross-listed). Abstract: Standard benchmarks for multimodal large language models (MLLMs) score each item on one canonical ordering and miss whether order-irrelevant shuffling changes the answer, a baseline reliability property called for by emerging AI evaluation guidelines. We introduce Facet-Probe, a five-facet audit (option, evidence-chunk, document-rank, image-set, and mixed-modality ordering) of 18 frontier and open-weight MLLMs. A Bayesian item-response model separates ordering noise from per-facet bias, and a same-ordering control estimates the decoder-stochast
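To make the option-ordering facet concrete, here is a minimal sketch of an order-sensitivity check. It is not the paper's Facet-Probe implementation and does not include the Bayesian item-response model or the same-ordering control. The function `option_order_flip_rate`, the `answer_fn` callable, and both toy models are hypothetical names introduced for illustration. The sketch assumes the model returns the text of the chosen option, so answers can be compared across shuffles.

```python
import random
from collections import Counter
from typing import Callable, Sequence


def option_order_flip_rate(
    question: str,
    options: Sequence[str],
    answer_fn: Callable[[str, Sequence[str]], str],
    n_shuffles: int = 20,
    seed: int = 0,
) -> float:
    """Fraction of shuffled orderings whose answer differs from the most common answer.

    answer_fn is a hypothetical stand-in for a model call; it must return the
    chosen option's text (not its letter or position).
    """
    rng = random.Random(seed)
    answers = []
    for _ in range(n_shuffles):
        shuffled = list(options)
        rng.shuffle(shuffled)  # order-irrelevant permutation of the same options
        answers.append(answer_fn(question, shuffled))
    modal_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - modal_count / len(answers)


def first_option_model(question: str, options: Sequence[str]) -> str:
    """Toy model with pure position bias: always picks whatever is listed first."""
    return options[0]


def content_model(question: str, options: Sequence[str]) -> str:
    """Toy order-invariant model: picks the alphabetically smallest option."""
    return min(options)


if __name__ == "__main__":
    opts = ["Paris", "Lyon", "Nice", "Lille"]
    print("position-biased:", option_order_flip_rate("Capital of France?", opts, first_option_model))
    print("order-invariant:", option_order_flip_rate("Capital of France?", opts, content_model))
```

A flip rate of zero means the answer never changed under shuffling. With a real, stochastic model, some flips would come from sampling alone, which is why a same-ordering baseline of the kind the abstract mentions would be needed before attributing flips to ordering.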
The rapid deployment of large language models, and the growing reliance on them, calls for robust evaluation methods to ensure they are reliable and used ethically.
This research highlights a fundamental reliability flaw in current MLLMs: reordering inputs whose order should not matter can change the answer. That undermines their trustworthiness and could lead to erratic or biased autonomous decisions in critical applications.
MLLM evaluation is likely to move beyond canonical orderings and include order sensitivity, pushing developers to build models that are robust to input order and raising the bar for foundation-model reliability.
- AI evaluation and auditing firms
- Developers of robust MLLMs
- Industries relying on reliable AI decision-making
- MLLM developers whose models perform poorly in order sensitivity tests
- Users relying on black-box MLLMs for critical applications without auditing
Increased focus on testing and mitigating input order sensitivity in MLLM development pipelines.
New architectural designs or training methodologies emerge to inherently reduce order-dependency in multimodal foundation models.
Regulatory bodies may mandate order sensitivity testing as a standard for AI model deployment in sensitive sectors, influencing procurement and compliance.
Read at arXiv cs.LG