Can MLLMs Critique Like Humans? Evaluating Open-Ended Aesthetic Reasoning in Multimodal Large Language Models

arXiv:2606.29689v1. Abstract: Open-ended aesthetic critique is a challenge for multimodal large language models (MLLMs): unlike multiple-choice aesthetic benchmarks, it has no single correct answer, and most aesthetic evaluation has measured models against numeric scores rather than the written critiques people actually give. We evaluate MLLM critiques against ranked human references and ask whether they are close to human ones. Using the Reddit Photo Critique Dataset, we score five open-weight MLLMs against multiple ranked human critiques per photo with reference-based simil
As multimodal LLMs are deployed more widely, it matters how well they handle complex, subjective tasks such as aesthetic reasoning, which simple benchmarks do not capture.
Evaluating MLLMs on open-ended aesthetic critique, rather than on numeric scores alone, shows how close their output is to the written critiques people actually give and whether their 'understanding' goes beyond rote memorization.
The focus on open-ended critique moves evaluation away from single-answer and numeric metrics toward comparison with ranked human critiques, which is closer to how people actually judge photos.
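The setup the abstract describes, scoring one model critique against several ranked human critiques of the same photo, can be sketched as follows. This is a minimal illustration and not the paper's method: the unigram-overlap F1 metric, the 1/rank weighting, and the function names `token_f1` and `score_against_ranked_refs` are all assumptions introduced here, since the abstract excerpt does not say which similarity measures or aggregation the authors use.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two texts (assumed stand-in similarity metric)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared token at its minimum frequency.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def score_against_ranked_refs(candidate: str, ranked_refs: list[str]) -> dict[str, float]:
    """Score one model critique against human critiques ordered best-first."""
    if not ranked_refs:
        raise ValueError("need at least one reference critique")
    sims = [token_f1(candidate, ref) for ref in ranked_refs]
    # Hypothetical weighting: the top-ranked human critique counts most (1/rank).
    weights = [1.0 / rank for rank in range(1, len(sims) + 1)]
    weighted = sum(w * s for w, s in zip(weights, sims)) / sum(weights)
    return {
        "max": max(sims),
        "mean": sum(sims) / len(sims),
        "rank_weighted": weighted,
    }
```

Reporting the maximum, the mean, and a rank-weighted average side by side separates a critique that matches one human reference closely from one that is moderately similar to all of them. With open-ended critique there is no single correct answer, so these are different outcomes.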
- Multimodal LLM developers
- Generative AI content creators
- Creative industries
- MLLMs with poor aesthetic reasoning
- Aesthetic evaluation solely based on quantitative metrics
Improved MLLMs could support more sophisticated content generation and critique in creative fields.
The ability to generate human-like critiques could lead to MLLMs becoming valuable co-creators or even primary critics in artistic and design processes.
This could fundamentally alter workflows in creative industries, blurring lines between human and AI artistic evaluation and output.
Read at arXiv cs.CL