SIGNAL · AI · Jun 30, 2026, 4:00 AM · Signal 75 · Short term

Can MLLMs Critique Like Humans? Evaluating Open-Ended Aesthetic Reasoning in Multimodal Large Language Models

arXiv:2606.29689v1 · Announce Type: new

Abstract: Open-ended aesthetic critique is a challenge for multimodal large language models (MLLMs): unlike multiple-choice aesthetic benchmarks, it has no single correct answer, and most aesthetic evaluation has measured models against numeric scores rather than the written critiques people actually give. We evaluate MLLM critiques against ranked human references and ask whether they are close to human ones. Using the Reddit Photo Critique Dataset, we score five open-weight MLLMs against multiple ranked human critiques per photo with reference-based simil
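The abstract is cut off before it names the similarity measure, so the following is only a rough sketch of what scoring one model critique against several ranked human critiques could look like. Token-overlap F1 stands in for the unnamed reference-based metric, and the reciprocal-rank weighting is an assumption rather than the paper's method. The function names `token_f1` and `score_against_ranked` are hypothetical.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two texts (a stand-in similarity metric)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def score_against_ranked(candidate: str, ranked_references: list[str]) -> tuple[float, float]:
    """Compare one critique with human critiques ordered best-first.

    Returns (best_match, rank_weighted_mean). The 1/rank weights are an
    assumed way to let higher-ranked human critiques count for more.
    """
    if not ranked_references:
        raise ValueError("at least one reference critique is required")
    sims = [token_f1(candidate, ref) for ref in ranked_references]
    weights = [1.0 / (rank + 1) for rank in range(len(sims))]
    weighted_mean = sum(w * s for w, s in zip(weights, sims)) / sum(weights)
    return max(sims), weighted_mean


if __name__ == "__main__":
    # Made-up critiques for illustration only; not from the dataset.
    model_critique = "the horizon is tilted and the subject is underexposed"
    human_critiques = [
        "the subject is underexposed so lift the shadows",
        "nice colors but the horizon is tilted",
    ]
    print(score_against_ranked(model_critique, human_critiques))
```

Reporting both the best match and a rank-weighted mean reflects the point that open-ended critique has no single correct answer: a model critique can be close to one good human critique without resembling all of them.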

Why this matters

Why now

Multimodal LLMs are advancing and being deployed quickly, so their performance on subjective tasks such as aesthetic reasoning needs to be tested beyond multiple-choice benchmarks.

Why it’s important

Evaluating MLLMs on open-ended aesthetic critique, rather than on numeric scores alone, shows whether they can give nuanced, human-like feedback and whether their 'understanding' goes beyond rote memorization.

What changes

The focus on open-ended, human-like critique moves evaluation away from numeric aesthetic scores and toward comparison with the written critiques people actually give.

Winners

· Multimodal LLM developers
· Generative AI content creators
· Creative industries

Losers

· MLLMs with poor aesthetic reasoning
· Aesthetic evaluation based solely on quantitative metrics

Second-order effects

Direct

Improved MLLMs will be capable of more sophisticated content generation and critique in creative fields.

Second

The ability to generate human-like critiques could lead to MLLMs becoming valuable co-creators or even primary critics in artistic and design processes.

Third

This could fundamentally alter workflows in creative industries, blurring lines between human and AI artistic evaluation and output.

Editorial confidence: 90 / 100 · Structural impact: 60 / 100

Original report

This signal links to a primary source. Continuum Brief monitors and indexes it as part of the live intelligence stream — we do not republish source content.

Read at arXiv cs.CL

#cs.CL

Tracked by The Continuum Brief · live intelligence network