Aligning MLLM Benchmark With Human Preferences via Structural Equation Modeling

arXiv — cs.CL•Friday, November 14, 2025 at 5:00:00 AM

arXiv:2506.21572v2 (announce type: replace)

Abstract: Evaluating multimodal large language models (MLLMs) is fundamentally challenged by the absence of structured, interpretable, and theoretically grounded benchmarks. Current heuristically grouped tasks have vague cognitive targets, overlapping abilities, redundant indicators, and weak diagnostic power. We therefore propose a framework aligned with structural equation modeling (SEM) that quantifies internal validity, dimensional separability, and component contributions, and we introduce a Piaget-inspired capability hierarchy that stratifies MLLM abilities into Perception, Memory, and Reasoning. Reorganizing existing tasks under this theory, we build the GOLD benchmark, whose experiments show superior interpretability, lower indicator redundancy, and clearer cognitive consistency than prior benchmarks.

— via World Pulse Now AI Editorial System
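To make the idea of checking whether a group of benchmark tasks measures one shared ability more concrete, here is a minimal sketch that computes Cronbach's alpha for a single task group. This is a standard psychometric reliability statistic used here only as an illustration; it is not the SEM procedure from the paper, and the function name `cronbach_alpha` and the score table are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for one task group.

    scores: list of rows, one row per model; each row holds that
    model's score on each of the k tasks in the group.
    """
    if len(scores) < 2:
        raise ValueError("need at least two models")
    k = len(scores[0])
    if k < 2:
        raise ValueError("need at least two tasks")
    # Sample variance of each task's scores across models.
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of each model's summed score over the group.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("summed scores have zero variance")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical accuracy scores: 4 models x 3 tasks grouped under "Perception".
perception_scores = [
    [0.62, 0.58, 0.66],
    [0.71, 0.69, 0.75],
    [0.80, 0.77, 0.83],
    [0.55, 0.50, 0.60],
]
print(f"alpha = {cronbach_alpha(perception_scores):.3f}")
```

Under this assumption, a value near 1 suggests the tasks in the group rank models consistently, while a low value hints that the group mixes different abilities. A full SEM analysis would go further and estimate loadings and cross-dimension relationships.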
