Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling

arXiv — cs.CV · Friday, November 14, 2025 at 5:00:00 AM
  • Multiple-choice questions are a dominant format for multimodal reasoning benchmarks, and outcome-reward RL is a common way to train MLLMs on them.
  • The paradigm faces a significant yet often overlooked obstacle: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning. The authors propose Self-Consistency Sampling to address this (see the sketch after this list).
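
The snippet only names Self-Consistency Sampling without describing its mechanics, so the following is a minimal sketch of the problem it targets, not the paper's method: under an outcome-only reward, a lucky guess after faulty reasoning scores the same as faithful reasoning, and one hedged way to penalize such guesses is to down-weight the reward by agreement across resampled continuations. The `resample_answers` and `consistency_adjusted_reward` helpers below are hypothetical illustrations, assuming a multiple-choice setting with options A–D.

```python
# Illustrative sketch only -- not the paper's implementation.
from dataclasses import dataclass
import random

@dataclass
class Trajectory:
    chain_of_thought: str
    answer: str  # chosen option, e.g. "B"

def outcome_reward(traj: Trajectory, gold: str) -> float:
    """Outcome-only reward: 1 if the final option matches the gold answer, else 0.
    A guess that lands on the right option gets full credit regardless of the CoT."""
    return 1.0 if traj.answer == gold else 0.0

def resample_answers(traj: Trajectory, n: int = 4) -> list[str]:
    """Hypothetical resampler: re-generate answers from the same prompt/trajectory
    n times and collect them (stubbed here with random option picks)."""
    return [random.choice("ABCD") for _ in range(n)]

def consistency_adjusted_reward(traj: Trajectory, gold: str, n: int = 4) -> float:
    """Scale the outcome reward by how often resampled continuations agree with the
    original answer; an unfaithful lucky guess tends to be inconsistent and is
    therefore down-weighted."""
    base = outcome_reward(traj, gold)
    if base == 0.0:
        return 0.0
    agreements = sum(a == traj.answer for a in resample_answers(traj, n))
    return base * agreements / n

if __name__ == "__main__":
    lucky_guess = Trajectory("faulty reasoning...", answer="B")
    print(consistency_adjusted_reward(lucky_guess, gold="B"))
```

Whether the paper implements consistency via answer resampling, trajectory truncation, or another signal is not stated in this excerpt; the sketch only conveys why a consistency term can separate genuine reasoning from lucky guesses under an outcome-based reward.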
— via World Pulse Now AI Editorial System
