Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling
In the multiple-choice setting, a dominant format for multimodal reasoning benchmarks, outcome reward-based RL training of MLLMs faces a significant yet often overlooked obstacle: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning. We propose Self-Consistency Sampling …
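The obstacle can be illustrated with a minimal sketch of an outcome-only reward. This is not the paper's implementation: the `Trajectory` type, the `outcome_reward` function, and the example trajectories are hypothetical names and data chosen purely for illustration. The sketch assumes the reward is 1 when the final option matches the gold answer and 0 otherwise.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical container for one sampled model response."""
    reasoning: str  # chain of thought; never inspected by the outcome reward
    answer: str     # final multiple-choice option, e.g. "B"


def outcome_reward(traj: Trajectory, gold: str) -> float:
    """Assumed outcome-only reward: checks the final option and nothing else."""
    return 1.0 if traj.answer.strip().upper() == gold.strip().upper() else 0.0


# Two illustrative trajectories that end on the same option.
faithful = Trajectory(
    reasoning="The chart peaks in 2019, so the answer is B.",
    answer="B",
)
lucky_guess = Trajectory(
    reasoning="The chart peaks in 2015, so the answer is B.",  # faulty step
    answer="B",
)

rewards = [outcome_reward(t, gold="B") for t in (faithful, lucky_guess)]
print(rewards)  # both trajectories receive the same reward
```

Because `outcome_reward` never reads `reasoning`, the faulty trajectory is reinforced just as strongly as the sound one, which is the failure mode described above.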
— via World Pulse Now AI Editorial System