First SFT, Second RL, Third UPT: Continual Improving Multi-Modal LLM Reasoning via Unsupervised Post-Training

arXiv — cs.CV · Tuesday, October 28, 2025 at 4:00:00 AM
A recent study proposes unsupervised post-training (UPT) as a third stage for improving multi-modal large language models (MLLMs), after supervised fine-tuning (SFT) and reinforcement learning (RL). Both established stages depend on costly, labor-intensive labeled data, which makes continued scaling hard to sustain. By letting a model keep improving its reasoning from unlabeled data, UPT offers a cheaper and more accessible path to continual improvement as demand for capable AI models grows.
— via World Pulse Now AI Editorial System
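For readers who want the shape of the idea: one common recipe for unsupervised post-training has the model generate several candidate answers per unlabeled prompt and treat the majority answer as a pseudo-label. The sketch below illustrates that recipe only; the `model.sample`/`model.update` interfaces are placeholders, and the paper's actual objective may differ.

```python
from collections import Counter

# Hypothetical stage-3 (UPT) loop over unlabeled prompts. `model.sample` and
# `model.update` are placeholder interfaces; majority-vote pseudo-labeling is
# one common self-supervision recipe, not necessarily this paper's objective.
def upt_step(model, prompts, k=8):
    batch = []
    for prompt in prompts:
        answers = [model.sample(prompt) for _ in range(k)]  # k rollouts per prompt
        pseudo, votes = Counter(answers).most_common(1)[0]  # majority answer
        if votes / k >= 0.5:                                # keep confident prompts only
            batch.append((prompt, pseudo))
    model.update(batch)  # e.g., SFT-style loss or agreement-based reward
    return len(batch)
```

The appeal of such recipes is that the supervision signal (self-consistency) comes for free, so the loop can run on whatever unlabeled prompts are available.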


Recommended Readings
Mind the Gap: Evaluating LLM Understanding of Human-Taught Road Safety Principles
Negative · Artificial Intelligence
This study evaluates how well multi-modal large language models (MLLMs) understand road safety principles, particularly in the context of autonomous vehicles. Using a curated dataset of traffic signs and safety norms drawn from school textbooks, the research finds that these models struggle with safety reasoning, revealing significant gaps between human learning and model interpretation. The findings suggest further research is needed to address these performance gaps in AI systems governing autonomous vehicles.
Potent but Stealthy: Rethink Profile Pollution against Sequential Recommendation via Bi-level Constrained Reinforcement Paradigm
Positive · Artificial Intelligence
The paper titled 'Potent but Stealthy: Rethink Profile Pollution against Sequential Recommendation via Bi-level Constrained Reinforcement Paradigm' addresses vulnerabilities in sequential recommenders, particularly to adversarial attacks. It highlights the Profile Pollution Attack (PPA), which subtly contaminates user interactions to induce mispredictions. The authors propose a new method called CREAT, which combines bi-level optimization with reinforcement learning to enhance the stealthiness and effectiveness of such attacks, overcoming limitations of previous methods.
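As a rough intuition for profile pollution, the sketch below greedily appends a few items to a user's history to push a frozen recommender toward a target prediction under an edit budget. The greedy search merely stands in for CREAT's bi-level constrained RL policy, and `rec_score(history, item)` is an assumed black-box recommender interface.

```python
# Illustrative profile-pollution sketch: append a small number of items to a
# user's history so a frozen recommender shifts toward a target item, with the
# edit budget acting as a crude stealth constraint. Greedy search stands in
# for the paper's bi-level constrained RL; `rec_score` is an assumed API.
def pollute(history, target_item, candidate_items, rec_score, budget=3):
    polluted = list(history)
    for _ in range(budget):                       # outer level: attacker edits
        best = max(candidate_items,
                   key=lambda c: rec_score(polluted + [c], target_item))
        polluted.append(best)                     # inner level: recommender reacts
    return polluted
```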
LDC: Learning to Generate Research Idea with Dynamic Control
Positive · Artificial Intelligence
Recent advancements in large language models (LLMs) highlight their potential in automating scientific research ideation. Current methods often produce ideas that do not meet expert standards of novelty, feasibility, and effectiveness. To address these issues, a new framework is proposed that combines Supervised Fine-Tuning (SFT) and controllable Reinforcement Learning (RL) to enhance the quality of generated research ideas through a two-stage approach.
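A minimal skeleton of that two-stage recipe, assuming placeholder `model` and scorer interfaces: stage one imitates expert-written ideas, stage two optimizes a steerable weighted mix of novelty, feasibility, and effectiveness rewards.

```python
# Sketch of SFT followed by controllable RL. All interfaces here are assumed
# placeholders for illustration, not the paper's actual API; the weight dict
# is what makes the reward "controllable".
def train(model, idea_corpus, rl_prompts, scorers, weights):
    model.sft(idea_corpus)                        # stage 1: imitate expert ideas
    for prompt in rl_prompts:                     # stage 2: RL with dynamic control
        idea = model.sample(prompt)
        reward = sum(w * scorers[name](idea)      # steerable multi-criteria reward
                     for name, w in weights.items())
        model.rl_update(prompt, idea, reward)     # e.g., a PPO/GRPO-style step
```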
Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models
Neutral · Artificial Intelligence
The paper titled 'Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models' explores the underutilized potential of Multi-modal Large Language Models (MLLMs) in Document Image Quality Assessment (DIQA). It introduces a three-tiered evaluation framework that assesses MLLMs' capabilities at coarse, middle, and fine granularity levels. The study reveals that while MLLMs show early DIQA abilities, they face significant limitations, including inconsistent scoring and distortion misidentification.
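To make the three granularity levels concrete, here is a hypothetical probe harness; the prompts and the `ask(image, prompt)` MLLM interface are illustrative assumptions, not the benchmark's actual queries.

```python
# Minimal three-tier DIQA probe (coarse / middle / fine), loosely following
# Q-Doc's granularity levels. Prompts and the `ask` interface are assumed.
TIERS = {
    "coarse": "Rate the overall quality of this document image from 1 to 5.",
    "middle": "Which distortion affects this image: blur, noise, or compression?",
    "fine":   "Rate the severity of the blur in this image from 1 to 5.",
}

def probe(ask, image):
    return {tier: ask(image, prompt) for tier, prompt in TIERS.items()}
```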
Bridging Hidden States in Vision-Language Models
Positive · Artificial Intelligence
Vision-Language Models (VLMs) integrate visual content with natural language. Current methods typically fuse the two modalities either early in the encoding process or late, through pooled embeddings. This paper introduces a lightweight fusion module that uses cross-only, bidirectional attention layers to align hidden states from both modalities, enhancing understanding while keeping the encoders non-causal. The method aims to improve VLM performance by leveraging the inherent structure of visual and textual data.
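A minimal PyTorch sketch of such a fusion block, with dimensions and residual layout assumed since the summary does not specify them: each modality attends only to the other, with no self-attention and no causal masking.

```python
import torch
import torch.nn as nn

# Sketch of a cross-only, bidirectional fusion block: text queries attend to
# vision states and vice versa. Hidden size, head count, and the pre-norm
# residual layout are assumptions, not the paper's exact architecture.
class CrossOnlyFusion(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.v2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor):
        txt_out, _ = self.v2t(self.norm_t(txt), vis, vis)  # text attends to vision
        vis_out, _ = self.t2v(self.norm_v(vis), txt, txt)  # vision attends to text
        return vis + vis_out, txt + txt_out                # residual connections

# usage: fuse hidden states of shape (batch, seq, dim) from each encoder
# v, t = CrossOnlyFusion(768)(torch.randn(2, 49, 768), torch.randn(2, 16, 768))
```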
Bias-Restrained Prefix Representation Finetuning for Mathematical Reasoning
Positive · Artificial Intelligence
The paper titled 'Bias-Restrained Prefix Representation Finetuning for Mathematical Reasoning' introduces a new method called Bias-REstrained Prefix Representation FineTuning (BREP ReFT). This approach aims to enhance the mathematical reasoning capabilities of models by addressing the limitations of existing representation finetuning (ReFT) methods, which struggle with mathematical tasks. Extensive experiments show that BREP ReFT outperforms both standard ReFT and weight-based parameter-efficient finetuning (PEFT) methods.
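For orientation, the sketch below applies a standard LoReFT-style low-rank edit, h + R^T(Wh + b - Rh), to the first k token representations only. This is an assumed baseline prefix intervention; the bias-restraining component that distinguishes BREP is not reproduced here.

```python
import torch
import torch.nn as nn

# Prefix-only representation intervention in the LoReFT style. Rank, prefix
# length, and placement are assumptions for illustration.
class PrefixReFT(nn.Module):
    def __init__(self, dim: int, rank: int = 4, k: int = 8):
        super().__init__()
        self.k = k
        self.R = nn.Parameter(torch.randn(rank, dim) / dim**0.5)  # low-rank projection
        self.W = nn.Linear(dim, rank)                             # learned edit (with bias b)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        h = hidden[:, :self.k]                          # intervene on prefix only
        edit = (self.W(h) - h @ self.R.T) @ self.R      # R^T(Wh + b - Rh)
        return torch.cat([h + edit, hidden[:, self.k:]], dim=1)
```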
Transformers know more than they can tell -- Learning the Collatz sequence
Neutral · Artificial Intelligence
The study investigates the ability of transformer models to predict long steps of the Collatz sequence, a complex arithmetic function mapping odd integers to their odd successors. Model accuracy varies sharply with the base used to encode numbers, reaching up to 99.7% for bases 24 and 32 but dropping to 37% and 25% for bases 11 and 3. Despite these variations, all models exhibit a common learning pattern: they predict accurately on inputs that share certain residuals modulo 2^p.
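Concretely, the function being learned is the accelerated odd-to-odd Collatz step, with inputs and targets serialized as digits in some base b, which is where the base sensitivity enters. A small worked example:

```python
# The odd-to-odd Collatz map: for odd n, compute 3n + 1, then strip all
# factors of 2. Serializing numbers in different bases b is what drives the
# accuracy swings reported above.
def next_odd(n: int) -> int:
    n = 3 * n + 1
    while n % 2 == 0:
        n //= 2
    return n

def encode(n: int, base: int) -> list[int]:
    digits = []
    while n:
        n, d = divmod(n, base)
        digits.append(d)
    return digits[::-1] or [0]

assert next_odd(7) == 11        # 7 -> 22 -> 11
assert encode(11, 24) == [11]   # a single digit in base 24
```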
Thinker: Training LLMs in Hierarchical Thinking for Deep Search via Multi-Turn Interaction
Positive · Artificial Intelligence
The article presents Thinker, a hierarchical thinking model designed to enhance the reasoning capabilities of large language models (LLMs) through multi-turn interactions. Unlike previous methods that relied on end-to-end reinforcement learning without supervision, Thinker allows for a more structured reasoning process by breaking down complex problems into manageable sub-problems. Each sub-problem is represented in both natural language and logical functions, improving the coherence and rigor of the reasoning process.
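A skeleton of that multi-turn loop, with `plan` and `answer` as assumed LLM-call placeholders, might look like the following; the dual natural-language/logical-function representation is the part taken from the summary.

```python
from dataclasses import dataclass

# Hierarchical, multi-turn reasoning skeleton in the spirit of Thinker: a
# planner splits the question into sub-problems carried in both natural
# language and a logical-function form, answered one turn at a time.
@dataclass
class SubProblem:
    nl: str       # natural-language statement
    logic: str    # logical-function form, e.g. "population(capital(France))"

def solve(question: str, plan, answer) -> str:
    subs = plan(question)              # decompose into SubProblem instances
    facts = []
    for sub in subs:                   # one interaction turn per sub-problem
        facts.append(answer(sub, facts))   # later turns see earlier results
    return answer(SubProblem(question, "final"), facts)
```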