arXiv:2503.17352v3 Announce Type: replace 
Abstract: We introduce OpenVLThinker, one of the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning, achieving notable performance gains on challenging visual reasoning tasks. While text-based reasoning models (e.g., Deepseek R1) show promising results in text-only tasks, distilling their reasoning into LVLMs via supervised fine-tuning (SFT) often results in performance degradation due to imprecise visual grounding. Conversely, purely reinforcement learning (RL)-based methods face a large search space, hindering the emergence of reflective behaviors in smaller models (e.g., 7B LVLMs). Surprisingly, alternating between SFT and RL ultimately results in significant performance improvements after a few iterations. Our analysis reveals that the base model rarely exhibits reasoning behaviors initially, but SFT effectively surfaces these latent actions and narrows the RL search space, accelerating the development of reasoning capabilities. Each subsequent RL stage further refines the model's reasoning skills, producing higher-quality SFT data for continued self-improvement. OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning, notably improving MathVista by 3.8%, EMMA by 2.4%, and HallusionBench by 1.6%. Beyond demonstrating the synergy between SFT and RL for complex reasoning tasks, our findings provide early evidence towards achieving R1-style reasoning in multimodal contexts. The code, model and data are held at https://github.com/yihedeng9/OpenVLThinker.
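The alternation described in the abstract can be pictured as a simple loop. The sketch below shows only the control flow and is not the authors' implementation. The function `iterate_sft_rl` and its parameters `sft_step`, `rl_step`, and `make_sft_data` are hypothetical names introduced for illustration. It also assumes that each cycle continues from the previous checkpoint, which the abstract does not specify.

```python
from typing import Callable, List, TypeVar

Model = TypeVar("Model")
Data = TypeVar("Data")


def iterate_sft_rl(
    model: Model,
    sft_data: Data,
    sft_step: Callable[[Model, Data], Model],
    rl_step: Callable[[Model], Model],
    make_sft_data: Callable[[Model], Data],
    num_iterations: int,
) -> List[Model]:
    """Alternate SFT and RL, returning the model after each full cycle.

    All callables are hypothetical placeholders supplied by the caller.
    Assumption: each cycle starts from the previous cycle's model.
    """
    checkpoints: List[Model] = []
    for _ in range(num_iterations):
        # SFT stage: surface reasoning behaviors, narrowing the RL search space.
        model = sft_step(model, sft_data)
        # RL stage: refine the reasoning that SFT surfaced.
        model = rl_step(model)
        # The refined model produces the SFT data for the next cycle.
        sft_data = make_sft_data(model)
        checkpoints.append(model)
    return checkpoints


# Toy usage: the "model" is a single float score and the "data" a quality number.
toy_checkpoints = iterate_sft_rl(
    model=0.0,
    sft_data=1.0,
    sft_step=lambda m, d: m + 0.5 * d,
    rl_step=lambda m: m + 1.0,
    make_sft_data=lambda m: m,
    num_iterations=3,
)
print(toy_checkpoints)
```

The toy numbers carry no meaning beyond exercising the loop. In a real setup, the three callables would wrap actual training and data-generation code.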

OpenVLThinker, a new open-source large vision-language model, demonstrates chain-of-thought reasoning and improves performance on challenging visual reasoning tasks. By alternating between supervised fine-tuning (SFT) and reinforcement learning (RL), the 7B model refines its reasoning over a few iterations, improving MathVista by 3.8%, EMMA by 2.4%, and HallusionBench by 1.6%. The results offer early evidence that the R1-style reasoning seen in text-only models such as Deepseek R1 can be achieved in multimodal settings.

OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles
