AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video

arXiv — cs.CV · Monday, November 24, 2025 at 5:00:00 AM
  • AVATAR is a novel reinforcement-learning framework designed to enhance multimodal reasoning over long-horizon video. It addresses key limitations of existing methods such as Group Relative Policy Optimization (GRPO), improving sample efficiency and resolving vanishing advantages and uniform credit assignment through an off-policy training architecture.
  • This development is significant as a step forward in enabling AI agents to process and reason over complex video data, a capability crucial for domains such as robotics and autonomous systems.
  • AVATAR also reflects a broader trend in AI research toward more efficient and effective reinforcement learning. Related optimization techniques, such as Group Turn Policy Optimization and Group-Aware Policy Optimization, likewise aim to refine the training of large language models and other AI systems, addressing challenges like data inefficiency and limited output diversity.
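To make the "vanishing advantages" limitation concrete, here is a minimal, illustrative sketch of the group-normalized advantage that GRPO-style methods compute. This is not AVATAR's or any paper's actual implementation; the function name and epsilon handling are assumptions for illustration only.

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Illustrative GRPO-style advantage: each sampled rollout is
    scored relative to its group's mean reward, normalized by the
    group's standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# When every rollout in a group earns the same reward (common on hard
# long-horizon video tasks, where all rollouts fail or all succeed),
# the advantages collapse to zero and the policy gets no gradient signal.
uniform = grpo_advantages([1.0, 1.0, 1.0, 1.0])  # all zeros

# With mixed outcomes, the group-relative signal is informative.
mixed = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

The sketch also hints at the "uniform credit assignment" issue: a single scalar advantage per rollout is applied uniformly to every token, so individual reasoning steps within a long trajectory receive no differentiated credit.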
— via World Pulse Now AI Editorial System
