arXiv:2508.01533v2 Announce Type: replace 
Abstract: While recent multimodal models have shown progress in vision-language tasks, small-scale variants still struggle with the fine-grained temporal reasoning required for video understanding. We introduce ReasonAct, a method that enhances video reasoning in smaller models through a three-stage training process: first building a foundation with text-only reasoning, then fine-tuning on video, and finally refining with temporal-aware reinforcement learning. We build upon Temporal Group Relative Policy Optimization (T-GRPO) by incorporating temporal consistency modeling into policy optimization. We also propose a biomechanically motivated sub-action decomposition mechanism that provides graduated rewards for constituent action phases. In experiments on HMDB51, UCF-101, and Kinetics-400, our 3B-parameter model achieves 67.2%, 94.1%, and 78.9% accuracy, respectively, improvements of 17.9, 15.8, and 12.3 points over baselines. Ablation studies validate that our progressive training enables smaller models to achieve competitive video reasoning performance while maintaining computational efficiency.
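
The abstract mentions graduated rewards for constituent action phases and a group-relative policy optimization scheme, but does not give formulas. The following is only a minimal sketch under stated assumptions, not the authors' implementation: `graduated_reward` is a hypothetical scoring rule (the fraction of reference sub-action phases matched position by position), the phase names are invented for illustration, and `group_relative_advantages` shows the standard group-relative normalization used in GRPO-style methods. The temporal consistency term of T-GRPO is not modeled here.

```python
from statistics import mean, pstdev


def graduated_reward(predicted_phases, reference_phases):
    """Hypothetical partial-credit reward: share of reference phases
    that the prediction matches at the same position."""
    if not reference_phases:
        return 0.0
    matched = sum(p == r for p, r in zip(predicted_phases, reference_phases))
    return matched / len(reference_phases)


def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of sampled responses,
    as in GRPO-style advantage estimation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative phases for a "jump" action (assumed, not from the paper).
reference = ["crouch", "jump", "land"]
group = [
    ["crouch", "jump", "land"],  # all phases correct
    ["crouch", "run", "land"],   # one phase wrong
    ["walk", "sit", "stand"],    # no phase correct
]
rewards = [graduated_reward(g, reference) for g in group]
advantages = group_relative_advantages(rewards)
```

Under this sketch, a response that gets some phases right receives partial credit rather than a zero reward, so it is ranked between fully correct and fully incorrect responses within its group.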

ReasonAct has been introduced as a novel method to enhance fine-grained video reasoning in small models through a structured three-stage training process, which includes text-only reasoning, video fine-tuning, and reinforcement learning. This approach aims to address the limitations of small-scale multimodal models in understanding complex video content.

ReasonAct: Progressive Training for Fine-Grained Video Reasoning in Small Models
