TEMPLE: Incentivizing Temporal Understanding of Video Large Language Models via Progressive Pre-SFT Alignment
Positive · Artificial Intelligence
- TEMPLE (TEMporal Preference LEarning) is a systematic framework that uses Direct Preference Optimization (DPO) to strengthen the temporal reasoning capabilities of Video Large Language Models (Video LLMs). It targets two obstacles: weak temporal correspondence in training data and the absence of explicit temporal supervision in the standard next-token prediction objective (a minimal DPO sketch follows this list).
- This development is significant because it improves how Video LLMs understand and process temporal information in videos. By systematically constructing temporality-intensive preference pairs, TEMPLE refines the model's ability to reason about time, which is crucial for video analysis and understanding applications (see the pair-construction sketch after this list).
- The advancement of TEMPLE aligns with ongoing efforts in the AI community to enhance the efficiency and reliability of Video LLMs. Other frameworks, such as ShaRP and SEASON, also focus on optimizing model performance by addressing computational challenges and mitigating hallucinations. These developments reflect a broader trend in AI research aimed at improving model alignment with human preferences and enhancing the overall robustness of language models in various contexts.
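To make the DPO component in the first point concrete, the snippet below is a minimal sketch of the standard pairwise DPO loss, assuming precomputed sequence log-probabilities from the policy and a frozen reference model; the function and argument names are illustrative and not taken from the paper.

```python
# Minimal sketch of a pairwise DPO loss. Assumes per-sequence
# log-probabilities have already been computed for the chosen and
# rejected responses under both the policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Push the policy to prefer the chosen (temporally grounded)
    response over the rejected one, relative to the reference model."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Negative log-sigmoid of the reward margin between chosen and rejected.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

Here `beta` controls how far the policy is allowed to drift from the reference model while fitting the preferences.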
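The temporality-intensive preference pairs mentioned in the second point could, for example, be built by contrasting an answer grounded in the true frame order with one generated from a temporally corrupted copy of the same video. The sketch below illustrates that idea; `video_llm.generate` and the `PreferencePair` container are hypothetical stand-ins rather than TEMPLE's actual interface.

```python
# One plausible way to construct a temporality-intensive preference pair:
# the "chosen" answer comes from frames in their true order, the
# "rejected" answer from a shuffled copy of the same frames.
import random
from dataclasses import dataclass
from typing import List

@dataclass
class PreferencePair:
    question: str
    chosen: str    # answer grounded in the correct temporal order
    rejected: str  # answer produced from temporally corrupted frames

def build_pair(video_llm, frames: List, question: str,
               seed: int = 0) -> PreferencePair:
    chosen = video_llm.generate(frames, question)
    corrupted = frames.copy()
    random.Random(seed).shuffle(corrupted)  # destroy the temporal structure
    rejected = video_llm.generate(corrupted, question)
    return PreferencePair(question=question, chosen=chosen, rejected=rejected)
```

Pairs built this way penalize answers that ignore temporal structure, which is exactly the kind of signal a DPO objective can then optimize against.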
— via World Pulse Now AI Editorial System