arXiv:2407.16344v5 Announce Type: replace 
Abstract: High frame-rate (HFR) videos improve the fine-grained expression of actions for action recognition, but they reduce the density of spatio-temporal relation and motion information. Traditional data-driven training therefore requires large numbers of video samples. In real-world scenarios, however, samples are not always sufficient, which has motivated research on few-shot action recognition (FSAR). We observe that most recent FSAR works build the spatio-temporal relation of video samples through temporal alignment after spatial feature extraction, which separates the spatial and temporal features within a sample. They also capture motion information from a narrow perspective, between adjacent frames only and without considering density, so the motion information they capture is insufficient. We therefore propose a novel plug-and-play architecture for FSAR called Spatio-tempOral frAme tuPle enhancer (SOAP). We refer to the model built on this architecture as SOAP-Net. Rather than performing simple feature extraction, SOAP considers temporal connections between different feature channels and the spatio-temporal relation of features. It also captures comprehensive motion information by using frame tuples, whose multiple frames contain more motion information than adjacent frames do. Combining frame tuples of diverse frame counts provides a still broader perspective. SOAP-Net achieves new state-of-the-art performance across well-known benchmarks such as SthSthV2, Kinetics, UCF101, and HMDB51. Extensive empirical evaluations underscore the competitiveness, pluggability, generalization, and robustness of SOAP. The code is released at https://github.com/wenbohuang1002/SOAP.
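
The frame-tuple idea in the abstract can be illustrated with a small sketch. This is not the authors' SOAP implementation (their released code is at the URL above); it is a minimal NumPy example under stated assumptions. The function names `frame_tuples`, `tuple_motion`, and `multi_scale_motion` are hypothetical, the tuples are assumed to be sliding windows of consecutive frames, and plain absolute frame differences stand in for the learned motion features a real model would use. A tuple of size 2 reduces to the adjacent-frame case; larger tuples aggregate motion over more frames.

```python
import numpy as np


def frame_tuples(num_frames: int, tuple_size: int) -> list[tuple[int, ...]]:
    """Index tuples of `tuple_size` consecutive frames (sliding window, stride 1)."""
    if not 1 <= tuple_size <= num_frames:
        raise ValueError("tuple_size must be between 1 and num_frames")
    return [tuple(range(start, start + tuple_size))
            for start in range(num_frames - tuple_size + 1)]


def tuple_motion(video: np.ndarray, tuple_size: int) -> np.ndarray:
    """Motion map per frame tuple for a video of shape (T, H, W, C).

    Assumption: motion inside a tuple is the mean absolute difference between
    its successive frames. Returns shape (T - tuple_size + 1, H, W, C).
    """
    if tuple_size < 2:
        raise ValueError("a tuple needs at least 2 frames to contain motion")
    num_frames = video.shape[0]
    diffs = np.abs(np.diff(video, axis=0))  # (T - 1, H, W, C)
    maps = [diffs[idx[0]:idx[-1]].mean(axis=0)
            for idx in frame_tuples(num_frames, tuple_size)]
    return np.stack(maps)


def multi_scale_motion(video: np.ndarray, tuple_sizes=(2, 3, 4)) -> np.ndarray:
    """Combine tuples of several frame counts by simple averaging (an assumption)."""
    per_size = [tuple_motion(video, k).mean(axis=0) for k in tuple_sizes]
    return np.mean(per_size, axis=0)  # (H, W, C)


if __name__ == "__main__":
    # Toy clip: 8 frames of 4x4 RGB whose brightness rises by 1 per frame.
    clip = np.arange(8, dtype=np.float32).reshape(8, 1, 1, 1) * np.ones((1, 4, 4, 3), np.float32)
    print(frame_tuples(5, 3))
    print(tuple_motion(clip, 3).shape)
    print(multi_scale_motion(clip).shape)
```

Averaging across tuple sizes is only the simplest way to combine the perspectives; a trained model would typically learn how to weight them.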

A novel architecture named SOAP (Spatio-tempOral frAme tuPle enhancer) has been proposed to improve few-shot action recognition (FSAR) by strengthening the capture of spatio-temporal relations and motion information in high frame-rate videos. It addresses a limitation of traditional data-driven training, which often requires large numbers of video samples that are not always available in real-world scenarios.

SOAP: Enhancing Spatio-Temporal Relation and Motion Information Capturing for Few-Shot Action Recognition
