arXiv:2409.00449v2 Announce Type: replace 
Abstract: 2D-to-3D human pose lifting is an ill-posed problem because of depth ambiguity and occlusion. Existing methods that rely on spatial and temporal consistency alone cannot resolve these problems, especially under significant occlusion or highly dynamic actions. Semantic information, however, offers a complementary signal that can help disambiguate such cases. To this end, we propose LangPose, a framework that leverages action knowledge by aligning motion embeddings with text embeddings of fine-grained action labels. LangPose operates in two stages: pretraining and fine-tuning. In the pretraining stage, the model simultaneously learns to recognize actions and to reconstruct 3D poses from masked and noisy 2D poses. In the fine-tuning stage, the model is further refined on real-world 3D human pose estimation datasets without action labels. Additionally, our framework masks body parts and time windows during motion modeling, encouraging the model to rely on semantic information when spatial and temporal consistency are unreliable. Experiments demonstrate the effectiveness of LangPose, which achieves state-of-the-art-level performance in 3D pose estimation on public datasets, including Human3.6M and MPI-INF-3DHP. Specifically, LangPose achieves an MPJPE of 36.7 mm on Human3.6M with detected 2D poses as input and 15.5 mm on MPI-INF-3DHP with ground-truth 2D poses as input.
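
As a rough illustration of two ideas in the abstract, the sketch below masks a body part or a time window in a 2D pose sequence and computes MPJPE (the mean Euclidean distance between predicted and ground-truth joints). It is not the authors' code. The 17-joint grouping in `BODY_PARTS`, the zero-fill masking, and all function names are assumptions made for the example; the abstract does not say how masked inputs are represented.

```python
import numpy as np

# Hypothetical grouping of a 17-joint skeleton into body parts.
# The joint indices are an assumption for illustration only.
BODY_PARTS = {
    "torso": [0, 7, 8, 9, 10],
    "right_leg": [1, 2, 3],
    "left_leg": [4, 5, 6],
    "left_arm": [11, 12, 13],
    "right_arm": [14, 15, 16],
}


def mask_body_part(poses_2d, part):
    """Zero out all joints of one body part in every frame.

    poses_2d has shape (T, J, 2): frames, joints, (x, y).
    Zero-fill is an assumed masking scheme, not one taken from the paper.
    """
    masked = poses_2d.copy()
    masked[:, BODY_PARTS[part], :] = 0.0
    return masked


def mask_time_window(poses_2d, start, length):
    """Zero out all joints in frames start .. start + length - 1."""
    masked = poses_2d.copy()
    masked[start:start + length] = 0.0
    return masked


def mpjpe(pred_3d, gt_3d):
    """Mean per-joint position error over all frames and joints.

    Both arrays have shape (T, J, 3); the result is in the input units.
    """
    return float(np.linalg.norm(pred_3d - gt_3d, axis=-1).mean())
```

For example, `mask_body_part(seq, "left_arm")` hides one limb across the whole clip, while `mask_time_window(seq, 3, 4)` hides four consecutive frames. In both cases a model could no longer rely on spatial or temporal neighbours alone to reconstruct the missing joints.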

LangPose is a new framework for 3D human pose estimation that addresses the challenges of 2D-to-3D pose lifting, particularly under occlusion and highly dynamic actions. By aligning motion embeddings with text embeddings of action labels, LangPose improves the model's ability to interpret complex movements. Experiments demonstrate its effectiveness, with state-of-the-art-level results on the Human3.6M and MPI-INF-3DHP benchmarks.
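
One common way to align two embedding spaces is a CLIP-style symmetric contrastive (InfoNCE) loss over cosine similarities. The sketch below shows that generic technique in NumPy. The source does not state which loss LangPose uses, so the objective, the `temperature` value, and the function names are all assumptions, not the authors' method.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def alignment_loss(motion_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    motion_emb[i] and text_emb[i] are assumed to describe the same action;
    both have shape (B, D). The temperature is an assumed hyperparameter.
    """
    m = l2_normalize(motion_emb)
    t = l2_normalize(text_emb)
    logits = m @ t.T / temperature  # (B, B) scaled cosine similarities
    idx = np.arange(len(m))
    # Motion-to-text: each row should pick out its own column.
    loss_m2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-motion: each column should pick out its own row.
    loss_t2m = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float(0.5 * (loss_m2t + loss_t2m))
```

The loss is near zero when each motion embedding is most similar to its own label's text embedding, and grows when pairs are mismatched. Note that this simple form treats every other item in the batch as a negative, so two clips that share an action label would be pushed apart unless the batch is built to avoid that.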

LangPose: Language-Aligned Motion for Robust 3D Human Pose Estimation
