GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer

arXiv — cs.CV · Monday, December 8, 2025 at 5:00:00 AM
  • GLDiTalker is a new model for speech-driven 3D facial animation that uses a Graph Latent Diffusion Transformer to address the modality misalignment between audio and mesh signals, a mismatch that degrades lip-sync accuracy and motion diversity. The model employs a two-stage training pipeline to improve both lip-sync precision and motion variability, a notable advance for augmented reality and virtual-human modeling.
  • This development matters because it improves the realism and stability of 3D facial animation, which is essential for augmented-reality and virtual-environment applications. By strengthening lip-sync accuracy and motion diversity, GLDiTalker positions itself as a notable contribution to AI-driven animation, with the potential to reshape user interaction in digital spaces.
  • The introduction of GLDiTalker aligns with ongoing advancements in 3D representation and augmented reality, as seen in recent innovations like Object-X and GeoMVD, which aim to enhance multi-modal interactions and scene understanding. These developments reflect a broader trend towards integrating sophisticated AI models in robotics and virtual environments, emphasizing the importance of accurate and dynamic representations in enhancing user experiences.
— via World Pulse Now AI Editorial System
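
The two-stage pipeline described above can be made concrete with a toy sketch: a first stage that quantizes continuous facial-motion latents against a learned codebook (a VQ-style motion space), and a second stage that runs a diffusion process over those quantized latents. All names, shapes, the codebook, and the noise schedule below are illustrative assumptions for exposition, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 (assumed): a learned codebook maps continuous facial-motion latents
# to discrete codes, yielding a quantized motion space. 32 codes x 8 dims
# are arbitrary toy sizes.
codebook = rng.normal(size=(32, 8))

def quantize(latents):
    """Nearest-neighbour lookup into the codebook (VQ-style)."""
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = d.argmin(axis=1)
    return codes, codebook[codes]

# Stage 2 (assumed): a latent diffusion model, conditioned on audio features,
# would be trained to denoise these latents. Only the forward (noising) step
# is shown here, with a simple linear schedule.
def add_noise(z0, t, T=100):
    alpha = 1.0 - t / T
    return np.sqrt(alpha) * z0 + np.sqrt(1.0 - alpha) * rng.normal(size=z0.shape)

frames = rng.normal(size=(5, 8))        # 5 frames of motion latents (toy data)
codes, z_q = quantize(frames)           # stage 1: discretize motions
z_t = add_noise(z_q, t=50)              # stage 2: forward diffusion step
print(codes.shape, z_q.shape, z_t.shape)  # → (5,) (5, 8) (5, 8)
```

The separation mirrors the article's claim: the quantized space constrains motions to plausible facial configurations (helping stability and lip-sync), while the diffusion stage supplies the stochasticity that gives motion diversity.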
