arXiv:2503.18559v3 Announce Type: replace 
Abstract: Text-to-Video (T2V) generation has attracted significant attention for its ability to synthesize realistic videos from textual descriptions. However, existing models struggle to balance computational efficiency and high visual quality, particularly on resource-limited devices, e.g., iGPUs and mobile phones. Most prior work prioritizes visual fidelity while overlooking the need for smaller, more efficient models suitable for real-world deployment. To address this challenge, we propose a lightweight T2V framework, termed Hummingbird, which prunes existing models and enhances visual quality through visual feedback learning. Our approach reduces the size of the U-Net from 1.4 billion to 0.7 billion parameters, significantly improving efficiency while preserving high-quality video generation. Additionally, we introduce a novel data processing pipeline that leverages Large Language Models (LLMs) and Video Quality Assessment (VQA) models to enhance the quality of both text prompts and video data. To support user-driven training and style customization, we publicly release the full training code, including data processing and model training. Extensive experiments show that our method achieves a 31X speedup compared to state-of-the-art models such as VideoCrafter2, while also attaining the highest overall score on VBench. Moreover, our method supports the generation of videos with up to 26 frames, addressing the limitations of existing U-Net-based methods in long video generation. Notably, the entire training process requires only four GPUs, yet delivers performance competitive with existing leading methods. Hummingbird presents a practical and efficient solution for T2V generation, combining high performance, scalability, and flexibility for real-world applications.
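
To put the parameter counts in the abstract into perspective, the short sketch below works out the relative reduction and a rough weight-storage footprint. The 2-bytes-per-parameter figure (fp16 storage) and the helper name `weight_memory_gb` are assumptions made for illustration; the abstract does not state the numeric precision used, and the reported 31X speedup is a measured result that cannot be derived from parameter count alone.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed to store model weights, in GB (10^9 bytes).

    bytes_per_param=2 assumes fp16 storage; this is an illustrative
    assumption, not a detail taken from the abstract.
    """
    return num_params * bytes_per_param / 1e9


original_params = 1.4e9  # U-Net size before pruning (from the abstract)
pruned_params = 0.7e9    # U-Net size after pruning (from the abstract)

# Fraction of U-Net parameters removed by pruning.
reduction = 1 - pruned_params / original_params

print(f"parameter reduction: {reduction:.0%}")
print(f"fp16 weights before: {weight_memory_gb(original_params):.1f} GB")
print(f"fp16 weights after:  {weight_memory_gb(pruned_params):.1f} GB")
```

Under these assumptions the U-Net weights alone shrink by half, which matters most on devices where memory is shared with the rest of the system, such as the iGPUs and mobile phones the abstract mentions.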

AMD's Hummingbird is a lightweight Text-to-Video (T2V) framework that generates videos from text descriptions. It targets a central trade-off in the field: balancing high visual quality with computational efficiency, especially on resource-limited devices such as iGPUs and mobile phones. By pruning the U-Net from 1.4 billion to 0.7 billion parameters and recovering quality through visual feedback learning, it aims to make T2V practical for real-world deployment.

AMD-Hummingbird: Towards an Efficient Text-to-Video Model

arXiv:2601.06874v2 Announce Type: replace 
Abstract: Most existing 3D referring expression segmentation (3DRES) methods rely on dense, high-quality point clouds, while real-world agents such as robots and mobile phones operate with only a few sparse RGB views and strict latency constraints. We introduce Multi-view 3D Referring Expression Segmentation (MV-3DRES), where the model must recover scene structure and segment the referred object directly from sparse multi-view images. Traditional two-stage pipelines, which first reconstruct a point cloud and then perform segmentation, often yield low-quality geometry, produce coarse or degraded target regions, and run slowly. We propose the Multimodal Visual Geometry Grounded Transformer (MVGGT), an efficient end-to-end framework that integrates language information into sparse-view geometric reasoning through a dual-branch design. Training in this setting exposes a critical optimization barrier, termed Foreground Gradient Dilution (FGD), where sparse 3D signals lead to weak supervision. To resolve this, we introduce Per-view No-target Suppression Optimization (PVSO), which provides stronger and more balanced gradients across views, enabling stable and efficient learning. To support consistent evaluation, we build MVRefer, a benchmark that defines standardized settings and metrics for MV-3DRES. Experiments show that MVGGT establishes the first strong baseline and achieves both high accuracy and fast inference, outperforming existing alternatives. Code and models are publicly available at https://mvggt.github.io.

The Multimodal Visual Geometry Grounded Transformer (MVGGT) has been introduced as a novel framework for Multiview 3D Referring Expression Segmentation (MV-3DRES), addressing the limitations of existing methods that depend on dense point clouds. MVGGT enables segmentation directly from sparse multi-view images, enhancing efficiency and performance in real-world applications.

MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation
