EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture

arXiv — cs.CV · Wednesday, December 10, 2025 at 5:00:00 AM
  • EMMA has been introduced as an efficient, unified architecture for multimodal understanding, generation, and editing. Its autoencoder uses a 32x compression ratio, sharply reducing the number of visual tokens relative to text tokens, and the architecture combines channel-wise concatenation with a shared-and-decoupled network to improve performance across tasks.
  • This design is significant because it enables more balanced training between understanding and generation tasks, which could improve performance across AI applications ranging from image processing to natural language understanding.
  • The introduction of EMMA aligns with ongoing advancements in multimodal AI technologies, such as Qwen3-VL's capabilities in analyzing lengthy video content and other models enhancing image generation. These developments reflect a growing trend towards creating more integrated and efficient AI systems that can handle diverse data types and tasks.
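The token arithmetic behind the bullet points above can be made concrete. The following is an illustrative numpy sketch, not EMMA's actual code: the latent channel width and image size are assumed values, chosen only to show why a 32x-downsampling autoencoder keeps visual token counts low, and why channel-wise concatenation adds conditioning without growing the token sequence.

```python
import numpy as np

def latent_tokens(image_hw, compression=32, channels=16):
    """Encode an (H, W) image into a flat grid of latent tokens.

    `compression` follows the 32x ratio described in the summary;
    `channels` is an assumed latent width, not a figure from the paper.
    """
    h, w = image_hw
    lh, lw = h // compression, w // compression
    # Each latent grid cell becomes one token for the transformer.
    return np.zeros((lh * lw, channels))

target = latent_tokens((512, 512))     # 16x16 grid -> 256 tokens
condition = latent_tokens((512, 512))  # e.g. the image being edited

# Channel-wise concatenation: the token count stays at 256 while the
# channel dimension doubles, so attention cost is unchanged.
fused = np.concatenate([target, condition], axis=-1)

# Sequence-wise concatenation (the common alternative) would instead
# double the token count, making self-attention roughly 4x costlier.
print(fused.shape)  # (256, 32)
```

Under these assumptions, a 512x512 image collapses to just 256 tokens, which is what lets a unified model balance image tokens against text tokens during joint training.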
— via World Pulse Now AI Editorial System

Continue Reading
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Positive · Artificial Intelligence
The introduction of Z-Image, a 6B-parameter generative model utilizing a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture, aims to provide an efficient alternative to existing high-performance image generation models like Nano Banana Pro and Seedream 4.0, which are characterized by their massive parameter counts.