arXiv:2511.20251v1 Announce Type: new 
Abstract: Recent advances in text-to-image (T2I) generation have achieved remarkable visual outcomes through large-scale rectified flow models. However, how these models behave under long prompts remains underexplored. Long prompts encode rich content, spatial, and stylistic information that enhances fidelity but often suppresses diversity, leading to repetitive and less creative outputs. In this work, we systematically study this fidelity-diversity dilemma and reveal that state-of-the-art models exhibit a clear drop in diversity as prompt length increases. To enable consistent evaluation, we introduce LPD-Bench, a benchmark designed for assessing both fidelity and diversity in long-prompt generation. Building on our analysis, we develop a theoretical framework that increases sampling entropy through prompt reformulation and propose a training-free method, PromptMoG, which samples prompt embeddings from a Mixture-of-Gaussians in the embedding space to enhance diversity while preserving semantics. Extensive experiments on four state-of-the-art models, SD3.5-Large, Flux.1-Krea-Dev, CogView4, and Qwen-Image, demonstrate that PromptMoG consistently improves long-prompt generation diversity without semantic drifting.

Recent research introduces PromptMoG, a training-free method that improves diversity in long-prompt image generation by sampling prompt embeddings from a Mixture-of-Gaussians in the embedding space. It targets the fidelity-diversity dilemma observed in state-of-the-art text-to-image models, whose outputs become less diverse as prompt length increases.
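
To make the sampling idea concrete, here is a minimal sketch of drawing a prompt embedding from an isotropic Mixture-of-Gaussians. It is an illustration under stated assumptions, not the authors' implementation: the function name, the toy 3-dimensional vectors, the equal weights, and the shared `sigma` are all hypothetical, and the abstract does not say how the component means or variances are chosen. Here the means are simply assumed to be embeddings of reformulated versions of the same prompt.

```python
import random

def sample_prompt_embedding(centers, weights, sigma, rng):
    """Draw one embedding from an isotropic Mixture-of-Gaussians.

    centers: list of equal-length float lists (assumed component means,
             e.g. embeddings of reformulations of one long prompt)
    weights: mixture weights, one per center (need not sum to 1)
    sigma:   shared standard deviation (hypothetical hyperparameter)
    rng:     a random.Random instance, for reproducibility
    """
    # Pick one component in proportion to its weight.
    mean = rng.choices(centers, weights=weights, k=1)[0]
    # Perturb each coordinate with independent Gaussian noise.
    return [m + rng.gauss(0.0, sigma) for m in mean]

rng = random.Random(0)
centers = [[0.0, 1.0, 0.0], [0.2, 0.9, 0.1]]  # toy stand-ins for embeddings
samples = [sample_prompt_embedding(centers, [0.5, 0.5], 0.05, rng)
           for _ in range(4)]
```

In a real pipeline each sampled vector would replace the single deterministic prompt embedding that conditions the generator. A small `sigma` keeps samples close to the original semantics, while a larger one trades fidelity for diversity.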

PromptMoG: Enhancing Diversity in Long-Prompt Image Generation via Prompt Embedding Mixture-of-Gaussian Sampling

arXiv:2512.02161v1 Announce Type: new 
Abstract: Text-to-image (T2I) models are capable of generating visually impressive images, yet they often fail to accurately capture specific attributes in user prompts, such as the correct number of objects with the specified colors. The diversity of such errors underscores the need for a hierarchical evaluation framework that can compare prompt adherence abilities of different image generation models. Simultaneously, benchmarks of vision language models (VLMs) have not kept pace with the complexity of scenes that VLMs are used to annotate. In this work, we propose a structured methodology for jointly evaluating T2I models and VLMs by testing whether VLMs can identify 27 specific failure modes in the images generated by T2I models conditioned on challenging prompts. Our second contribution is a dataset of prompts and images generated by 5 T2I models (Flux, SD3-Medium, SD3-Large, SD3.5-Medium, SD3.5-Large) and the corresponding annotations from VLMs (Molmo, InternVL3, Pixtral) annotated by an LLM (Llama3) to test whether VLMs correctly identify the failure mode in a generated image. By analyzing failure modes on a curated set of prompts, we reveal systematic errors in attribute fidelity and object representation. Our findings suggest that current metrics are insufficient to capture these nuanced errors, highlighting the importance of targeted benchmarks for advancing generative model reliability and interpretability.

FineGRAIN introduces a structured methodology for jointly evaluating text-to-image (T2I) models and the vision language models (VLMs) used to judge them. It targets specific errors in image generation, such as an incorrect object count or color, by testing whether VLMs can identify 27 failure modes in images from five T2I models: Flux, SD3-Medium, SD3-Large, SD3.5-Medium, and SD3.5-Large.
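
One way to picture the judging step is a tally of how often a VLM's verdict matches a reference label for each failure mode. The sketch below is hypothetical: the record layout, the function name, and the three toy records are assumptions made for illustration, and none of it reflects the dataset's actual schema or results.

```python
from collections import defaultdict

def judge_accuracy_by_mode(records):
    """Return {failure_mode: fraction of records where the VLM agrees}.

    records: iterable of (failure_mode, reference_label, vlm_label),
             where each label is True if the failure is present.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for mode, reference, verdict in records:
        totals[mode] += 1
        hits[mode] += int(reference == verdict)
    return {mode: hits[mode] / totals[mode] for mode in totals}

# Toy records, invented purely to exercise the function.
records = [
    ("counting", True, True),
    ("counting", True, False),
    ("color", False, False),
]
accuracy = judge_accuracy_by_mode(records)
```

Reporting agreement per failure mode, rather than as one pooled score, shows which kinds of error a given judge tends to miss.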

FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges

arXiv:2511.22699v2 Announce Type: replace 
Abstract: The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle -- from a curated data infrastructure to a streamlined training curriculum -- we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (<16GB VRAM). Additionally, our omni-pre-training paradigm enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.

Z-Image is an efficient image generation foundation model with a 6B-parameter architecture based on the Scalable Single-Stream Diffusion Transformer (S3-DiT). It aims to rival proprietary systems such as Nano Banana Pro and Seedream 4.0 while remaining practical for inference and fine-tuning on consumer-grade hardware, unlike leading open-source alternatives that use 20B to 80B parameters.
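
As a quick sanity check on the training budget quoted in the abstract, the two reported figures imply an hourly GPU rate. The rate itself is derived here and is not stated in the source. The variable names are illustrative.

```python
gpu_hours = 314_000        # "314K H800 GPU hours" from the abstract
total_cost_usd = 630_000   # "approx. $630K" from the abstract

# Implied price per H800 GPU hour (a derived figure, not a quoted one).
implied_rate = total_cost_usd / gpu_hours
print(f"Implied rate: ${implied_rate:.2f} per GPU hour")
# prints "Implied rate: $2.01 per GPU hour"
```

Since both inputs are rounded in the abstract, the result is best read as roughly $2 per GPU hour.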
