FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges
Neutral · Artificial Intelligence
- FineGRAIN introduces a structured methodology for evaluating failure modes in text-to-image (T2I) models, using vision language models (VLMs) as judges. The approach pinpoints specific errors in generated images, such as incorrect object counts or colors, by testing 27 failure modes across five T2I models, including Flux and several versions of Stable Diffusion 3 (SD3).
- The work matters because it addresses a key limitation of current T2I models: imperfect adherence to user prompts. By establishing a hierarchical evaluation framework, FineGRAIN aims to raise the standard for assessing and improving image generation quality.
- FineGRAIN also reflects a growing recognition of the complexity of multimodal evaluation, paralleling advances in related areas such as social interaction understanding in videos and diversity in long-prompt image generation. Together, these efforts aim to ensure AI models can accurately interpret and generate content that aligns with user expectations.
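The judging loop described above can be sketched as a small evaluation harness. This is a hypothetical illustration only, assuming a per-failure-mode pass/fail judge; the stub function, attribute dictionaries, and names below are not from the FineGRAIN paper, where the judge would be an actual VLM call:

```python
from collections import defaultdict

def stub_vlm_judge(image, prompt, failure_mode):
    """Placeholder for a VLM call: returns True if the image satisfies
    the prompt with respect to the given failure mode (hypothetical).
    A real judge would send the image plus a mode-specific question to
    a vision language model and parse its yes/no answer."""
    return image.get(failure_mode) == prompt.get(failure_mode)

def evaluate(model_outputs, judge):
    """Aggregate per-failure-mode pass rates over (prompt, image) pairs."""
    passes = defaultdict(int)
    totals = defaultdict(int)
    for prompt, image in model_outputs:
        for mode in prompt:  # each prompt targets specific failure modes
            totals[mode] += 1
            if judge(image, prompt, mode):
                passes[mode] += 1
    return {mode: passes[mode] / totals[mode] for mode in totals}

# Toy run: prompts and "images" are dicts of attributes for illustration.
outputs = [
    ({"count": 3, "color": "red"}, {"count": 3, "color": "blue"}),
    ({"count": 2, "color": "red"}, {"count": 2, "color": "red"}),
]
rates = evaluate(outputs, stub_vlm_judge)
print(rates)  # per-failure-mode pass rates, e.g. count vs. color
```

Reporting results per failure mode, rather than as a single aggregate score, is what makes this style of evaluation diagnostic: it shows which kinds of prompt constraints a model tends to violate.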
— via World Pulse Now AI Editorial System
