arXiv:2511.11435v1 Announce Type: new 
Abstract: Our work addresses the ambiguity between generalization and memorization in text-to-image diffusion models, focusing on a specific case we term multimodal iconicity. This refers to instances where images and texts evoke culturally shared associations, such as when a title recalls a familiar artwork or film scene. While prior research on memorization and unlearning emphasizes forgetting, we examine what is remembered and how, focusing on the balance between recognizing cultural references and reproducing them. We introduce an evaluation framework that separates recognition (whether a model identifies a reference) from realization (how it depicts that reference, through replication or reinterpretation), and quantifies both dimensions with corresponding measures. By evaluating five diffusion models across 767 Wikidata-derived cultural references spanning static and dynamic imagery, we show that our framework distinguishes replication from transformation more effectively than existing similarity-based methods. To assess linguistic sensitivity, we conduct prompt perturbation experiments using synonym substitutions and literal image descriptions, finding that models often reproduce iconic visual structures even when textual cues are altered. Finally, our analysis shows that cultural alignment correlates not only with training data frequency, but also with textual uniqueness, reference popularity, and creation date. Our work reveals that the value of diffusion models lies not only in what they reproduce but in how they transform and recontextualize cultural knowledge, advancing evaluation beyond simple text-image matching toward richer contextual understanding.

The article examines the balance between generalization and memorization in text-to-image diffusion models, focusing on 'multimodal iconicity.' This concept refers to how images and texts evoke shared cultural associations. The authors introduce an evaluation framework that distinguishes between recognition of cultural references and their realization in images. They evaluate five diffusion models against 767 cultural references from Wikidata, demonstrating their framework's ability to differentiate between replication and transformation.

The Persistence of Cultural Memory: Investigating Multimodal Iconicity in Diffusion Models
