arXiv:2510.25818v1 Announce Type: new 
Abstract: Text-to-image diffusion models often exhibit degraded performance when generating images beyond their training resolution. Recent training-free methods can mitigate this limitation, but they often require substantial computation or are incompatible with recent Diffusion Transformer models. In this paper, we propose ScaleDiff, a model-agnostic and highly efficient framework for extending the resolution of pretrained diffusion models without any additional training. A core component of our framework is Neighborhood Patch Attention (NPA), an efficient mechanism that reduces computational redundancy in the self-attention layer with non-overlapping patches. We integrate NPA into an SDEdit pipeline and introduce Latent Frequency Mixing (LFM) to better generate fine details. Furthermore, we apply Structure Guidance to enhance global structure during the denoising process. Experimental results demonstrate that ScaleDiff achieves state-of-the-art performance among training-free methods in terms of both image quality and inference speed on both U-Net and Diffusion Transformer architectures.
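To make the efficiency argument concrete, the sketch below counts query-key pairs for full self-attention versus attention restricted to a local window around each non-overlapping block of queries. It is back-of-the-envelope arithmetic only, not the paper's NPA implementation: the function names, latent size, patch size, and window size are hypothetical, and the window is assumed to fit entirely inside the latent.

```python
def full_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when every token attends to every token."""
    tokens = height * width
    return tokens * tokens


def patch_attention_pairs(height: int, width: int, patch: int, window: int) -> int:
    """Query-key pairs when each non-overlapping patch x patch block of
    queries attends only to a window x window block of keys around it.

    Assumption: the window always fits inside the latent, so every
    query block sees exactly window * window keys.
    """
    if height % patch or width % patch:
        raise ValueError("latent size must be divisible by the patch size")
    if not patch <= window <= min(height, width):
        raise ValueError("window must cover the patch and fit in the latent")
    blocks = (height // patch) * (width // patch)
    return blocks * (patch * patch) * (window * window)


if __name__ == "__main__":
    h = w = 128  # hypothetical latent grid, not a value from the paper
    full = full_attention_pairs(h, w)
    local = patch_attention_pairs(h, w, patch=32, window=64)
    print(f"full: {full}, local: {local}, ratio: {full / local:.1f}x")
```

Under these assumptions the cost of the local variant grows linearly with the number of tokens for a fixed window, whereas full attention grows quadratically.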

A new framework called ScaleDiff extends the resolution of pretrained text-to-image diffusion models without additional training and without the heavy computation or Diffusion Transformer incompatibility of earlier training-free methods. It addresses a common limitation: image quality degrades when a model generates beyond its training resolution. Because it is model-agnostic and efficient, ScaleDiff makes it easier for creators and researchers in image synthesis to produce detailed, higher-resolution visuals on both U-Net and Diffusion Transformer architectures.

ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion

arXiv:2511.22505v2 Announce Type: replace-cross 
Abstract: Robot manipulation in the real world is fundamentally constrained by the visual sim2real gap, where depth observations collected in simulation fail to reflect the complex noise patterns inherent to real sensors. In this work, inspired by the denoising capability of diffusion models, we invert the conventional perspective and propose a clean-to-noisy paradigm that learns to synthesize noisy depth, thereby bridging the visual sim2real gap through purely simulation-driven robotic learning. Building on this idea, we introduce RealD$^2$iff, a hierarchical coarse-to-fine diffusion framework that decomposes depth noise into global structural distortions and fine-grained local perturbations. To enable progressive learning of these components, we further develop two complementary strategies: Frequency-Guided Supervision (FGS) for global structure modeling and Discrepancy-Guided Optimization (DGO) for localized refinement. To integrate RealD$^2$iff seamlessly into imitation learning, we construct a pipeline that spans six stages. We provide comprehensive empirical and experimental validation demonstrating the effectiveness of this paradigm. RealD$^2$iff enables two key applications: (1) generating real-world-like depth to construct clean-noisy paired datasets without manual sensor data collection, and (2) achieving zero-shot sim2real robot manipulation, substantially improving real-world performance without additional fine-tuning.
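The coarse-versus-fine split of depth noise can be illustrated with a simple low-pass decomposition. The sketch below is a generic stand-in, not the paper's method: it uses a 1D box filter to separate a noise signal into a smooth component (a loose analogue of global structural distortion) and a residual (a loose analogue of local perturbation). The function name, the filter choice, and the `radius` parameter are all assumptions made for illustration; FGS and DGO are not reproduced here.

```python
from typing import List, Tuple


def split_coarse_fine(noise: List[float], radius: int) -> Tuple[List[float], List[float]]:
    """Split a 1D noise signal into a smooth part and a residual part.

    The smooth part is a moving average with half-width `radius`
    (the averaging window is truncated at the borders). The residual
    is whatever the moving average leaves behind, so that
    coarse[i] + fine[i] reproduces noise[i] up to rounding.
    """
    n = len(noise)
    coarse = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        coarse.append(sum(noise[lo:hi]) / (hi - lo))
    fine = [x - c for x, c in zip(noise, coarse)]
    return coarse, fine


if __name__ == "__main__":
    # Hypothetical depth-noise samples along one image row.
    row_noise = [0.0, 0.1, 0.9, 0.1, 0.0, -0.1, 0.0]
    coarse, fine = split_coarse_fine(row_noise, radius=1)
    print("coarse:", [round(c, 3) for c in coarse])
    print("fine:  ", [round(f, 3) for f in fine])
```

A hierarchical model could then treat the two components separately, which is the intuition behind a coarse-to-fine design.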

Researchers have introduced RealD$^2$iff, a hierarchical coarse-to-fine diffusion framework aimed at the visual sim2real gap in robot manipulation. It learns to synthesize noisy, sensor-like depth from clean simulated depth (a clean-to-noisy paradigm), so that policies trained purely in simulation can operate in real-world environments without additional fine-tuning.

RealD$^2$iff: Bridging Real-World Gap in Robot Manipulation via Depth Diffusion

arXiv:2512.07247v1 Announce Type: cross 
Abstract: Recent studies have extended diffusion-based instruction-driven 2D image editing pipelines to 3D Gaussian Splatting (3DGS), enabling faithful manipulation of 3DGS assets and greatly advancing 3DGS content creation. However, this also exposes these assets to serious risks of unauthorized editing and malicious tampering. Although imperceptible adversarial perturbations against diffusion models have proven effective for protecting 2D images, applying them to 3DGS encounters two major challenges: view-generalizable protection and balancing invisibility with protection capability. In this work, we propose the first editing safeguard for 3DGS, termed AdLift, which prevents instruction-driven editing across arbitrary views and dimensions by lifting strictly bounded 2D adversarial perturbations into a 3D Gaussian-represented safeguard. To ensure both the effectiveness and the invisibility of the adversarial perturbations, these safeguard Gaussians are progressively optimized across training views using a tailored Lifted PGD, which first conducts gradient truncation during back-propagation from the editing model at the rendered image and applies projected gradients to strictly constrain the image-level perturbation. Then, the resulting perturbation is backpropagated to the safeguard Gaussian parameters via an image-to-Gaussian fitting operation. We alternate between gradient truncation and image-to-Gaussian fitting, yielding consistent adversarial protection across different viewpoints that generalizes to novel views. Empirically, qualitative and quantitative results demonstrate that AdLift effectively protects against state-of-the-art instruction-driven 2D image and 3DGS editing.
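The "strictly bounded" image-level perturbation refers to the projection step of projected gradient descent (PGD). The sketch below shows one textbook L-infinity PGD update on a flat list of pixel values in [0, 1]. It is standard PGD, not AdLift's Lifted PGD: gradient truncation and image-to-Gaussian fitting are not modeled, and the function names, step size, and budget `epsilon` are hypothetical.

```python
from typing import List


def sign(value: float) -> float:
    """Return -1.0, 0.0, or 1.0 according to the sign of value."""
    return float((value > 0) - (value < 0))


def pgd_step(
    clean: List[float],
    adversarial: List[float],
    gradient: List[float],
    step_size: float,
    epsilon: float,
) -> List[float]:
    """One L-infinity PGD update.

    `gradient` is assumed to be the gradient of the attack objective
    with respect to each pixel of `adversarial`. The update moves each
    pixel along the gradient sign, then projects the perturbation back
    into [-epsilon, epsilon] and the pixel back into [0, 1].
    """
    updated = []
    for x, x_adv, g in zip(clean, adversarial, gradient):
        candidate = x_adv + step_size * sign(g)
        delta = max(-epsilon, min(epsilon, candidate - x))  # projection
        updated.append(max(0.0, min(1.0, x + delta)))       # valid pixel
    return updated


if __name__ == "__main__":
    clean = [0.5, 0.0, 1.0]          # hypothetical rendered pixels
    grad = [1.0, -2.0, 3.0]          # hypothetical objective gradient
    adv = clean
    for _ in range(3):
        adv = pgd_step(clean, adv, grad, step_size=0.02, epsilon=0.03)
    print(adv)
```

However many steps are taken, the projection keeps every pixel within `epsilon` of its clean value, which is what keeps the perturbation imperceptible.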

AdLift has been introduced as a pioneering safeguard for 3D Gaussian Splatting (3DGS) assets, addressing the vulnerabilities posed by instruction-driven editing. This method lifts 2D adversarial perturbations into a 3D Gaussian-represented safeguard, ensuring protection against unauthorized edits across various views and dimensions.

AdLift: Lifting Adversarial Perturbations to Safeguard 3D Gaussian Splatting Assets Against Instruction-Driven Editing

arXiv:2502.03930v4 Announce Type: replace-cross 
Abstract: Several recent studies have attempted to autoregressively generate continuous speech representations without discrete speech tokens by combining diffusion and autoregressive models, yet they often face challenges with excessive computational loads or suboptimal outcomes. In this work, we propose Diffusion Transformer Autoregressive Modeling (DiTAR), a patch-based autoregressive framework combining a language model with a diffusion transformer. This approach significantly enhances the efficacy of autoregressive models for continuous tokens and reduces computational demands. DiTAR utilizes a divide-and-conquer strategy for patch generation, where the language model processes aggregated patch embeddings and the diffusion transformer subsequently generates the next patch based on the output of the language model. For inference, we propose defining temperature as the time point of introducing noise during the reverse diffusion ODE to balance diversity and determinism. We also show in the extensive scaling analysis that DiTAR has superb scalability. In zero-shot speech generation, DiTAR achieves state-of-the-art performance in robustness, speaker similarity, and naturalness.
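The divide-and-conquer loop described in the abstract can be sketched as a skeleton: aggregate each patch into one embedding, let a language model turn the embedding history into a conditioning vector, then let a patch decoder produce the next patch. Everything below is illustrative. The mean aggregator, `toy_lm_step`, and `toy_decode_patch` are hypothetical stand-ins for the paper's aggregation, language model, and diffusion transformer, none of which are specified in the abstract.

```python
from typing import Callable, List

Frame = List[float]   # one continuous speech token
Patch = List[Frame]   # a fixed-size group of consecutive frames


def aggregate(patch: Patch) -> Frame:
    """Stand-in aggregator: average the frames of a patch."""
    dim = len(patch[0])
    return [sum(frame[d] for frame in patch) / len(patch) for d in range(dim)]


def generate(
    prefix: List[Patch],
    lm_step: Callable[[List[Frame]], Frame],
    decode_patch: Callable[[Frame], Patch],
    n_new: int,
) -> List[Patch]:
    """Autoregressively append n_new patches to a non-empty prefix."""
    patches = list(prefix)
    for _ in range(n_new):
        embeddings = [aggregate(p) for p in patches]  # one vector per patch
        condition = lm_step(embeddings)               # language-model output
        patches.append(decode_patch(condition))       # next patch from it
    return patches


def toy_lm_step(embeddings: List[Frame]) -> Frame:
    """Hypothetical language model: echo the latest patch embedding."""
    return embeddings[-1]


def toy_decode_patch(condition: Frame) -> Patch:
    """Hypothetical decoder: two identical frames offset by 1.0."""
    return [[v + 1.0 for v in condition] for _ in range(2)]


if __name__ == "__main__":
    prefix = [[[0.0], [2.0]]]  # one patch of two 1-D frames
    print(generate(prefix, toy_lm_step, toy_decode_patch, n_new=2))
```

Because the language model sees one embedding per patch instead of one per frame, its sequence length shrinks by the patch size, which is one plausible source of the reduced computational demand the abstract mentions.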

The introduction of DiTAR, or Diffusion Transformer Autoregressive Modeling, represents a significant advancement in the field of speech generation by integrating a language model with a diffusion transformer. This innovative framework addresses the computational challenges faced by previous autoregressive models, enhancing their efficiency for continuous speech token generation.

DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation
