Stratified Knowledge-Density Super-Network for Scalable Vision Transformers

arXiv — cs.LG · Tuesday, November 18, 2025 at 5:00:00 AM
  • A new method for optimizing vision transformer models has been introduced, transforming a pre-trained vision transformer into a stratified knowledge-density super-network from which sub-models of varying sizes can be extracted (see the sketch after this summary).
  • This development is significant as it addresses the high costs and inefficiencies associated with training multiple vision transformer models, enabling more scalable and effective deployment across various applications.
  • The advancement reflects a broader trend in AI toward model efficiency and adaptability, echoed in related work on dynamic parameter optimization and feature extraction that aims to improve performance under resource constraints.
— via World Pulse Now AI Editorial System
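
The summary gives no implementation details, so the following is only a minimal sketch of the general weight-sharing super-network idea it alludes to: one set of parameters is trained once, and smaller sub-models are carved out by slicing it. The class name (`ElasticLinear`) and the leading-slice scheme are illustrative assumptions, not the paper's actual method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticLinear(nn.Module):
    """Linear layer whose output width can be shrunk per forward pass.

    Every sub-network reuses the leading rows of one shared weight
    matrix, so a single set of parameters serves all model sizes.
    The leading-slice scheme is an assumption for illustration only.
    """
    def __init__(self, in_features: int, max_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(max_out))

    def forward(self, x: torch.Tensor, out_features: int) -> torch.Tensor:
        # Slice the shared parameters down to the requested width.
        return F.linear(x, self.weight[:out_features], self.bias[:out_features])

# Two "models" of different capacity extracted from the same weights:
layer = ElasticLinear(in_features=384, max_out=1536)
x = torch.randn(8, 197, 384)          # (batch, tokens, dim), ViT-S-sized input
full = layer(x, out_features=1536)    # full-capacity MLP expansion
small = layer(x, out_features=768)    # cheaper sub-network, shared weights
print(full.shape, small.shape)        # (8, 197, 1536) and (8, 197, 768)
```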


Recommended Readings
Vision Transformers with Self-Distilled Registers
Positive · Artificial Intelligence
Vision Transformers (ViTs) have become the leading architecture for visual processing tasks, showcasing remarkable scalability with larger training datasets and model sizes. However, recent findings have revealed the presence of artifact tokens in ViTs that conflict with local semantics, negatively impacting performance in tasks requiring precise localization and structural coherence. This paper introduces register tokens to mitigate this issue, proposing Post Hoc Registers (PH-Reg) as an efficient self-distillation method to integrate these tokens into existing ViTs without the need for retraining.
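
As a rough illustration of what register tokens are mechanically, the sketch below appends learnable registers to a ViT token sequence. It does not implement PH-Reg's self-distillation procedure; the class name, register count, and dimensions are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class RegisterTokens(nn.Module):
    """Appends learnable register tokens to a ViT token sequence.

    Registers give attention layers a dedicated place to store global
    scratch information, so it stops leaking into patch tokens as
    artifacts. How PH-Reg distills registers into a pre-trained ViT
    is not shown here; this only illustrates the token mechanics.
    """
    def __init__(self, num_registers: int, dim: int):
        super().__init__()
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, 1 + num_patches, dim) -- [CLS] plus patch tokens
        batch = tokens.shape[0]
        regs = self.registers.expand(batch, -1, -1)
        return torch.cat([tokens, regs], dim=1)

reg = RegisterTokens(num_registers=4, dim=384)
tokens = torch.randn(2, 197, 384)   # ViT-S/16 on 224x224: 1 CLS + 196 patches
print(reg(tokens).shape)            # torch.Size([2, 201, 384])
```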
Likelihood-guided Regularization in Attention Based Models
Positive · Artificial Intelligence
The paper introduces a novel likelihood-guided variational Ising-based regularization framework for Vision Transformers (ViTs), aimed at enhancing model generalization while dynamically pruning redundant parameters. This approach utilizes Bayesian sparsification techniques to impose structured sparsity on model weights, allowing for adaptive architecture search during training. Unlike traditional dropout methods, this framework learns task-adaptive regularization, improving efficiency and interpretability in classification tasks involving structured and high-dimensional data.
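
The summary does not spell out the variational Ising machinery, so the sketch below only illustrates the underlying ingredient: structured sparsity imposed through learnable stochastic gates over weight groups (here, attention heads), with a penalty on the expected number of open gates. The relaxed-Bernoulli gate is a generic stand-in, not the paper's likelihood-guided formulation; all names are assumptions.

```python
import torch
import torch.nn as nn

class GatedHeads(nn.Module):
    """Structured sparsity over attention heads via stochastic gates.

    Each head gets a learnable keep-logit; a relaxed Bernoulli sample
    scales the head's output during training, and the expected number
    of open gates is penalized alongside the task loss. This is a
    generic sparsification stand-in, not the paper's Ising framework.
    """
    def __init__(self, num_heads: int, temperature: float = 0.5):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_heads))
        self.temperature = temperature

    def forward(self, head_outputs: torch.Tensor) -> torch.Tensor:
        # head_outputs: (batch, num_heads, tokens, head_dim)
        if self.training:
            # Gumbel-sigmoid relaxation of a Bernoulli keep/drop gate.
            u = torch.rand_like(self.logits).clamp(1e-6, 1 - 1e-6)
            noise = torch.log(u) - torch.log1p(-u)
            gates = torch.sigmoid((self.logits + noise) / self.temperature)
        else:
            # Hard prune at inference: keep heads with positive logits.
            gates = (self.logits > 0).float()
        return head_outputs * gates.view(1, -1, 1, 1)

    def sparsity_penalty(self) -> torch.Tensor:
        # Expected fraction of heads kept; added to the task loss.
        return torch.sigmoid(self.logits).mean()

gate = GatedHeads(num_heads=6)
h = torch.randn(2, 6, 197, 64)
loss = gate(h).pow(2).mean() + 0.1 * gate.sparsity_penalty()
loss.backward()   # gradients flow to the gate logits via the relaxation
```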