arXiv:2511.12181v1 Announce Type: cross 
Abstract: Autoregressive (AR) approaches, which represent images as sequences of discrete tokens from a finite codebook, have achieved remarkable success in image generation. However, the quantization process and the limited codebook size inevitably discard fine-grained information, placing bottlenecks on fidelity. Motivated by this limitation, recent studies have explored autoregressive modeling in continuous latent spaces, which offers higher generation quality. Yet, unlike discrete tokens constrained by a fixed codebook, continuous representations lie in a vast and unstructured space, posing significant challenges for efficient autoregressive modeling. To address these challenges, we introduce MixAR, a novel framework with a factorized formulation that uses mixture training paradigms to inject discrete tokens as prior guidance for continuous AR modeling. We investigate several discrete-continuous mixture strategies, including self-attention (DC-SA), cross-attention (DC-CA), and a simple approach (DC-Mix) that replaces homogeneous mask tokens with informative discrete counterparts. Moreover, to bridge the gap between ground-truth training tokens and inference tokens produced by the pre-trained AR model, we propose Training-Inference Mixture (TI-Mix) to achieve consistent training and generation distributions. In our experiments, we demonstrate that the DC-Mix strategy strikes a favorable balance between computational efficiency and generation fidelity, and that TI-Mix yields consistent improvements.
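
To make the DC-Mix idea concrete, here is a minimal NumPy sketch of replacing a shared mask token with per-position discrete embeddings. It is an illustration under assumptions, not the paper's implementation: the function name build_inputs, the array shapes, and the use of a plain embedding lookup are all hypothetical. The discrete ids would come from ground-truth tokens during training or from a pre-trained AR model at inference, as the abstract describes.

```python
import numpy as np

def build_inputs(continuous_tokens, mask, discrete_ids, codebook, mask_token=None):
    """Fill masked positions of a continuous token sequence.

    continuous_tokens: (N, D) continuous latents
    mask:              (N,) bool, True where the token is hidden
    discrete_ids:      (N,) int codebook indices for every position
    codebook:          (K, D) embeddings of the discrete tokens
    mask_token:        optional (D,) shared vector (homogeneous baseline)
    """
    inputs = continuous_tokens.copy()
    if mask_token is not None:
        # Baseline: every masked position gets the same vector.
        inputs[mask] = mask_token
    else:
        # DC-Mix-style (assumed): each masked position gets the embedding
        # of its own discrete token, so the placeholder carries information.
        inputs[mask] = codebook[discrete_ids[mask]]
    return inputs

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))
codebook = rng.normal(size=(16, 4))
ids = rng.integers(0, 16, size=6)
mask = np.array([False, True, True, False, True, False])
mixed = build_inputs(tokens, mask, ids, codebook)
```

Visible positions keep their continuous latents, while hidden positions differ from one another instead of sharing a single placeholder.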

MixAR has been introduced as a new framework for improving image generation through autoregressive (AR) modeling. Traditional AR methods, which use discrete tokens from a limited codebook, often lose fine details because of the quantization process. Recent studies have turned to continuous latent spaces to improve quality, but these spaces pose significant challenges for efficient autoregressive modeling. MixAR addresses these challenges by incorporating discrete tokens as prior guidance, which makes continuous AR modeling easier and may lead to higher fidelity in generated images.

MixAR is a new framework introduced to improve image generation through autoregressive (AR) modeling. Traditional AR approaches, which use discrete tokens from a limited codebook, often lose fine details due to quantization. Recent advances have moved toward continuous latent spaces for better quality, but these spaces present challenges for efficient modeling. MixAR addresses these problems by integrating discrete tokens as prior guidance, facilitating better continuous AR modeling and potentially leading to higher fide…

MixAR is a new framework introduced to improve image generation by means of autoregressive (AR) modeling. Traditional AR approaches, which use discrete tokens from a limited codebook, often lose fine details because of quantization. Recent advances have moved toward continuous latent spaces for better quality, but these spaces pose challenges for efficient modeling. MixAR addresses these problems by integrating discrete tokens as prior guidance, thereby facilitating better continuous AR modeling and leading …

MixAR is a new framework introduced to enhance image generation through autoregressive (AR) modeling. Traditional AR approaches, which utilize discrete tokens from a limited codebook, often lose fine-grained details due to quantization. Recent advancements have shifted towards continuous latent spaces for improved quality, but these spaces present challenges for efficient modeling. MixAR addresses these issues by integrating discrete tokens as prior guidance, facilitating better continuous AR modeling and potentially leading to higher fidelity in generated images.

MixAR: Mixture Autoregressive Image Generation

The story of the Puppet Master, the main villain of Ghost in the Shell, hinted at a future in which governments use hackers for espionage, at a time when most of the world had never connected to the internet.

The classic anime 'Ghost in the Shell' presents the Puppet Master, a character who predicts a future in which governments use hackers for espionage. These predictions appeared at a time when most of the world's population had not yet connected to the internet, highlighting the show's vision of cybersecurity problems.

The classic anime 'Ghost in the Shell' features the Puppet Master, a character who anticipates a future in which governments use hackers for espionage. This prediction arose at a time when most of the world's population was not yet connected to the Internet, highlighting the show's foresight about cybersecurity problems.

The classic anime 'Ghost in the Shell' features the Puppet Master, a character who foreshadows a future in which governments use hackers for espionage. This prediction came at a time when the majority of the world's population was not yet connected to the Internet, underscoring the show's foresight regarding cybersecurity problems.

The classic anime 'Ghost in the Shell' features the Puppet Master, a character that foreshadows a future where governments utilize hackers for espionage. This prediction emerged at a time when the majority of the global population had yet to connect to the internet, highlighting the show's foresight regarding cybersecurity issues.

How the classic anime ‘Ghost in the Shell’ predicted the future of cybersecurity 30 years ago

<p>Text-to-image diffusion models have become the workhorses of generative imaging. They can paint photorealistic scenes, mimic art styles, and blend concepts in ways that were science fiction a few years ago. Yet they stumble embarrassingly on a skill that even small children master: basic spatial reasoning.</p>

<p>Ask a state-of-the-art model for “a dog to the right of a teddy bear” and you often get:</p>

<ul>
<li>The dog on the left</li>
<li>One of the objects missing</li>
<li>Or a bizarre hybrid where dog and teddy are fused into a single creature</li>
</ul>

<p><a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49rtb08366xdl284o4z0.jpg" class="article-body-image-wrapper"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49rtb08366xdl284o4z0.jpg" alt=" " width="800" height="532"></a></p>

<p>These failures become more severe for unusual compositions like “a giraffe above an airplane”. Traditional fixes range from expensive fine-tuning to brittle, hand-written loss functions at inference time—but both options come with significant downsides.</p>

<p>NVIDIA’s Learn-to-Steer framework (accepted to WACV 2026) proposes a different path: instead of hard-coding spatial rules or retraining the entire model, it learns a data-driven objective that can “steer” diffusion at inference time. The method reads the model’s own cross-attention maps, trains a lightweight classifier to detect spatial relations, and then uses that classifier’s gradient as a learned loss to nudge the generation towards layouts that match the prompt.</p>

<p>In this blog, we’ll unpack:</p>

<ul>
<li>What makes spatial reasoning so fragile in current diffusion models</li>
<li>How Learn-to-Steer learns spatial constraints from the model itself</li>
<li>How it steers images during generation without changing model weights</li>
<li>The top gains on spatial benchmarks like GenEval and T2I-CompBench</li>
<li>The trade-offs in compute cost and generality, and what this implies for future generative systems</li>
</ul>

<h1>
  
  
  Why Spatial Reasoning Fails in Text-to-Image Diffusion
</h1>

<h2>
  
  
  What Makes Spatial Relations So Difficult for Diffusion Models?
</h2>

<p>Modern diffusion models (e.g., Stable Diffusion, Flux) are excellent at what should appear in an image—objects, styles, textures—but much less reliable at where those objects should be.</p>

<p>Several factors contribute:</p>

<h3>
  
  
  Weak supervision of spatial language
</h3>

<ul>
<li>Training data rarely comes with precise annotations like “object A is left of object B”.
</li>
<li>Captions often describe content loosely, so phrases like “on top of” or “to the right of” are under-specified.</li>
</ul>

<h3>
  
  
  Entangled visual concepts
</h3>

<ul>
<li>When two objects frequently co-occur, models may treat them as a single visual blob.</li>
<li>This leads to object fusion, where a “cat on a bookshelf” becomes a cat-bookshelf chimera.</li>
</ul>

<h3>
  
  
  Benchmark saturation without spatial coverage
</h3>

<ul>
<li>Many standard text-to-image benchmarks emphasize realism and style, not relational accuracy.</li>
<li>Models can score highly while still being spatially confused.</li>
</ul>

<p>Empirical studies confirm three recurring failure modes on spatial benchmarks:</p>

<ul>
<li>Incorrect placement: Objects appear in the wrong relative position.</li>
<li>Missing entities: One or more requested objects never appear.</li>
<li>Merged entities: Two objects get mashed into a single, incoherent form.</li>
</ul>

<p>The model “knows” the objects you asked for, but it doesn’t reliably understand where to place them.</p>

<h1>
  
  
  Why Fine-Tuning and Handcrafted Losses Are Not Enough
</h1>

<p>Two broad strategies have tried to patch this gap:</p>

<h2>
  
  
  Fine-tuning for spatial awareness
</h2>

<ul>
<li>Retrain the diffusion model on datasets with explicit layouts or spatial annotations.</li>
<li>Methods like COMPASS show that this can significantly improve spatial accuracy.</li>
<li>But this comes at a cost: expensive retraining, sensitivity to dataset bias, and often regressions in other capabilities such as color fidelity or counting.</li>
</ul>

<h2>
  
  
  Handcrafted test-time losses
</h2>

<ul>
<li>At inference, inject extra loss terms that penalize spatial errors (e.g., overlapping activation maps, incorrect ordering).</li>
<li>These losses must be manually designed to approximate relations like “left of” or “above”.</li>
<li>In practice, these heuristics are fragile, often over-fitting simple cases and failing on more complex layouts.</li>
</ul>

<p>In short, we’ve lacked a solution that is:</p>

<ul>
<li>Data-driven rather than rule-based</li>
<li>Plug-and-play at inference time (no full retraining)</li>
<li>Targeted enough to improve spatial reasoning without damaging other strengths</li>
</ul>

<p>This is where Learn-to-Steer enters.</p>

<h1>
  
  
  How Learn-to-Steer Works: Data-Driven Steering at Inference
</h1>

<h2>
  
  
  How Cross-Attention Maps Provide a Spatial Signal
</h2>

<p>During diffusion, at each denoising step, the model computes cross-attention maps that connect text tokens to image regions. For a prompt like “a dog to the right of a teddy bear”, you can think of:</p>

<ul>
<li>One set of attention maps for “dog”</li>
<li>Another set for “teddy bear”</li>
<li>Additional context around words like “right” or “of”</li>
</ul>

<p>These maps form a rich, high-dimensional signal describing where in the image the model currently believes each word should manifest. Prior work has used cross-attention to locate objects or edit images; Learn-to-Steer goes further by treating them as a feature space in which spatial relations can be learned.</p>
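
<p>As a rough illustration of where such maps come from, the sketch below computes per-token maps with standard scaled dot-product attention in NumPy. The function name, shapes, and random inputs are assumptions for illustration; real models compute this inside every attention layer and head, and Learn-to-Steer's exact aggregation over layers is not specified here.</p>

```python
import numpy as np

def cross_attention_maps(image_queries, text_keys, height, width):
    """Return one spatial map per text token.

    image_queries: (height * width, d) queries from image positions
    text_keys:     (num_tokens, d) keys from the text encoder
    """
    d = image_queries.shape[1]
    scores = image_queries @ text_keys.T / np.sqrt(d)   # (H*W, T)
    scores = scores - scores.max(axis=1, keepdims=True) # numerical stability
    weights = np.exp(scores)
    # Softmax over text tokens: each image position distributes its
    # attention across the words of the prompt.
    weights /= weights.sum(axis=1, keepdims=True)
    # Column t, reshaped to the image grid, is the map for token t.
    return weights.T.reshape(text_keys.shape[0], height, width)

rng = np.random.default_rng(0)
H, W, D, T = 8, 8, 16, 5
maps = cross_attention_maps(rng.normal(size=(H * W, D)),
                            rng.normal(size=(T, D)), H, W)
```

<p>Picking out the maps at the token indices for "dog" and "teddy bear" gives the two inputs a relation classifier would consume.</p>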

<h2>
  
  
  How a Relation Classifier Becomes a Learned Loss
</h2>

<p>The core idea of Learn-to-Steer is to train a small relation classifier that takes cross-attention maps for two objects and predicts the spatial relation between them (left-of, right-of, above, below, etc.).</p>

<p>The pipeline looks like this:</p>

<h3>
  
  
  Collect supervision
</h3>

<ul>
<li>Use images where the true relation between object A and object B is known (from datasets like GQA and synthetic layouts).</li>
<li>For each image, invert it through the diffusion model with a descriptive prompt to recover cross-attention maps for the relevant tokens.</li>
</ul>

<p><a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9dbjsdc4c8yjz2r88k4.jpg" class="article-body-image-wrapper"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9dbjsdc4c8yjz2r88k4.jpg" alt=" " width="800" height="446"></a></p>

<h3>
  
  
  Train a classifier on attention patterns
</h3>

<ul>
<li>Input: attention maps for object A and object B.</li>
<li>Output: predicted relation (e.g., “A is left of B”).</li>
</ul>

<p>Naively, however, this leads to a subtle but serious issue: relation leakage.</p>

<h2>
  
  
  How Dual Inversion Solves the “Relation Leakage” Problem
</h2>

<p>If you always invert images with a correct prompt (e.g., “a dog to the left of a cat”), hints about the word “left” can leak into the attention patterns. A naïve classifier might then “cheat” by reading out linguistic artefacts instead of learning genuine visual geometry.</p>

<p>To prevent this, Learn-to-Steer uses a dual inversion strategy:</p>

<ul>
<li>For each image with a true relation (say, dog left of cat), create two prompts:

<ul>
<li>A positive prompt with the correct relation (“dog to the left of a cat”).</li>
<li>A negative prompt with an incorrect relation (“dog above a cat”).</li>
</ul>


</li>

<li>Run inversion with both prompts, obtaining two sets of attention maps.</li>

<li>Label both sets with the true relation (left-of), because that is what the image actually depicts.</li>

</ul>

<p>The classifier sees pairs of attention maps that share the same underlying geometry but differ in the relation words used in the prompt. To succeed, it must ignore the unreliable linguistic cue and zero in on the geometric evidence in the attention patterns. This breaks the leakage shortcut and yields a classifier that actually understands “left-of” in terms of where things appear in the model’s internal vision.</p>
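
<p>The labeling rule can be sketched in a few lines of Python. This is a minimal sketch, assuming a hypothetical helper name and prompt template; the actual inversion step, which would turn each prompt into attention maps, is omitted.</p>

```python
import random

# Hypothetical relation vocabulary and prompt phrasing.
PHRASES = {
    "left-of": "to the left of",
    "right-of": "to the right of",
    "above": "above",
    "below": "below",
}

def dual_inversion_pairs(subject, obj, true_relation, rng):
    """Build one positive and one negative prompt for the same image.

    Both examples carry the TRUE relation as their label, so the prompt
    wording is useless as a shortcut for the classifier.
    """
    wrong_relation = rng.choice([r for r in PHRASES if r != true_relation])
    examples = []
    for prompt_relation in (true_relation, wrong_relation):
        prompt = f"a {subject} {PHRASES[prompt_relation]} a {obj}"
        examples.append({"prompt": prompt, "label": true_relation})
    return examples

pairs = dual_inversion_pairs("dog", "cat", "left-of", random.Random(0))
```

<p>Each prompt in <code>pairs</code> would then be used to invert the same image, producing two attention-map sets with identical labels.</p>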

<p>To improve robustness, NVIDIA combines:</p>

<ul>
<li>Real images (complex, natural scenes)</li>
<li>Synthetic images (simpler, cleaner attention patterns akin to generation scenarios)</li>
</ul>

<h1>
  
  
  How Learn-to-Steer Guides Images During Generation
</h1>

<h2>
  
  
  Step-by-Step: From Prompt to Steered Latent
</h2>

<p>Once the relation classifier is trained, Learn-to-Steer uses it at inference time as a learned objective:</p>

<h3>
  
  
  Parse the spatial prompt
</h3>

<ul>
<li>Extract subject, relation, and object from the text (e.g., subject = dog, relation = right-of, object = teddy bear).</li>
</ul>

<h3>
  
  
  Run diffusion as usual—but with checkpoints
</h3>

<ul>
<li>As the model denoises latent noise into an image, periodically extract cross-attention maps for the subject and object tokens.</li>
</ul>

<h3>
  
  
  Evaluate spatial correctness
</h3>

<ul>
<li>Feed these maps into the relation classifier, which outputs a probability distribution over relations.</li>
<li>Compare this distribution to the desired relation from the prompt, and compute a loss (e.g., cross-entropy).</li>
</ul>

<h3>
  
  
  Backpropagate into the latent
</h3>

<ul>
<li>Compute the gradient of this loss with respect to the latent representation at that timestep.</li>
<li>Nudge the latent in the direction that increases the classifier’s confidence in the correct relation.</li>
</ul>

<h3>
  
  
  Continue the diffusion process
</h3>

<ul>
<li>Let the denoising proceed from the adjusted latent.</li>
<li>Repeat this steering a number of times (often during the earlier half of the diffusion steps).</li>
</ul>
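
<p>The steps above can be sketched as a toy loop in NumPy. Everything here is a stand-in chosen for illustration: the "latent" is just the horizontal centers of two objects, the attention maps are 1D Gaussian bumps, the classifier is a fixed centroid comparison rather than a trained network, and gradients are taken by finite differences instead of backpropagation.</p>

```python
import numpy as np

RELATIONS = ["left-of", "right-of"]

def attention_map(center, width=32, sigma=0.08):
    # Toy stand-in for a cross-attention map: a normalized 1D bump.
    xs = np.linspace(0.0, 1.0, width)
    m = np.exp(-0.5 * ((xs - center) / sigma) ** 2)
    return m / m.sum()

def classifier_probs(map_a, map_b, sharpness=10.0):
    # Toy relation classifier: compares the centroids of the two maps.
    xs = np.linspace(0.0, 1.0, map_a.size)
    ca, cb = (xs * map_a).sum(), (xs * map_b).sum()
    logits = sharpness * np.array([cb - ca, ca - cb])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def steering_loss(latent, target):
    # Cross-entropy between the classifier output and the desired relation.
    probs = classifier_probs(attention_map(latent[0]), attention_map(latent[1]))
    return -np.log(probs[RELATIONS.index(target)] + 1e-12)

def numerical_grad(f, x, eps=1e-4):
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

def steer(latent, target, steps=50, steer_fraction=0.5, lr=0.01):
    z = latent.astype(float).copy()
    for t in range(steps):
        if t < steps * steer_fraction:  # steer only in the earlier steps
            z -= lr * numerical_grad(lambda v: steering_loss(v, target), z)
            z = np.clip(z, 0.1, 0.9)
        # A real pipeline would run one denoising step on z here.
    return z

start = np.array([0.3, 0.7])      # object A starts left of object B
final = steer(start, "right-of")  # the prompt asks for A right of B
```

<p>In this toy setting the update moves object A past object B and then settles, which mirrors the idea of nudging the latent until the classifier agrees with the prompt.</p>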

<h2>
  
  
  Support for Multiple Architectures and Relations
</h2>

<p><a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F578a9bjc7gmtemh0jbsj.jpg" class="article-body-image-wrapper"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F578a9bjc7gmtemh0jbsj.jpg" alt=" " width="800" height="477"></a></p>

<p>A key advantage of Learn-to-Steer is that it’s architecture-agnostic:</p>

<ul>
<li>It has been demonstrated on both UNet-based models (like Stable Diffusion 1.4/2.1) and MMDiT-style models (like Flux).</li>
<li>The only requirement is access to a text-image alignment signal (cross-attention or similar).</li>
</ul>

<p>It can also handle prompts with multiple constraints, such as:</p>

<p>“A frog above a sneaker below a teapot.”</p>

<p>Here, Learn-to-Steer alternates between the relations across timesteps:</p>

<ul>
<li>At one timestep, optimize the frog–sneaker relation.</li>
<li>At another, optimize the sneaker–teapot relation.</li>
</ul>
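
<p>One simple way to picture this is a parser that splits the prompt into (subject, relation, object) triples, followed by a round-robin schedule over timesteps. The regex-based parser, the relation vocabulary, and the strict round-robin order are assumptions made for this sketch; the source does not specify how prompts are parsed or how relations are scheduled.</p>

```python
import re
from itertools import cycle, islice

# Hypothetical mapping from prompt phrases to relation labels.
RELATION_PHRASES = {
    "to the left of": "left-of",
    "to the right of": "right-of",
    "above": "above",
    "below": "below",
}

def parse_relations(prompt):
    """Split a prompt into chained (subject, relation, object) triples."""
    text = prompt.lower().strip().rstrip(".")
    pattern = "|".join(
        re.escape(p) for p in sorted(RELATION_PHRASES, key=len, reverse=True)
    )
    # The capture group keeps the relation phrases in the split result:
    # [entity, phrase, entity, phrase, entity, ...]
    parts = re.split(rf"\s+({pattern})\s+", text)
    entities = [re.sub(r"^(a|an|the)\s+", "", p) for p in parts[0::2]]
    phrases = parts[1::2]
    return [
        (entities[i], RELATION_PHRASES[phrases[i]], entities[i + 1])
        for i in range(len(phrases))
    ]

def relation_schedule(triples, num_steps):
    """Assign one relation to each steering step, round-robin."""
    return list(islice(cycle(triples), num_steps))

triples = parse_relations("A frog above a sneaker below a teapot.")
schedule = relation_schedule(triples, 4)
```

<p>With two relations and four steering steps, the schedule visits frog and sneaker, then sneaker and teapot, and repeats.</p>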

NVIDIA's Learn-to-Steer aims to address a major limitation of text-to-image diffusion models, which are weak at basic spatial reasoning. These models can create photorealistic images, but they often misplace objects, such as putting a dog on the left instead of the right beside a teddy bear. This step aims to improve the accuracy of generated images by strengthening spatial understanding.

NVIDIA's Learn-to-Steer seeks to address a significant limitation in text-to-image diffusion models, which struggle with basic spatial reasoning. These models can create photorealistic images, but they often misplace objects relative to one another, such as putting a dog to the left of a teddy bear instead of to the right. This advance aims to improve the accuracy of generated images by improving spatial understanding.

NVIDIA's Learn-to-Steer aims to resolve an important limitation of text-to-image diffusion models, which struggle with basic spatial reasoning. These models can create photorealistic images but often misplace objects relative to one another, such as putting a dog to the left of a teddy bear instead of to the right. This advance aims to improve the accuracy of generated images by strengthening spatial understanding.

NVIDIA's Learn-to-Steer is set to address a significant limitation in text-to-image diffusion models, which struggle with basic spatial reasoning. These models can create photorealistic images but often misplace objects in relation to one another, such as placing a dog to the left of a teddy bear instead of the right. This advancement aims to enhance the accuracy of generated images by improving spatial understanding.

What Is Learn-to-Steer? NVIDIA’s 2025 Spatial Fix for Text-to-Image Diffusion

<p>The $5tn firm handily beat expectations, but analysts are awaiting projections for future demand for the firm’s AI chips</p><p>Nvidia shares are rising in after-market trading after the company posted third-quarter earnings that beat Wall Street estimates. All eyes were on Nvidia, the bellwether for the AI industry and the most valuable publicly traded company in the world, as analysts and investors hoped the chipmaker’s third-quarter earnings would assuage concerns about whether the high-flying valuations of AI firms have peaked.</p><p>“Blackwell sales are off the charts, and cloud GPUs are sold out,” said Jensen Huang, founder and CEO of Nvidia, in a press release. “Compute demand keeps accelerating and compounding across training and inference – each growing exponentially. We’ve entered the virtuous cycle of AI. The AI ecosystem is scaling fast – with more new foundation model makers, more AI startups, across more industries, and in more countries. AI is going everywhere, doing everything, all at once.”</p> <a href="https://www.theguardian.com/technology/2025/nov/19/nvidia-earning-report">Continue reading...</a>

Nvidia beat Wall Street expectations with its third-quarter results, showing strong demand for its AI chips. The company's shares rose in after-market trading, reflecting investor confidence amid concerns about the valuation of the AI market. CEO Jensen Huang highlighted record sales and a rapidly expanding AI ecosystem, pointing to a positive outlook for the company's future.

Nvidia exceeded Wall Street expectations with its third-quarter earnings, showing strong demand for its AI chips. The company's shares rose in after-hours trading, reflecting investor confidence amid concerns about the valuation of the AI market. CEO Jensen Huang highlighted record sales and a rapidly expanding AI ecosystem, indicating a positive outlook for the company's future.

Nvidia exceeded Wall Street expectations with its third-quarter results, showing strong demand for its AI chips. The company's shares rose in after-hours trading, reflecting investor confidence in the face of concerns about the valuation of the AI market. CEO Jensen Huang highlighted record sales and a fast-growing AI ecosystem, indicating a positive outlook for the company's future.

Nvidia exceeded Wall Street expectations with its third-quarter earnings, showcasing strong demand for its AI chips. The company's shares rose in after-market trading, reflecting investor confidence amid concerns about the AI market's valuation. CEO Jensen Huang highlighted record sales and a rapidly expanding AI ecosystem, indicating a positive outlook for the company's future.

‘AI is going everywhere, doing everything:’ Nvidia beats Wall Street estimates amid market selloff and AI bubble fears

The SanDisk ExtremeFit USB-C flash drive weighs barely three grams, yet it offers 1TB of external storage and impressive speeds.

I refused to believe this coin-sized gadget was a storage drive, until I tried it for myself

<p>Swift 6.3 is bringing significant enhancements to Embedded Swift, the subset of Swift designed for resource-constrained environments like microcontrollers. Here's what's new:</p>

<h2>
  
  
  Key Improvements
</h2>

<h3>
  
  
  Libraries &amp; Standard Library
</h3>

<ul>
<li>
<strong>Floating-point printing</strong>: The <code>description</code> and <code>debugDescription</code> properties now work for Float, Double, and other floating-point types with a new all-Swift implementation</li>
<li>
<strong>Better diagnostics</strong>: New <code>EmbeddedRestrictions</code> diagnostic group warns about unsupported language constructs</li>
<li>
<strong>Swift MMIO 0.1.x</strong>: Includes code generation from SVD files and improved debugging with SVD2LLDB plugin</li>
</ul>

<h3>
  
  
  C Interoperability
</h3>

<ul>
<li>
<strong><code>@c</code> attribute</strong>: Define C-compatible functions and enums (from SE-0495)
</li>
</ul>

<div class="highlight js-code-highlight">
<pre class="highlight swift"><code><span class="kd">@c</span><span class="p">(</span><span class="kt">MyLib_initialize</span><span class="p">)</span>
<span class="kd">public</span> <span class="kd">func</span> <span class="nf">initialize</span><span class="p">()</span> <span class="p">{</span> <span class="o">...</span> <span class="p">}</span>
</code></pre>

</div>



<ul>
<li>
<strong>Improved type matching</strong>: Better tolerance for mismatching C signatures, eliminating cryptic deserialization errors</li>
</ul>

<h3>
  
  
  Debugging
</h3>

<ul>
<li>
<strong>Enhanced LLDB support</strong>: Better value printing for Embedded Swift types</li>
<li>
<strong>Core dump inspection</strong>: Dictionary, Array, and other common types now inspectable without a live process</li>
<li>
<strong>ARMv7m exception unwinding</strong>: Complete backtraces through exception frames</li>
</ul>

<h3>
  
  
  Linking &amp; Compilation
</h3>

<ul>
<li>
<strong><code>@section</code> and <code>@used</code> attributes</strong>: Control where globals are emitted and ensure symbols aren't stripped (SE-0492)</li>
<li>
<strong>Weak symbol definitions</strong>: Fixes duplicate symbol errors in diamond dependencies</li>
<li>
<strong><code>@export</code> attribute</strong>: Better control over function visibility (SE-0497)</li>
</ul>




<p><em>Want to dive deeper? Read the <a href="https://www.swift.org/blog/embedded-swift-improvements-coming-in-swift-6.3/" rel="noopener noreferrer">full announcement</a> on Swift.org</em></p>

Swift 6.3 brings significant improvements to Embedded Swift, strengthening its functionality in resource-constrained environments such as microcontrollers. Key improvements include new floating-point printing capabilities, better diagnostics with the EmbeddedRestrictions group, and the introduction of Swift MMIO 0.1.x for code generation and debugging.

Swift 6.3 introduces significant improvements to Embedded Swift, increasing its functionality for resource-constrained environments such as microcontrollers. Key improvements include new floating-point printing capabilities, better diagnostics with the EmbeddedRestrictions group, and the introduction of Swift MMIO 0.1.x for code generation and debugging.

Swift 6.3 brings significant improvements to Embedded Swift, strengthening its functionality for resource-constrained environments such as microcontrollers. The main improvements include new floating-point printing capabilities, better diagnostics with the EmbeddedRestrictions group, and the introduction of Swift MMIO 0.1.x for code generation and debugging.

Swift 6.3 introduces significant upgrades to Embedded Swift, enhancing its functionality for resource-constrained environments like microcontrollers. Key improvements include new floating-point printing capabilities, better diagnostics with the EmbeddedRestrictions group, and the introduction of Swift MMIO 0.1.x for code generation and debugging.

Embedded Swift Gets Major Upgrades in Swift 6.3

<a href="https://www.techspot.com/news/110317-judge-dismisses-lawsuit-twice-due-alleged-deepfake-video.html" target="_blank"><img src="https://www.techspot.com/images2/news/ts3_thumbs/2025/11/2025-11-19-ts3_thumbs-252.jpg" width="800" height="560" style="padding: 15px 0" title="Judge dismisses lawsuit twice due to alleged deepfake video testimony" /></a><br />A California housing dispute is getting media attention over allegations that lawyers presented a deepfake video as witness testimony. NBC News reports that Judge Victoria Kolakowski became suspicious after the supposed witness showed signs that something was not right, including a monotone voice, fuzzy facial features, and repeated facial expressions....<br /><br /><a href="https://www.techspot.com/news/110317-judge-dismisses-lawsuit-twice-due-alleged-deepfake-video.html">Read Entire Article</a><br /><br />

A housing dispute in California is drawing media attention after allegations emerged that lawyers presented a fake video as witness evidence. Judge Victoria Kolakowski expressed doubts about the video, pointing to the witness's monotone voice, unclear facial features, and repeated expressions. This led to the lawsuit being dismissed twice.

A housing dispute in California has drawn media attention following allegations that lawyers presented a deepfake video as testimony. Judge Victoria Kolakowski expressed skepticism about the video, noting the witness's monotone voice, blurry facial features, and repetitive expressions. This led to the dismissal of the lawsuit on two occasions.

A housing dispute in California is drawing media attention after allegations that lawyers presented a deepfake video as testimony. Judge Victoria Kolakowski expressed doubts about the video, noting the witness's monotone voice, blurry facial features, and repetitive expressions. This led to the dismissal of the suit on two occasions.

A California housing dispute has drawn attention after allegations surfaced that lawyers presented a deepfake video as witness testimony. Judge Victoria Kolakowski expressed skepticism about the video, noting the witness's monotone voice, unclear facial features, and repetitive expressions. This led to the dismissal of the lawsuit on two occasions.

Judge dismisses lawsuit twice due to alleged deepfake video testimony

Virtual keyboards are a frequent source of frustration for augmented reality (AR) users. The virtual surfaces are slow and error prone, and raising an arm to type on them can cause muscle strain known as "gorilla arm."

Virtual keyboards in augmented reality (AR) are a frequent source of frustration for users because they are slow and error-prone. Users suffer discomfort, known as 'gorilla arm', from raising their arms to type on these virtual surfaces.

Virtual keyboards in augmented reality (AR) often frustrate users because of their slow response and high error rate. Users experience discomfort, commonly known as 'gorilla arm', when raising their arms to type on these virtual surfaces.

Virtual keyboards in augmented reality (AR) often frustrate users because of their slowness and high error rate. Users feel discomfort, commonly called 'gorilla arm', when raising their arms to type on these virtual surfaces.

Virtual keyboards in augmented reality (AR) often frustrate users due to their slow response and high error rates. Users experience discomfort, commonly referred to as 'gorilla arm,' from raising their arms to type on these virtual surfaces.

New augmented reality tech can turn any surface into keyboard

arXiv:2511.13714v1 Announce Type: cross 
Abstract: The Segment Anything Model (SAM) family has become a widely adopted vision foundation model, but its ability to control segmentation granularity remains limited. Users often need to refine results manually - by adding more prompts or selecting from pre-generated masks - to achieve the desired level of detail. This process can be ambiguous, as the same prompt may correspond to several plausible masks, and collecting dense annotations across all granularities is prohibitively expensive, making supervised solutions infeasible. To address this limitation, we introduce UnSAMv2, which enables segment anything at any granularity without human annotations. UnSAMv2 extends the divide-and-conquer strategy of UnSAM by discovering abundant mask-granularity pairs and introducing a novel granularity control embedding that enables precise, continuous control over segmentation scale. Remarkably, with only $6$K unlabeled images and $0.02\%$ additional parameters, UnSAMv2 substantially enhances SAM-2, achieving segment anything at any granularity across interactive, whole-image, and video segmentation tasks. Evaluated on over $11$ benchmarks, UnSAMv2 improves $\text{NoC}_{90}$ (5.69 $\rightarrow$ 4.75), 1-IoU (58.0 $\rightarrow$ 73.1), and $\text{AR}_{1000}$ (49.6 $\rightarrow$ 68.3), showing that small amounts of unlabeled data with a granularity-aware self-supervised learning method can unlock the potential of vision foundation models.

أصبح نموذج Segment Anything (SAM) شائعًا كنموذج أساسي للرؤية، لكنه يواجه صعوبة في التحكم في دقة التقسيم، مما يتطلب غالبًا من المستخدمين إجراء تعديلات يدوية. لمعالجة هذه المشكلة، تم تقديم UnSAMv2، الذي يمكّن من تقسيم أي شيء على أي دقة دون الحاجة إلى تعليقات بشرية. يوسع هذا النموذج استراتيجية التقسيم والفوز الخاصة بسابقه، UnSAM، من خلال اكتشاف أزواج وفيرة من الأقنعة والدقة وإدخال تضمين جديد للتحكم في الدقة يمكّن من التحكم الدقيق والمستمر في مقياس التقسيم. يظهر النموذج فعاليته باستخدام 6000 صورة غير مصنفة فقط.

El modelo Segment Anything (SAM) ha ganado popularidad como un modelo base de visión, pero su capacidad para controlar la granularidad de la segmentación es limitada, lo que a menudo requiere que los usuarios realicen ajustes manuales. Para abordar este desafío, se ha introducido UnSAMv2, que permite segmentar cualquier cosa a cualquier granularidad sin anotaciones humanas. Este modelo amplía la estrategia de dividir y conquistar de su predecesor, UnSAM, al descubrir numerosas parejas de máscaras y granularidades e introducir un nuevo embebido de control de granularidad para un manejo preciso …

The Segment Anything Model (SAM) has become popular as a vision foundation model, but it has difficulty controlling segmentation granularity, often requiring manual refinement by users. To overcome this challenge, UnSAMv2 was introduced, enabling segmentation at any granularity without human annotations. This model builds on the divide-and-conquer strategy of its predecessor, UnSAM, by identifying numerous mask-granularity pairs and implementing a new granularity control encoding for …

The Segment Anything Model (SAM) has gained popularity as a vision foundation model, but it struggles with controlling segmentation granularity, often requiring manual refinement by users. To overcome this challenge, UnSAMv2 has been introduced, allowing segmentation at any granularity without human annotations. This model builds on the divide-and-conquer strategy of its predecessor, UnSAM, by identifying numerous mask-granularity pairs and implementing a new granularity control embedding for precise segmentation scale management. The model demonstrates effectiveness with only 6,000 unlabeled …

UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity

arXiv:2511.14268v1 Announce Type: cross 
Abstract: Heterogeneous porous materials play a crucial role in various engineering systems. Microstructure characterization and reconstruction provide effective means for modeling these materials, which are critical for conducting physical property simulations, studying structure-property linkages, and enhancing material performance across different applications. To achieve superior controllability and applicability with small sample sizes, we propose a statistically controllable microstructure reconstruction framework that integrates neural networks with the sliced-Wasserstein metric. Specifically, our approach leverages local pattern distribution for microstructure characterization and employs a controlled sampling strategy to generate target distributions that satisfy given conditional parameters. A neural network-based model establishes the mapping from the input distribution to the target local pattern distribution, enabling microstructure reconstruction. Combining the sliced-Wasserstein metric with gradient optimization techniques minimizes the distance between these distributions, leading to a stable and reliable model. Our method can perform stochastic and controllable reconstruction tasks even with small sample sizes. Additionally, it can generate large-size (e.g., 512 and 1024) 3D microstructures using a chunking strategy. By introducing spatial location masks, our method excels at generating spatially heterogeneous and complex microstructures. We conducted experiments on stochastic reconstruction, controllable reconstruction, heterogeneous reconstruction, and large-size microstructure reconstruction across various materials. Comparative analysis through visualization, statistical measures, and physical property simulations demonstrates the effectiveness of the method, providing new insights and possibilities for research on structure-property linkage and material inverse design.

A new framework has been proposed for reconstructing the microstructure of heterogeneous porous materials, in which neural networks are integrated with the sliced-Wasserstein metric. This method strengthens microstructure characterization and reconstruction, both of which are essential for modeling these materials in engineering applications. By using local pattern distribution and a controlled sampling strategy, the framework aims to improve controllability and applicability in microstructure reconstruction, even with small sample sizes.

A new framework has been proposed for reconstructing the microstructure of heterogeneous porous materials, integrating neural networks with the sliced-Wasserstein metric. This approach improves microstructure characterization and reconstruction, which are essential for modeling materials in engineering applications. By using local pattern distribution and a controlled sampling strategy, the framework seeks to improve the controllability and applicability of microstructure reconstruction, even with small sample sizes.

A new framework for reconstructing the microstructure of heterogeneous porous materials has been proposed, integrating neural networks with the sliced-Wasserstein metric. This approach improves microstructure characterization and reconstruction, which are essential for modeling materials in engineering applications. By using local pattern distribution and a controlled sampling strategy, the framework aims to improve the controllability and applicability of microstructure reconstruction, even with small sample sizes.

A new framework for reconstructing the microstructure of heterogeneous porous materials has been proposed, integrating neural networks with the sliced-Wasserstein metric. This approach enhances microstructure characterization and reconstruction, which are essential for modeling materials in engineering applications. By utilizing local pattern distribution and a controlled sampling strategy, the framework aims to improve the controllability and applicability of microstructure reconstruction, even with small sample sizes.

Statistically controllable microstructure reconstruction framework for heterogeneous materials using sliced-Wasserstein metric and neural networks
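
The abstract above relies on the sliced-Wasserstein metric to compare distributions. As a rough illustration of that metric only (not the paper's neural network, its local pattern features, or its training loop), the following minimal sketch estimates the sliced 2-Wasserstein distance between two equal-size point sets by projecting them onto random directions and matching sorted projections. The function names, the number of projections, and the seed are assumptions made for this example.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sliced_wasserstein(xs, ys, n_projections=64, seed=0):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance.

    xs, ys: lists of points (each a list of floats) with the same length
    and dimension. Hypothetical helper written for illustration.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-size samples")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        theta = random_unit_vector(dim, rng)
        # Project every point onto the direction theta.
        px = sorted(sum(a * b for a, b in zip(p, theta)) for p in xs)
        py = sorted(sum(a * b for a, b in zip(p, theta)) for p in ys)
        # In 1D, optimal transport between equal-size samples pairs sorted values.
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return math.sqrt(total / n_projections)


if __name__ == "__main__":
    rng = random.Random(1)
    cloud = [[rng.random(), rng.random(), rng.random()] for _ in range(200)]
    shifted = [[x + 0.5, y, z] for x, y, z in cloud]
    print("distance to itself:", sliced_wasserstein(cloud, cloud))
    print("distance to shifted copy:", sliced_wasserstein(cloud, shifted))
```

Sorting works here because one-dimensional optimal transport has a closed form; this is what makes the sliced variant cheap enough to use inside a gradient-based optimization, as the abstract describes at a high level.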

arXiv:2408.00540v4 Announce Type: replace-cross 
Abstract: Artificial Intelligence (AI) is being incorporated into several optimization, scheduling, and orchestration functions, as well as into native communication network functions. This paradigm shift results in increased energy consumption; however, quantifying the end-to-end energy consumption of adding intelligence to communication systems remains an open challenge, since conventional energy consumption metrics focus on either communication, computation infrastructure, or model development. To address this, we propose a new metric, the Energy Cost of AI Lifecycle (eCAL) of an AI model in a system. eCAL captures the energy consumption throughout the development, deployment, and utilization of an AI model providing intelligence in a communication network by (i) analyzing the complexity of data collection and manipulation in individual components and (ii) deriving overall and per-bit energy consumption. We show that as a trained AI model is used more frequently for inference, its energy cost per inference decreases, since the fixed training energy is amortized over a growing number of inferences. For a simple case study, we show that eCAL for 100 inferences is 2.73 times higher than for 1000 inferences. Additionally, we have developed a modular and extendable open-source simulation tool to enable researchers, practitioners, and engineers to calculate the end-to-end energy cost with various configurations and across various systems, ensuring adaptability to diverse use cases.

The article addresses the integration of artificial intelligence (AI) into communication networks, pointing to the increase in energy consumption associated with this shift. It presents a new metric called the Energy Cost of AI Lifecycle (eCAL), which measures the energy used during the development, deployment, and use of AI models in communication systems. The study stresses the need for a comprehensive understanding of energy consumption metrics, which traditionally focus on communication, computing infrastructure, or model development.

The article addresses the integration of artificial intelligence (AI) into communication networks, highlighting the increase in energy consumption associated with this change. It presents a new metric called the Energy Cost of AI Lifecycle (eCAL), which quantifies the energy used during the development, deployment, and utilization of AI models in communication systems. The study emphasizes the need for a comprehensive understanding of energy consumption metrics, which traditionally focus on communication, computing infrastructure, or model development.

The article deals with the integration of artificial intelligence (AI) into communication networks, underlining the increase in energy consumption associated with this change. It presents a new metric called the Energy Cost of AI Lifecycle (eCAL), which quantifies the energy used during the development, deployment, and use of AI models in communication systems. The study highlights the need for a comprehensive understanding of energy consumption metrics, which traditionally focus on communication, infrastructure…

The article discusses the integration of Artificial Intelligence (AI) into communication networks, highlighting the increased energy consumption associated with this shift. It presents a new metric called the Energy Cost of AI Lifecycle (eCAL), which quantifies the energy used during the development, deployment, and utilization of AI models in communication systems. The study emphasizes the need for a comprehensive understanding of energy consumption metrics, which traditionally focus on communication, computation infrastructure, or model development.

The Energy Cost of Artificial Intelligence Lifecycle in Communication Networks
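
The amortization argument in the abstract above (a fixed training cost spread over a growing number of inferences) can be illustrated with a deliberately simplified two-term model. This is a sketch under stated assumptions, not the eCAL metric itself, which the abstract says is a per-bit quantity covering data collection, deployment, and more. The function name and the energy values below are hypothetical and are not taken from the paper, so the printed ratio is not expected to match the reported 2.73.

```python
def energy_per_inference(fixed_energy_j, per_inference_energy_j, n_inferences):
    """Average energy per inference when a one-time cost is amortized.

    fixed_energy_j: one-time energy (e.g. training), in joules (hypothetical).
    per_inference_energy_j: energy of a single inference, in joules (hypothetical).
    """
    if n_inferences <= 0:
        raise ValueError("n_inferences must be positive")
    return fixed_energy_j / n_inferences + per_inference_energy_j


if __name__ == "__main__":
    FIXED_J = 1000.0   # illustrative value, not from the paper
    PER_INF_J = 1.0    # illustrative value, not from the paper
    for n in (100, 1000, 10000):
        print(n, "inferences:", energy_per_inference(FIXED_J, PER_INF_J, n), "J each")
    ratio = (energy_per_inference(FIXED_J, PER_INF_J, 100)
             / energy_per_inference(FIXED_J, PER_INF_J, 1000))
    print("cost ratio, 100 vs 1000 inferences:", ratio)
```

Under this toy model the per-inference cost approaches the marginal inference energy as usage grows, which is the qualitative trend the abstract reports.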

arXiv:2511.14465v1 Announce Type: new 
Abstract: Mechanistic interpretability research requires reliable tools for analyzing transformer internals across diverse architectures. Current approaches face a fundamental tradeoff: custom implementations like TransformerLens ensure consistent interfaces but require coding a manual adaptation for each architecture, introducing numerical mismatch with the original models, while direct HuggingFace access through NNsight preserves exact behavior but lacks standardization across models. To bridge this gap, we develop nnterp, a lightweight wrapper around NNsight that provides a unified interface for transformer analysis while preserving original HuggingFace implementations. Through automatic module renaming and comprehensive validation testing, nnterp enables researchers to write intervention code once and deploy it across 50+ model variants spanning 16 architecture families. The library includes built-in implementations of common interpretability methods (logit lens, patchscope, activation steering) and provides direct access to attention probabilities for models that support it. By packaging validation tests with the library, researchers can verify compatibility with custom models locally. nnterp bridges the gap between correctness and usability in mechanistic interpretability tooling.

The article covers nnterp, a new tool designed to advance research on the mechanistic interpretability of transformer models. Current approaches face challenges with standardization and numerical accuracy when analyzing different architectures. nnterp acts as a lightweight wrapper around NNsight, providing a unified interface for transformer analysis while preserving the original HuggingFace implementations. The tool lets researchers write intervention code once and apply it across more than 50 diverse models from 16 architecture families, facilitating comprehensive interpretability testing.

The article presents nnterp, a new tool designed to improve research on the mechanistic interpretability of transformer models. Current methods face challenges in standardization and numerical accuracy when analyzing different architectures. nnterp acts as a lightweight wrapper around NNsight, providing a unified interface for transformer analysis while keeping the original HuggingFace implementations. It allows researchers to write intervention code once and apply it to more than 50 model variants …

The article presents nnterp, a new tool designed to improve research on the mechanistic interpretability of transformer models. Current methods encounter challenges in standardization and numerical accuracy when analyzing different architectures. nnterp acts as a lightweight wrapper around NNsight, offering a unified interface for transformer analysis while keeping the original HuggingFace implementations. It allows researchers to write intervention code once and apply it to more than 50 model variants …

The article discusses nnterp, a new tool designed to enhance mechanistic interpretability research for transformer models. Current methods face challenges in standardization and numerical accuracy when analyzing different architectures. nnterp serves as a lightweight wrapper around NNsight, providing a unified interface for transformer analysis while maintaining the original HuggingFace implementations. It allows researchers to write intervention code once and apply it across over 50 model variants from 16 architecture families, facilitating comprehensive interpretability testing.

MixAR: Mixture Autoregressive Image Generation
