arXiv:2511.12767v1 Announce Type: cross 
Abstract: Automatic sign language recognition plays a crucial role in bridging the communication gap between deaf communities and hearing individuals; however, most available datasets focus on American Sign Language. For Romanian Isolated Sign Language Recognition (RoISLR), no large-scale, standardized dataset exists, which limits research progress. In this work, we introduce a new corpus for RoISLR, named RoCoISLR, comprising over 9,000 video samples that span nearly 6,000 standardized glosses from multiple sources. We establish benchmark results by evaluating seven state-of-the-art video recognition models (I3D, SlowFast, Swin Transformer, TimeSformer, Uniformer, VideoMAE, and PoseConv3D) under consistent experimental setups, and compare their performance with that on the widely used WLASL2000 corpus. According to the results, transformer-based architectures outperform convolutional baselines; Swin Transformer achieved a Top-1 accuracy of 34.1%. Our benchmarks highlight the challenges associated with long-tail class distributions in low-resource sign languages, and RoCoISLR provides the initial foundation for systematic RoISLR research.

The article introduces RoCoISLR, a new corpus for Romanian Isolated Sign Language Recognition, addressing the lack of large-scale datasets for this language. Comprising over 9,000 video samples and nearly 6,000 standardized glosses, RoCoISLR aims to enhance automatic sign language recognition, which is vital for communication between deaf and hearing individuals. The study evaluates seven advanced video recognition models, revealing that transformer-based architectures, particularly the Swin Transformer, achieve superior performance compared to traditional convolutional models.

RoCoISLR: A Romanian Corpus for Isolated Sign Language Recognition

The story of the Puppet Master, the main villain of Ghost in the Shell, hinted at a future where governments use hackers for espionage, at a time when most of the world had never connected to the internet.

The classic anime 'Ghost in the Shell' features the Puppet Master, a character that foreshadows a future where governments utilize hackers for espionage. This prediction emerged at a time when the majority of the global population had yet to connect to the internet, highlighting the show's foresight regarding cybersecurity issues.

How the classic anime ‘Ghost in the Shell’ predicted the future of cybersecurity 30 years ago

<p>Text-to-image diffusion models have become the workhorses of generative imaging. They can paint photorealistic scenes, mimic art styles, and blend concepts in ways that were science fiction a few years ago. Yet they stumble embarrassingly on a skill that even small children master: basic spatial reasoning.</p>

<p>Ask a state-of-the-art model for “a dog to the right of a teddy bear” and you often get:</p>

<ul>
<li>The dog on the left</li>
<li>One of the objects missing</li>
<li>Or a bizarre hybrid where dog and teddy are fused into a single creature</li>
</ul>

<p><a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49rtb08366xdl284o4z0.jpg" class="article-body-image-wrapper"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49rtb08366xdl284o4z0.jpg" alt=" " width="800" height="532"></a></p>

<p>These failures become more severe for unusual compositions like “a giraffe above an airplane”. Traditional fixes range from expensive fine-tuning to brittle, hand-written loss functions at inference time—but both options come with significant downsides.</p>

<p>NVIDIA’s Learn-to-Steer framework (accepted to WACV 2026) proposes a different path: instead of hard-coding spatial rules or retraining the entire model, it learns a data-driven objective that can “steer” diffusion at inference time. The method reads the model’s own cross-attention maps, trains a lightweight classifier to detect spatial relations, and then uses that classifier’s gradient as a learned loss to nudge the generation towards layouts that match the prompt.</p>

<p>In this blog, we’ll unpack:</p>

<ul>
<li>What makes spatial reasoning so fragile in current diffusion models</li>
<li>How Learn-to-Steer learns spatial constraints from the model itself</li>
<li>How it steers images during generation without changing model weights</li>
<li>The gains it reports on spatial benchmarks like GenEval and T2I-CompBench</li>
<li>The trade-offs in compute cost and generality, and what this implies for future generative systems</li>
</ul>

<h1>
  
  
  Why Spatial Reasoning Fails in Text-to-Image Diffusion
</h1>

<h2>
  
  
  What Makes Spatial Relations So Difficult for Diffusion Models?
</h2>

<p>Modern diffusion models (e.g., Stable Diffusion, Flux) are excellent at what should appear in an image—objects, styles, textures—but much less reliable at where those objects should be.</p>

<p>Several factors contribute:</p>

<h3>
  
  
  Weak supervision of spatial language
</h3>

<ul>
<li>Training data rarely comes with precise annotations like “object A is left of object B”.
</li>
<li>Captions often describe content loosely, so phrases like “on top of” or “to the right of” are under-specified.</li>
</ul>

<h3>
  
  
  Entangled visual concepts
</h3>

<ul>
<li>When two objects frequently co-occur, models may treat them as a single visual blob.</li>
<li>This leads to object fusion, where a “cat on a bookshelf” becomes a cat-bookshelf chimera.</li>
</ul>

<h3>
  
  
  Benchmark saturation without spatial coverage
</h3>

<ul>
<li>Many standard text-to-image benchmarks emphasize realism and style, not relational accuracy.</li>
<li>Models can score highly while still being spatially confused.</li>
</ul>

<p>Empirical studies confirm three recurring failure modes on spatial benchmarks:</p>

<ul>
<li>Incorrect placement: Objects appear in the wrong relative position.</li>
<li>Missing entities: One or more requested objects never appear.</li>
<li>Merged entities: Two objects get mashed into a single, incoherent form.</li>
</ul>

<p>The model “knows” the objects you asked for, but it doesn’t reliably understand where to place them.</p>

<h1>
  
  
  Why Fine-Tuning and Handcrafted Losses Are Not Enough
</h1>

<p>Two broad strategies have tried to patch this gap:</p>

<h2>
  
  
  Fine-tuning for spatial awareness
</h2>

<ul>
<li>Retrain the diffusion model on datasets with explicit layouts or spatial annotations.</li>
<li>Methods like COMPASS show that this can significantly improve spatial accuracy.</li>
<li>But this comes at a cost: expensive retraining, sensitivity to dataset bias, and often regressions in other capabilities such as color fidelity or counting.</li>
</ul>

<h2>
  
  
  Handcrafted test-time losses
</h2>

<ul>
<li>At inference, inject extra loss terms that penalize spatial errors (e.g., overlapping activation maps, incorrect ordering).</li>
<li>These losses must be manually designed to approximate relations like “left of” or “above”.</li>
<li>In practice, these heuristics are fragile, often over-fitting simple cases and failing on more complex layouts.</li>
</ul>

<p>In short, we’ve lacked a solution that is:</p>

<ul>
<li>Data-driven rather than rule-based</li>
<li>Plug-and-play at inference time (no full retraining)</li>
<li>Targeted enough to improve spatial reasoning without damaging other strengths</li>
</ul>

<p>This is where Learn-to-Steer enters.</p>

<h1>
  
  
  How Learn-to-Steer Works: Data-Driven Steering at Inference
</h1>

<h2>
  
  
  How Cross-Attention Maps Provide a Spatial Signal
</h2>

<p>During diffusion, at each denoising step, the model computes cross-attention maps that connect text tokens to image regions. For a prompt like “a dog to the right of a teddy bear”, you can think of:</p>

<ul>
<li>One set of attention maps for “dog”</li>
<li>Another set for “teddy bear”</li>
<li>Additional context around words like “right” or “of”</li>
</ul>

<p>These maps form a rich, high-dimensional signal describing where in the image the model currently believes each word should manifest. Prior work has used cross-attention to locate objects or edit images; Learn-to-Steer goes further by treating them as a feature space in which spatial relations can be learned.</p>
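<p>To make the idea of a per-token spatial map concrete, here is a minimal sketch in plain Python. It is not the framework's code: the query and key vectors are made-up toy values, and the function name <code>cross_attention_maps</code> is hypothetical. It only shows how scaled dot-product attention between image patches and text tokens yields one map per token.</p>

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_maps(image_queries, token_keys):
    """Return one spatial map per text token.

    image_queries: list of P query vectors, one per image patch.
    token_keys:    list of T key vectors, one per text token.
    Result: T lists of length P; entry [t][p] is how strongly patch p
    attends to token t (each patch's weights sum to 1 over the tokens).
    """
    dim = len(token_keys[0])
    scale = 1.0 / math.sqrt(dim)
    per_patch = []
    for q in image_queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in token_keys]
        per_patch.append(softmax(scores))
    # Transpose from [patch][token] to [token][patch].
    return [[per_patch[p][t] for p in range(len(image_queries))]
            for t in range(len(token_keys))]

# Toy example: a 1x4 strip of patches and two tokens ("dog", "bear").
# All vectors are invented for illustration.
queries = [[4.0, 0.0], [3.0, 1.0], [1.0, 3.0], [0.0, 4.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
dog_map, bear_map = cross_attention_maps(queries, keys)
```

<p>In this toy strip, <code>dog_map</code> peaks on the leftmost patch and <code>bear_map</code> on the rightmost one, which is the kind of layout evidence a relation classifier could read.</p>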

<h2>
  
  
  How a Relation Classifier Becomes a Learned Loss
</h2>

<p>The core idea of Learn-to-Steer is to train a small relation classifier that takes cross-attention maps for two objects and predicts the spatial relation between them (left-of, right-of, above, below, etc.).</p>

<p>The pipeline looks like this:</p>

<h3>
  
  
  Collect supervision
</h3>

<ul>
<li>Use images where the true relation between object A and object B is known (from datasets like GQA and synthetic layouts).</li>
<li>For each image, invert it through the diffusion model with a descriptive prompt to recover cross-attention maps for the relevant tokens.</li>
</ul>

<p><a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9dbjsdc4c8yjz2r88k4.jpg" class="article-body-image-wrapper"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9dbjsdc4c8yjz2r88k4.jpg" alt=" " width="800" height="446"></a></p>

<h3>
  
  
  Train a classifier on attention patterns
</h3>

<ul>
<li>Input: attention maps for object A and object B.</li>
<li>Output: predicted relation (e.g., “A is left of B”).</li>
</ul>

<p>Naively, however, this leads to a subtle but serious issue: relation leakage.</p>

<h2>
  
  
  How Dual Inversion Solves the “Relation Leakage” Problem
</h2>

<p>If you always invert images with a correct prompt (e.g., “a dog to the left of a cat”), hints about the word “left” can leak into the attention patterns. A naïve classifier might then “cheat” by reading out linguistic artefacts instead of learning genuine visual geometry.</p>

<p>To prevent this, Learn-to-Steer uses a dual inversion strategy:</p>

<ul>
<li>For each image with a true relation (say, dog left of cat), create two prompts:

<ul>
<li>A positive prompt with the correct relation (“dog to the left of a cat”).</li>
<li>A negative prompt with an incorrect relation (“dog above a cat”).</li>
</ul>


</li>

<li>Run inversion with both prompts, obtaining two sets of attention maps.</li>

<li>Label both sets with the true relation (left-of), because that is what the image actually depicts.</li>

</ul>

<p>The classifier sees pairs of attention maps that share the same underlying geometry but differ in the relation words used in the prompt. To succeed, it must ignore the unreliable linguistic cue and zero in on the geometric evidence in the attention patterns. This breaks the leakage shortcut and yields a classifier that actually understands “left-of” in terms of where things appear in the model’s internal vision.</p>
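<p>The labeling rule behind dual inversion can be sketched in a few lines of Python. This is an illustrative assumption about how the training examples might be assembled, not the authors' code: the relation names, the phrase table, and the helper <code>dual_inversion_prompts</code> are all hypothetical, and the inversion step itself is left out.</p>

```python
import random

# Hypothetical mapping from relation labels to prompt phrases.
RELATION_PHRASES = {
    "left-of": "to the left of",
    "right-of": "to the right of",
    "above": "above",
    "below": "below",
}

def dual_inversion_prompts(subject, obj, true_relation, rng):
    """Build the positive and negative prompts for one annotated image.

    Both prompts are paired with the SAME label, the relation the image
    actually depicts, so the wording of the prompt cannot predict the label.
    """
    wrong = rng.choice([r for r in RELATION_PHRASES if r != true_relation])
    positive = f"a {subject} {RELATION_PHRASES[true_relation]} a {obj}"
    negative = f"a {subject} {RELATION_PHRASES[wrong]} a {obj}"
    return [(positive, true_relation), (negative, true_relation)]

examples = dual_inversion_prompts("dog", "cat", "left-of", random.Random(0))
```

<p>Each returned prompt would then be used for one inversion pass, and the resulting attention maps would both be stored under the label <code>left-of</code>.</p>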

<p>To improve robustness, NVIDIA combines:</p>

<ul>
<li>Real images (complex, natural scenes)</li>
<li>Synthetic images (simpler, cleaner attention patterns akin to generation scenarios)</li>
</ul>

<h1>
  
  
  How Learn-to-Steer Guides Images During Generation
</h1>

<h2>
  
  
  Step-by-Step: From Prompt to Steered Latent
</h2>

<p>Once the relation classifier is trained, Learn-to-Steer uses it at inference time as a learned objective:</p>

<h3>
  
  
  Parse the spatial prompt
</h3>

<ul>
<li>Extract subject, relation, and object from the text (e.g., subject = dog, relation = right-of, object = teddy bear).</li>
</ul>

<h3>
  
  
  Run diffusion as usual—but with checkpoints
</h3>

<ul>
<li>As the model denoises latent noise into an image, periodically extract cross-attention maps for the subject and object tokens.</li>
</ul>

<h3>
  
  
  Evaluate spatial correctness
</h3>

<ul>
<li>Feed these maps into the relation classifier, which outputs a probability distribution over relations.</li>
<li>Compare this distribution to the desired relation from the prompt, and compute a loss (e.g., cross-entropy).</li>
</ul>

<h3>
  
  
  Backpropagate into the latent
</h3>

<ul>
<li>Compute the gradient of this loss with respect to the latent representation at that timestep.</li>
<li>Nudge the latent in the direction that increases the classifier’s confidence in the correct relation.</li>
</ul>

<h3>
  
  
  Continue the diffusion process
</h3>

<ul>
<li>Let the denoising proceed from the adjusted latent.</li>
<li>Repeat this steering a number of times (often during the earlier half of the diffusion steps).</li>
</ul>
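<p>The steps above can be sketched with a deliberately tiny toy model. Everything here is an assumption made for illustration: the "latent" is just two horizontal positions, the relation classifier is a hand-written logistic function, and the gradient is derived by hand. In the actual method the classifier reads cross-attention maps and the gradient flows back through the diffusion model.</p>

```python
import math

def p_right_of(latent, w=4.0):
    """Toy stand-in for the relation classifier: P(subject is right of object)."""
    x_subject, x_object = latent
    return 1.0 / (1.0 + math.exp(-w * (x_subject - x_object)))

def steering_step(latent, lr=0.1, w=4.0):
    """One gradient step on the latent for the target relation 'right-of'.

    The loss is cross-entropy, -log p. For this toy classifier its gradient
    with respect to (x_subject, x_object) is (-(1 - p) * w, +(1 - p) * w).
    """
    p = p_right_of(latent, w)
    grad = (-(1.0 - p) * w, (1.0 - p) * w)
    return (latent[0] - lr * grad[0], latent[1] - lr * grad[1])

def generate(latent, num_steps=20, steer_until=10):
    """Hypothetical sampling loop that steers only during the early steps."""
    for step in range(num_steps):
        if step < steer_until:
            latent = steering_step(latent)
        # A real sampler would apply its usual denoising update here.
    return latent

start = (0.3, 0.7)  # the subject starts to the LEFT of the object
end = generate(start)
```

<p>After the steered steps, the subject position in <code>end</code> lies to the right of the object position and the toy classifier's confidence in "right-of" has risen, which mirrors the nudge described above.</p>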

<h2>
  
  
  Support for Multiple Architectures and Relations
</h2>

<p><a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F578a9bjc7gmtemh0jbsj.jpg" class="article-body-image-wrapper"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F578a9bjc7gmtemh0jbsj.jpg" alt=" " width="800" height="477"></a></p>

<p>A key advantage of Learn-to-Steer is that it’s architecture-agnostic:</p>

<ul>
<li>It has been demonstrated on both UNet-based models (like Stable Diffusion 1.4/2.1) and MMDiT-style models (like Flux).</li>
<li>The only requirement is access to a text-image alignment signal (cross-attention or similar).</li>
</ul>

<p>It can also handle prompts with multiple constraints, such as:</p>

<p>“A frog above a sneaker below a teapot.”</p>

<p>Here, Learn-to-Steer alternates between the relations across timesteps:</p>

<ul>
<li>At one timestep, optimize the frog–sneaker relation.</li>
<li>At another, optimize the sneaker–teapot relation.</li>
</ul>
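<p>One simple way to picture this alternation is a round-robin schedule over the parsed relations. The round-robin order and the helper name <code>relation_for_step</code> are assumptions for illustration; the text above only says that different timesteps optimize different relations.</p>

```python
def relation_for_step(step, relations):
    """Pick which (subject, relation, object) triple to steer at this step.

    Round-robin selection is an assumed policy, chosen for simplicity.
    """
    return relations[step % len(relations)]

# Triples parsed from "A frog above a sneaker below a teapot."
relations = [("frog", "above", "sneaker"), ("sneaker", "below", "teapot")]
schedule = [relation_for_step(t, relations) for t in range(4)]
```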

NVIDIA's Learn-to-Steer is set to address a significant limitation in text-to-image diffusion models, which struggle with basic spatial reasoning. These models can create photorealistic images but often misplace objects in relation to one another, such as placing a dog to the left of a teddy bear instead of the right. This advancement aims to enhance the accuracy of generated images by improving spatial understanding.

What Is Learn-to-Steer? NVIDIA’s 2025 Spatial Fix for Text-to-Image Diffusion

<p>The $5tn firm handily beat expectations, but analysts are awaiting projections for future demand for the firm’s AI chips</p><p>Nvidia shares are rising in after-market trading after the company posted third-quarter earnings that beat Wall Street estimates. All eyes were on Nvidia, the bellwether for the AI industry and the most valuable publicly traded company in the world, as analysts and investors hoped the chipmaker’s third-quarter earnings would assuage concerns about whether the high-flying valuations of AI firms have peaked.</p><p>“Blackwell sales are off the charts, and cloud GPUs are sold out,” said Jensen Huang, founder and CEO of Nvidia, in a press release. “Compute demand keeps accelerating and compounding across training and inference – each growing exponentially. We’ve entered the virtuous cycle of AI. The AI ecosystem is scaling fast – with more new foundation model makers, more AI startups, across more industries, and in more countries. AI is going everywhere, doing everything, all at once.”</p> <a href="https://www.theguardian.com/technology/2025/nov/19/nvidia-earning-report">Continue reading...</a>

Nvidia exceeded Wall Street expectations with its third-quarter earnings, showcasing strong demand for its AI chips. The company's shares rose in after-market trading, reflecting investor confidence amid concerns about the AI market's valuation. CEO Jensen Huang highlighted record sales and a rapidly expanding AI ecosystem, indicating a positive outlook for the company's future.

‘AI is going everywhere, doing everything:’ Nvidia beats Wall Street estimates amid market selloff and AI bubble fears

The SanDisk ExtremeFit USB-C flash drive weighs barely three grams, but offers 1TB of external storage and impressive speeds.

I refused to believe this coin-sized gadget was a storage drive, until I tried it for myself

<p>Swift 6.3 is bringing significant enhancements to Embedded Swift, the subset of Swift designed for resource-constrained environments like microcontrollers. Here's what's new:</p>

<h2>
  
  
  Key Improvements
</h2>

<h3>
  
  
  Libraries &amp; Standard Library
</h3>

<ul>
<li>
<strong>Floating-point printing</strong>: The <code>description</code> and <code>debugDescription</code> properties now work for Float, Double, and other floating-point types with a new all-Swift implementation</li>
<li>
<strong>Better diagnostics</strong>: New <code>EmbeddedRestrictions</code> diagnostic group warns about unsupported language constructs</li>
<li>
<strong>Swift MMIO 0.1.x</strong>: Includes code generation from SVD files and improved debugging with the SVD2LLDB plugin</li>
</ul>

<h3>
  
  
  C Interoperability
</h3>

<ul>
<li>
<strong><code>@c</code> attribute</strong>: Define C-compatible functions and enums (from SE-0495)
</li>
</ul>

<div class="highlight js-code-highlight">
<pre class="highlight swift"><code><span class="kd">@c</span><span class="p">(</span><span class="kt">MyLib_initialize</span><span class="p">)</span>
<span class="kd">public</span> <span class="kd">func</span> <span class="nf">initialize</span><span class="p">()</span> <span class="p">{</span> <span class="o">...</span> <span class="p">}</span>
</code></pre>

</div>



<ul>
<li>
<strong>Improved type matching</strong>: Better tolerance for mismatching C signatures, eliminating cryptic deserialization errors</li>
</ul>

<h3>
  
  
  Debugging
</h3>

<ul>
<li>
<strong>Enhanced LLDB support</strong>: Better value printing for Embedded Swift types</li>
<li>
<strong>Core dump inspection</strong>: Dictionary, Array, and other common types now inspectable without a live process</li>
<li>
<strong>ARMv7m exception unwinding</strong>: Complete backtraces through exception frames</li>
</ul>

<h3>
  
  
  Linking &amp; Compilation
</h3>

<ul>
<li>
<strong><code>@section</code> and <code>@used</code> attributes</strong>: Control where globals are emitted and ensure symbols aren't stripped (SE-0492)</li>
<li>
<strong>Weak symbol definitions</strong>: Fixes duplicate symbol errors in diamond dependencies</li>
<li>
<strong><code>@export</code> attribute</strong>: Better control over function visibility (SE-0497)</li>
</ul>




<p><em>Want to dive deeper? Read the <a href="https://www.swift.org/blog/embedded-swift-improvements-coming-in-swift-6.3/" rel="noopener noreferrer">full announcement</a> on Swift.org</em></p>

Swift 6.3 introduces significant upgrades to Embedded Swift, enhancing its functionality for resource-constrained environments like microcontrollers. Key improvements include new floating-point printing capabilities, better diagnostics with the EmbeddedRestrictions group, and the introduction of Swift MMIO 0.1.x for code generation and debugging.

Embedded Swift Gets Major Upgrades in Swift 6.3

<a href="https://www.techspot.com/news/110317-judge-dismisses-lawsuit-twice-due-alleged-deepfake-video.html" target="_blank"><img src="https://www.techspot.com/images2/news/ts3_thumbs/2025/11/2025-11-19-ts3_thumbs-252.jpg" width="800" height="560" style="padding: 15px 0" title="Judge dismisses lawsuit twice due to alleged deepfake video testimony" /></a><br />A California housing dispute is getting media attention over allegations that lawyers presented a deepfake video as witness testimony. NBC News reports that Judge Victoria Kolakowski became suspicious after the supposed witness showed signs that something was not right, including a monotone voice, fuzzy facial features, and repeated facial expressions....<br /><br /><a href="https://www.techspot.com/news/110317-judge-dismisses-lawsuit-twice-due-alleged-deepfake-video.html">Read Entire Article</a><br /><br />

A California housing dispute has drawn attention after allegations surfaced that lawyers presented a deepfake video as witness testimony. Judge Victoria Kolakowski expressed skepticism about the video, noting the witness's monotone voice, unclear facial features, and repetitive expressions. This led to the dismissal of the lawsuit on two occasions.

Judge dismisses lawsuit twice due to alleged deepfake video testimony

arXiv:2511.13897v1 Announce Type: new 
Abstract: Temporal realism remains a central weakness of current generative video models, as most evaluation metrics prioritize spatial appearance and offer limited sensitivity to motion. We introduce a scalable, model-agnostic framework that assesses temporal behavior using motion vectors (MVs) extracted directly from compressed video streams. Codec-generated MVs from standards such as H.264 and HEVC provide lightweight, resolution-consistent descriptors of motion dynamics. We quantify realism by computing Kullback-Leibler, Jensen-Shannon, and Wasserstein divergences between MV statistics of real and generated videos. Experiments on the GenVidBench dataset containing videos from eight state-of-the-art generators reveal systematic discrepancies from real motion: entropy-based divergences rank Pika and SVD as closest to real videos, MV-sum statistics favor VC2 and Text2Video-Zero, and CogVideo shows the largest deviations across both measures. Visualizations of MV fields and class-conditional motion heatmaps further reveal center bias, sparse and piecewise constant flows, and grid-like artifacts that frame-level metrics do not capture. Beyond evaluation, we investigate MV-RGB fusion through channel concatenation, cross-attention, joint embedding, and a motion-aware fusion module. Incorporating MVs improves downstream classification across ResNet, I3D, and TSN backbones, with ResNet-18 and ResNet-34 reaching up to 97.4% accuracy and I3D achieving 99.0% accuracy on real-versus-generated discrimination. These findings demonstrate that compressed-domain MVs provide an effective temporal signal for diagnosing motion defects in generative videos and for strengthening temporal reasoning in discriminative models. The implementation is available at: https://github.com/KurbanIntelligenceLab/Motion-Vector-Learning
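
As a rough illustration of the three divergences named in the abstract, the sketch below compares two histograms of motion-vector magnitudes in plain Python. It is not the authors' implementation: the bin counts are invented, the function names are hypothetical, and the smoothing constant is an assumption added to avoid taking the logarithm of zero.

```python
import math

def normalize(hist, eps=1e-12):
    """Turn raw bin counts into a probability vector, smoothing empty bins."""
    smoothed = [h + eps for h in hist]
    total = sum(smoothed)
    return [h / total for h in smoothed]

def kl_divergence(p, q):
    """KL(p || q) in nats for two probability vectors of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetrized KL against the mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def wasserstein_1d(p, q, bin_width=1.0):
    """Wasserstein-1 distance between histograms on the same evenly spaced bins.

    On a shared grid this is the summed absolute difference between the two
    cumulative distributions, multiplied by the bin width.
    """
    cdf_p = cdf_q = 0.0
    total = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total * bin_width

# Invented bin counts of motion-vector magnitudes, for illustration only.
real = normalize([10, 30, 40, 20])
generated = normalize([40, 40, 15, 5])

kl = kl_divergence(real, generated)
js = js_divergence(real, generated)
w1 = wasserstein_1d(real, generated)
```

KL is asymmetric and unbounded, JS is symmetric and bounded by ln 2, and the Wasserstein distance also accounts for how far apart the differing bins are, which may be why a study would report all three.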

The paper discusses the evaluation of temporal realism in generative video models, highlighting a significant limitation in current metrics that focus primarily on spatial appearance. A new framework is introduced that utilizes motion vectors extracted from compressed video streams to assess temporal behavior. By analyzing Kullback-Leibler, Jensen-Shannon, and Wasserstein divergences between real and generated videos, the study identifies discrepancies in motion dynamics, with specific models showing varying degrees of realism.

Temporal Realism Evaluation of Generated Videos Using Compressed-Domain Motion Vectors
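
The three divergences named in the summary above can be illustrated on a pair of histograms. The following is a minimal sketch, assuming the motion statistics have already been binned into histograms over the same equally spaced bins; the bin counts, the function names, and the smoothing constant `eps` are illustrative assumptions and are not taken from the paper.

```python
import math

def normalize(hist, eps=1e-12):
    """Turn non-negative bin counts into probabilities (eps avoids log(0))."""
    smoothed = [h + eps for h in hist]
    total = sum(smoothed)
    return [s / total for s in smoothed]

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; not symmetric."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def wasserstein_1d(p, q, bin_width=1.0):
    """1-D Wasserstein-1 distance between histograms on shared bins:
    the sum of |CDF_p - CDF_q| times the bin width."""
    cdf_p = cdf_q = 0.0
    dist = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        dist += abs(cdf_p - cdf_q) * bin_width
    return dist

# Hypothetical histograms of motion-vector magnitudes (made-up counts).
real_counts = [40, 30, 20, 10]
generated_counts = [10, 20, 30, 40]
p = normalize(real_counts)
q = normalize(generated_counts)

kl_pq = kl_divergence(p, q)
js_pq = js_divergence(p, q)
w_pq = wasserstein_1d(p, q)
```

KL is asymmetric and unbounded, JS is symmetric and bounded, and the Wasserstein distance also accounts for how far apart the differing bins are, which is one plausible reason to report all three.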

arXiv:2510.13137v2 Announce Type: replace 
Abstract: This study investigates the performance of 3D Convolutional Neural Networks (3D CNNs) and Long Short-Term Memory (LSTM) networks for real-time American Sign Language (ASL) recognition. While 3D CNNs are good at spatiotemporal feature extraction from video sequences, LSTMs are optimized for modeling temporal dependencies in sequential data. We evaluate both architectures on a dataset containing 1,200 ASL signs across 50 classes, comparing their accuracy, computational efficiency, and latency under similar training conditions. Experimental results demonstrate that 3D CNNs achieve 92.4% recognition accuracy but require 3.2% more processing time per frame compared to LSTMs, which maintain 86.7% accuracy with significantly lower resource consumption. The hybrid 3D CNN-LSTM model shows decent performance, which suggests that context-dependent architecture selection is crucial for practical implementation. This project provides professional benchmarks for developing assistive technologies, highlighting trade-offs between recognition precision and real-time operational requirements in edge computing environments.

This study investigates the performance of 3D Convolutional Neural Networks (3D CNNs) and Long Short-Term Memory (LSTM) networks for real-time American Sign Language (ASL) recognition. The evaluation is based on a dataset of 1,200 ASL signs across 50 classes, focusing on accuracy, computational efficiency, and latency. Results show that 3D CNNs achieve 92.4% recognition accuracy but require more processing time per frame than LSTMs, which maintain 86.7% accuracy with lower resource consumption. A hybrid model shows decent performance, highlighting the importance of choosing the appropriate architecture.

Real-Time Sign Language to text Translation using Deep Learning: A Comparative study of LSTM and 3D CNN

arXiv:2511.14268v1 Announce Type: cross 
Abstract: Heterogeneous porous materials play a crucial role in various engineering systems. Microstructure characterization and reconstruction provide effective means for modeling these materials, which are critical for conducting physical property simulations, structure-property linkage studies, and enhancing their performance across different applications. To achieve superior controllability and applicability with small sample sizes, we propose a statistically controllable microstructure reconstruction framework that integrates neural networks with the sliced-Wasserstein metric. Specifically, our approach leverages local pattern distribution for microstructure characterization and employs a controlled sampling strategy to generate target distributions that satisfy given conditional parameters. A neural network-based model establishes the mapping from the input distribution to the target local pattern distribution, enabling microstructure reconstruction. Combining the sliced-Wasserstein metric with gradient optimization techniques minimizes the distance between these distributions, leading to a stable and reliable model. Our method can perform stochastic and controllable reconstruction tasks even with small sample sizes. Additionally, it can generate large-size (e.g., 512 and 1024) 3D microstructures using a chunking strategy. By introducing spatial location masks, our method excels at generating spatially heterogeneous and complex microstructures. We conducted experiments on stochastic reconstruction, controllable reconstruction, heterogeneous reconstruction, and large-size microstructure reconstruction across various materials. Comparative analysis through visualization, statistical measures, and physical property simulations demonstrates the method's effectiveness, providing new insights and possibilities for research on structure-property linkage and material inverse design.

A new framework for reconstructing the microstructure of heterogeneous porous materials has been proposed, integrating neural networks with the sliced-Wasserstein metric. This approach enhances microstructure characterization and reconstruction, which are essential for modeling these materials in engineering applications. By using local pattern distributions and a controlled sampling strategy, the framework aims to improve the controllability and applicability of microstructure reconstruction, even with small sample sizes.

Statistically controllable microstructure reconstruction framework for heterogeneous materials using sliced-Wasserstein metric and neural networks
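
To make the sliced-Wasserstein metric mentioned above concrete, here is a minimal sketch of a Monte Carlo estimate of the sliced Wasserstein-1 distance between two equal-size point sets. It covers only the metric itself, not the paper's neural network, local pattern extraction, or optimization loop; the function names, the number of projections, and the sample points are assumptions made for illustration.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def project(points, direction):
    """Project each point onto the direction and sort the resulting scalars."""
    return sorted(sum(a * b for a, b in zip(pt, direction)) for pt in points)

def sliced_wasserstein(xs, ys, n_projections=64, seed=0):
    """Average 1-D Wasserstein-1 distance over random projections.

    For equal-size samples, the 1-D distance is the mean absolute
    difference between the sorted projections.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-size samples")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        direction = random_unit_vector(dim, rng)
        px = project(xs, direction)
        py = project(ys, direction)
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(px)
    return total / n_projections

# Hypothetical 2-D samples: ys is xs shifted by one unit along the first axis.
xs = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0), (0.5, 2.0)]
ys = [(x + 1.0, y) for x, y in xs]

same = sliced_wasserstein(xs, xs)
shifted = sliced_wasserstein(xs, ys)
```

Each projection reduces the problem to one dimension, where the optimal transport cost follows from sorting, which is what makes the sliced variant cheap to evaluate.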

arXiv:2408.00540v4 Announce Type: replace-cross 
Abstract: Artificial Intelligence (AI) is being incorporated into several optimization, scheduling, and orchestration functions, as well as into native communication network functions. This paradigm shift results in increased energy consumption; however, quantifying the end-to-end energy consumption of adding intelligence to communication systems remains an open challenge, since conventional energy consumption metrics focus on either communication, computation infrastructure, or model development. To address this, we propose a new metric, the Energy Cost of AI Lifecycle (eCAL) of an AI model in a system. eCAL captures the energy consumption throughout the development, deployment, and utilization of an AI model providing intelligence in a communication network by (i) analyzing the complexity of data collection and manipulation in individual components and (ii) deriving overall and per-bit energy consumption. We show that as a trained AI model is used more frequently for inference, its energy cost per inference decreases, since the fixed training energy is amortized over a growing number of inferences. For a simple case study, we show that eCAL for 100 inferences is 2.73 times higher than for 1000 inferences. Additionally, we have developed a modular and extendable open-source simulation tool to enable researchers, practitioners, and engineers to calculate the end-to-end energy cost with various configurations and across various systems, ensuring adaptability to diverse use cases.

The article discusses the integration of Artificial Intelligence (AI) into communication networks, highlighting the increased energy consumption associated with this shift. It presents a new metric, the Energy Cost of AI Lifecycle (eCAL), which quantifies the energy used during the development, deployment, and utilization of AI models in communication systems. The study emphasizes the need for a comprehensive view of energy consumption, since conventional metrics focus on either communication, computation infrastructure, or model development.

The Energy Cost of Artificial Intelligence Lifecycle in Communication Networks
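
The amortization effect described in the abstract can be seen in a toy model. The sketch below assumes the simplest possible form, a fixed energy cost spread over the number of inferences plus a constant energy per inference. This is not the paper's eCAL formula, which is derived per bit from the individual system components, and the energy values are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
def energy_per_inference(fixed_energy_j, inference_energy_j, n_inferences):
    """Toy amortization model: the fixed (training-side) energy is spread
    over n_inferences, then the per-inference energy is added."""
    if n_inferences <= 0:
        raise ValueError("n_inferences must be positive")
    return fixed_energy_j / n_inferences + inference_energy_j

# Hypothetical values in joules; not taken from the paper.
FIXED_ENERGY_J = 100.0
INFERENCE_ENERGY_J = 1.0

cost_100 = energy_per_inference(FIXED_ENERGY_J, INFERENCE_ENERGY_J, 100)
cost_1000 = energy_per_inference(FIXED_ENERGY_J, INFERENCE_ENERGY_J, 1000)
ratio = cost_100 / cost_1000
```

Under this toy model the per-inference cost approaches the constant inference energy as usage grows, and the ratio between two usage levels depends only on how large the fixed energy is relative to the per-inference energy.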

arXiv:2511.14465v1 Announce Type: new 
Abstract: Mechanistic interpretability research requires reliable tools for analyzing transformer internals across diverse architectures. Current approaches face a fundamental tradeoff: custom implementations like TransformerLens ensure consistent interfaces but require coding a manual adaptation for each architecture, introducing numerical mismatch with the original models, while direct HuggingFace access through NNsight preserves exact behavior but lacks standardization across models. To bridge this gap, we develop nnterp, a lightweight wrapper around NNsight that provides a unified interface for transformer analysis while preserving original HuggingFace implementations. Through automatic module renaming and comprehensive validation testing, nnterp enables researchers to write intervention code once and deploy it across 50+ model variants spanning 16 architecture families. The library includes built-in implementations of common interpretability methods (logit lens, patchscope, activation steering) and provides direct access to attention probabilities for models that support it. By packaging validation tests with the library, researchers can verify compatibility with custom models locally. nnterp bridges the gap between correctness and usability in mechanistic interpretability tooling.

The article presents nnterp, a new tool designed to support mechanistic interpretability research on transformer models. Current methods face challenges in standardization and numerical accuracy when analyzing different architectures. nnterp serves as a lightweight wrapper around NNsight, providing a unified interface for transformer analysis while preserving the original HuggingFace implementations. It allows researchers to write intervention code once and apply it across more than 50 model variants from 16 architecture families, facilitating comprehensive interpretability testing.
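
The idea of a unified interface over differently named modules can be sketched without any interpretability library. The example below is a toy illustration only, not nnterp's or NNsight's API: `ToyStandardizedView`, `resolve`, and the stand-in model objects are hypothetical, and only the two attribute paths mirror how HuggingFace GPT-2 and Llama causal language models expose their transformer blocks.

```python
from types import SimpleNamespace

# Attribute paths to the list of transformer blocks in each family.
LAYER_PATHS = {
    "gpt2": "transformer.h",
    "llama": "model.layers",
}

def resolve(obj, dotted_path):
    """Follow a dotted attribute path such as 'model.layers'."""
    for name in dotted_path.split("."):
        obj = getattr(obj, name)
    return obj

class ToyStandardizedView:
    """Hypothetical wrapper exposing one name, `layers`, for every family."""

    def __init__(self, model, family):
        self._model = model
        self._family = family

    @property
    def layers(self):
        return resolve(self._model, LAYER_PATHS[self._family])

# Stand-in objects that only mimic the attribute layout of real models.
toy_gpt2 = SimpleNamespace(transformer=SimpleNamespace(h=["block0", "block1"]))
toy_llama = SimpleNamespace(
    model=SimpleNamespace(layers=["block0", "block1", "block2"])
)

def count_layers(view):
    """Analysis code written once against the standardized name."""
    return len(view.layers)

gpt2_layers = count_layers(ToyStandardizedView(toy_gpt2, "gpt2"))
llama_layers = count_layers(ToyStandardizedView(toy_llama, "llama"))
```

Because the wrapper only renames access paths and leaves the underlying modules untouched, the wrapped model computes exactly what the original does, which is the property the abstract contrasts with reimplementation-based tools.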

RoCoISLR: A Romanian Corpus for Isolated Sign Language Recognition
