Mitigating Label Length Bias in Large Language Models

arXiv — cs.CL · Wednesday, November 19, 2025
  • The introduction of normalized contextual calibration (NCC) addresses label length bias in large language models (LLMs), a significant obstacle to consistent predictions across labels of varying lengths. The method normalizes predictions over the full label sequence rather than token by token.
  • The development of NCC matters for the reliability and accuracy of LLMs: it improves prediction consistency and broadens the applicability of these models to tasks such as multiple-choice question answering, where candidate labels often differ in length.
  • The ongoing evolution of LLMs highlights a critical need for methods that enhance output diversity and mitigate biases, as recent studies show. The intersection of NCC with automaton-based structured generation points to a broader push toward more controllable, less biased model outputs.
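As a rough illustration of the idea behind length-aware calibration, the sketch below scores each candidate label by its average per-token log-probability (so longer labels are not penalized merely for having more tokens) and subtracts the score the model assigns the same label given a content-free input, in the style of contextual calibration. The function names and the numeric log-probabilities are hypothetical; this is a minimal sketch, not the paper's actual NCC procedure.

```python
def length_normalized_score(token_logprobs):
    """Average per-token log-probability, so longer labels are not
    penalized simply for containing more tokens."""
    return sum(token_logprobs) / len(token_logprobs)

def calibrated_score(label_logprobs, content_free_logprobs):
    """Contextual-calibration-style correction: subtract the score the
    model assigns the same label given a content-free input."""
    return (length_normalized_score(label_logprobs)
            - length_normalized_score(content_free_logprobs))

# Hypothetical per-token log-probs for two candidate labels of
# different lengths (e.g. "positive" vs "not negative").
scores = {
    "positive": calibrated_score([-0.2, -0.1], [-0.9, -1.1]),
    "not negative": calibrated_score([-0.3, -0.4, -0.2], [-1.0, -0.8, -0.9]),
}
prediction = max(scores, key=scores.get)
```

Without the length normalization, summed log-probabilities would systematically favor the shorter label; averaging and calibrating puts candidates of different lengths on a comparable scale.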
— via World Pulse Now AI Editorial System


Recommended Readings
Do Large Language Models (LLMs) Understand Chronology?
Neutral · Artificial Intelligence
Large language models (LLMs) are increasingly utilized in finance and economics, where their ability to understand chronology is critical. A study tested this capability through various chronological ordering tasks, revealing that while models like GPT-4.1 and GPT-5 can maintain local order, they struggle with creating a consistent global timeline. The findings indicate a significant drop in exact match rates as task complexity increases, particularly in conditional sorting tasks, highlighting inherent limitations in LLMs' chronological reasoning.
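The exact-match metric mentioned above is all-or-nothing: an example counts only if the predicted ordering reproduces the gold ordering in full, which is why scores fall quickly as task complexity grows. A minimal sketch of such a metric (the function name and sample data are illustrative, not from the study):

```python
def exact_match_rate(predicted, gold):
    """Fraction of examples whose predicted ordering matches the gold
    ordering exactly; partial credit for mostly-correct orders is not given."""
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)

# Illustrative data: the second prediction swaps two items, so it scores 0.
gold = [["1914", "1939", "1969"], ["founded", "acquired", "listed"]]
pred = [["1914", "1939", "1969"], ["acquired", "founded", "listed"]]
rate = exact_match_rate(pred, gold)  # one of two examples matches
```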
Automata-Based Steering of Large Language Models for Diverse Structured Generation
Positive · Artificial Intelligence
Large language models (LLMs) are increasingly used for generating structured outputs, but existing methods often lack diversity in their results. A recent study confirms this limitation and proposes a new method to enhance output diversity through automaton-based structured generation. By utilizing automata traversal history, the method guides LLMs towards generating novel structural patterns. Evaluations indicate a significant improvement in both structural and content diversity while maintaining generation efficiency. A case study demonstrates its effectiveness in creating diverse test cases…