Hierarchical structure understanding in complex tables with VLLMs: a benchmark and experiments

arXiv — cs.CL · Wednesday, November 12, 2025 at 5:00:00 AM
This study examines the ability of Vision Large Language Models (VLLMs) to comprehend the hierarchical structure of tables found in scientific literature. Building on the PubTables-1M dataset, the researchers introduce a benchmark called Complex Hierarchical Tables (CHiTab), consisting of intricate tables with hierarchical headings. Using various prompt engineering strategies, the study assesses multiple state-of-the-art VLLMs, both off the shelf and after fine-tuning for the task. The results indicate that even generic VLLMs, not specifically designed for table comprehension, can perform well at understanding these structures. The models' performance is also compared to that of humans on a smaller set of tables, providing insight into their capabilities. This research highlights the potential of VLLMs for interpreting structured data and offers guidance for future work on integrating such understanding into general-p…
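As a rough illustration of the kind of evaluation described above, the sketch below probes a VLLM with structure questions about a rendered table image and scores exact-match accuracy. It is not the paper's actual protocol: `query_vllm`, the prompt wording, the file paths, the example question, and the scoring rule are all hypothetical stand-ins.

```python
# Minimal sketch, assuming a hypothetical query_vllm() client and an
# illustrative ground-truth format; not CHiTab's actual prompts or metric.
from dataclasses import dataclass


@dataclass
class TableExample:
    image_path: str   # rendered table image (e.g. derived from PubTables-1M)
    question: str     # structure-probing question about the header hierarchy
    gold_answer: str  # expected answer derived from the table annotations


def query_vllm(image_path: str, prompt: str) -> str:
    """Placeholder for a call to a vision-language model (API or local
    checkpoint). Returns the model's text answer."""
    raise NotImplementedError("plug in a real VLLM client here")


def evaluate(examples: list[TableExample]) -> float:
    """Exact-match accuracy over structure questions (one simple metric;
    a benchmark may use finer-grained scoring)."""
    correct = 0
    for ex in examples:
        prompt = (
            "Look at the table in the image and answer the question about "
            "its header hierarchy.\n"
            f"Question: {ex.question}\nAnswer concisely."
        )
        answer = query_vllm(ex.image_path, prompt)
        correct += int(answer.strip().lower() == ex.gold_answer.strip().lower())
    return correct / max(len(examples), 1)


# Hypothetical example; paths and answers are illustrative only.
examples = [
    TableExample(
        image_path="tables/0001.png",
        question="Which top-level column header does the sub-header "
                 "'Precision' fall under?",
        gold_answer="Validation metrics",
    ),
]
# print(evaluate(examples))  # requires a real query_vllm implementation
```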
— via World Pulse Now AI Editorial System


Recommended Readings
AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning
Neutral · Artificial Intelligence
AirCopBench is a new benchmark introduced to evaluate Multimodal Large Language Models (MLLMs) in multi-drone collaborative perception tasks. It addresses the lack of comprehensive evaluation tools for multi-agent systems, which outperform single-agent setups in terms of coverage and robustness. The benchmark includes over 14,600 questions across various task dimensions, such as Scene Understanding and Object Understanding, designed to assess performance under challenging conditions.