EM-KD: Distilling Efficient Multimodal Large Language Model with Unbalanced Vision Tokens
Positive | Artificial Intelligence
- EM-KD is a new paradigm for enhancing Efficient Multimodal Large Language Models (MLLMs) that addresses the challenge of unbalanced vision tokens, a mismatch that can degrade comprehension capabilities. By applying Knowledge Distillation and aligning vision logits with the Hungarian matching algorithm (see the sketch after this list), EM-KD aims to make MLLMs both more efficient and more effective at processing visual information.
- This development is significant because it both reduces resource consumption in MLLMs and enhances their comprehension abilities, which is crucial for AI applications that require accurate interpretation of visual data alongside text.
- Advances in Knowledge Distillation techniques, such as the Dynamic Temperature Scheduler (a toy annealing schedule is sketched below), reflect a broader trend in AI research toward improving model efficiency and performance. These innovations highlight ongoing efforts to refine training methodologies so that AI systems can handle complex multimodal tasks more effectively.
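
The article does not give EM-KD's exact formulation, so the following is only a minimal sketch of the general idea: a student model that keeps fewer vision tokens than its teacher can have its vision logits paired with the teacher's via Hungarian matching, and the distillation loss is then computed over the matched pairs. The cosine-similarity cost and MSE objective here are illustrative assumptions, not the paper's method.

```python
# Minimal sketch (not the authors' implementation): aligning a student's
# smaller set of vision logits with a teacher's larger set via Hungarian
# matching, then distilling over the matched pairs.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def hungarian_vision_distill(student_logits, teacher_logits):
    """student_logits: (Ns, D), teacher_logits: (Nt, D), with Ns <= Nt."""
    # Cost of pairing each student token with each teacher token:
    # higher cosine similarity -> lower cost. (Assumed cost function.)
    cost = -F.cosine_similarity(
        student_logits.unsqueeze(1),   # (Ns, 1, D)
        teacher_logits.unsqueeze(0),   # (1, Nt, D)
        dim=-1,
    )                                  # (Ns, Nt)
    row_idx, col_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    # Distill only over the matched teacher tokens (assumed MSE objective).
    return F.mse_loss(student_logits[row_idx], teacher_logits[col_idx])

# Example: student keeps 144 vision tokens, teacher produces 576.
loss = hungarian_vision_distill(torch.randn(144, 4096), torch.randn(576, 4096))
```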
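
Likewise, the summary names a Dynamic Temperature Scheduler without describing it. As a purely illustrative sketch, a scheduler of this kind might anneal the softening temperature of a standard temperature-scaled KL distillation loss over training; the linear decay below is an assumption, not the paper's schedule.

```python
# Toy sketch of temperature-scheduled knowledge distillation.
# The linear decay from t_max to t_min is an illustrative assumption.
import torch
import torch.nn.functional as F

def scheduled_temperature(step, total_steps, t_max=4.0, t_min=1.0):
    # Linearly anneal the softening temperature as training progresses.
    frac = min(step / max(total_steps, 1), 1.0)
    return t_max + (t_min - t_max) * frac

def kd_loss(student_logits, teacher_logits, step, total_steps):
    t = scheduled_temperature(step, total_steps)
    # Standard temperature-scaled KL distillation objective.
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

loss = kd_loss(torch.randn(8, 32000), torch.randn(8, 32000), step=100, total_steps=1000)
```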
— via World Pulse Now AI Editorial System
