arXiv:2511.17727v1 Announce Type: new 
Abstract: Vision-language models (VLMs) have demonstrated remarkable performance across a wide range of computer-vision tasks, sparking interest in their potential for digital health applications. Here, we apply VLMs to two fundamental challenges in data-driven stroke rehabilitation: automatic quantification of rehabilitation dose and impairment from videos. We formulate these problems as motion-identification tasks, which can be addressed using VLMs. We evaluate our proposed framework on a cohort of 29 healthy controls and 51 stroke survivors. Our results show that current VLMs lack the fine-grained motion understanding required for precise quantification: dose estimates are comparable to a baseline that excludes visual information, and impairment scores cannot be reliably predicted. Nevertheless, several findings suggest future promise. With optimized prompting and post-processing, VLMs can classify high-level activities from a few frames, detect motion and grasp with moderate accuracy, and approximate dose counts within 25% of ground truth for mildly impaired and healthy participants, all without task-specific training or finetuning. These results highlight both the current limitations and emerging opportunities of VLMs for data-driven stroke rehabilitation and broader clinical video analysis.
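
The "within 25% of ground truth" criterion can be illustrated with a minimal sketch. The abstract does not say how the percentage is computed, so this assumes a relative absolute error per participant; the function name `within_tolerance` and the example counts are hypothetical and are not data from the study.

```python
def within_tolerance(predicted_count: int, true_count: int,
                     tolerance: float = 0.25) -> bool:
    """Return True if the predicted count is within `tolerance`
    (relative absolute error) of the ground-truth count.

    Assumption: relative error is |predicted - true| / true.
    """
    if true_count <= 0:
        raise ValueError("true_count must be positive")
    relative_error = abs(predicted_count - true_count) / true_count
    return relative_error <= tolerance


# Hypothetical counts, for illustration only.
print(within_tolerance(46, 40))  # relative error 0.15
print(within_tolerance(55, 40))  # relative error 0.375
```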

Vision-language models (VLMs) have shown potential across computer-vision tasks, prompting their application to data-driven stroke rehabilitation, where open problems include automatically quantifying rehabilitation dose and impairment from videos. A study of 29 healthy controls and 51 stroke survivors found that current VLMs lack fine-grained motion understanding: dose estimates were comparable to a baseline that excludes visual information, and impairment scores could not be reliably predicted.

The Potential and Limitations of Vision-Language Models for Human Motion Understanding: A Case Study in Data-Driven Stroke Rehabilitation

arXiv:2511.18415v1 Announce Type: cross 
Abstract: Vision-language models (VLMs) possess rich knowledge but often fail on hierarchical understanding tasks, where the goal is to predict a coarse-to-fine taxonomy path that remains consistent across all levels. We compare three inference paradigms for hierarchical VQA and find that stepwise reasoning, when conditioned on prior answers, significantly outperforms single-pass prompting. Further analysis indicates that the main limitation of current VLMs is their inability to maintain cross-level state, rather than a lack of taxonomic knowledge. Motivated by this diagnosis, we propose Self-Elicited Knowledge Distillation (SEKD), which requires no human labels or external tools: the same VLM is prompted to reason step by step and act as a teacher by exposing its hard labels, soft distributions, and decoder hidden states, while a single-pass student distills these signals. The student VLM remains efficient while approaching the accuracy of its multi-step teacher. It improves in-domain path consistency (HCA) by up to +29.50 percentage points, raises zero-shot HCA on an unseen taxonomy from 4.15% to 42.26%, and yields gains on challenging mathematical benchmarks. Because all supervision is self-elicited, SEKD scales to new taxonomies and datasets without annotation cost, providing a practical route to imbue compact VLMs with dependency-aware multi-step reasoning.
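
The three teacher signals named in the abstract (hard labels, soft distributions, decoder hidden states) could be combined into a single student objective along the following lines. This is a sketch under stated assumptions, not the authors' loss: the abstract does not give the loss terms or their weights, so the cross-entropy, KL-divergence, and mean-squared-error terms, the weights `w_hard`, `w_soft`, `w_hidden`, and all function names here are hypothetical choices commonly used in knowledge distillation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_probs, hard_label):
    """Negative log-likelihood of the teacher's hard label."""
    return -math.log(student_probs[hard_label])


def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student); terms with zero teacher mass are skipped."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0)


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_logits, student_hidden,
                      teacher_label, teacher_probs, teacher_hidden,
                      w_hard=1.0, w_soft=1.0, w_hidden=1.0):
    """Weighted sum of hard-label, soft-distribution, and hidden-state terms.

    Assumption: the three signals are combined additively; the weights
    are hypothetical and not taken from the paper.
    """
    student_probs = softmax(student_logits)
    return (w_hard * cross_entropy(student_probs, teacher_label)
            + w_soft * kl_divergence(teacher_probs, student_probs)
            + w_hidden * mse(student_hidden, teacher_hidden))


# Toy example: one taxonomy level with two candidate answers.
loss = distillation_loss(
    student_logits=[0.0, 2.0], student_hidden=[0.1, 0.4],
    teacher_label=0, teacher_probs=[0.9, 0.1], teacher_hidden=[0.3, 0.2],
)
print(loss)
```

In a real training setup these terms would be computed on tensors with an autodiff framework; plain Python is used here only to keep the sketch self-contained.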

A recent study introduced Self-Elicited Knowledge Distillation (SEKD), a method for improving vision-language models (VLMs) on hierarchical understanding tasks. The same VLM, prompted to reason step by step, serves as a teacher for a single-pass student, which improves the student's ability to maintain cross-level state and hierarchical consistency without human labels or external tools.
