ChA-MAEViT: Unifying Channel-Aware Masked Autoencoders and Multi-Channel Vision Transformers for Improved Cross-Channel Learning

arXiv — cs.CV | Tuesday, October 28, 2025 at 4:00:00 AM
A recent paper introduces ChA-MAEViT, a novel approach that combines Channel-Aware Masked Autoencoders with Multi-Channel Vision Transformers to enhance cross-channel learning. The method addresses a limitation of traditional Masked Autoencoders, which typically assume redundancy across image channels. By recognizing that channels can instead carry complementary information, the approach aims to improve the efficiency and accuracy of reconstruction and representation learning in Multi-Channel Imaging scenarios.
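The channel-aware masking idea can be illustrated with a short sketch. This is hypothetical code, not the paper's implementation: instead of dropping the same patch positions in every channel (as a standard Masked Autoencoder would), each channel samples its own mask, so reconstructing a masked patch in one channel must draw on visible patches in the other channels.

```python
import numpy as np

def channelwise_mask(num_channels, num_patches, mask_ratio, rng):
    """Sample an independent patch mask per channel (illustrative sketch).

    A standard MAE masks the same patches in every channel; masking each
    channel independently forces reconstruction to exploit complementary,
    non-redundant cross-channel information.
    """
    num_masked = int(num_patches * mask_ratio)
    mask = np.zeros((num_channels, num_patches), dtype=bool)
    for c in range(num_channels):
        # Each channel gets its own random set of masked patch indices.
        idx = rng.choice(num_patches, size=num_masked, replace=False)
        mask[c, idx] = True
    return mask

rng = np.random.default_rng(0)
# e.g. a 5-channel image tokenized into a 14x14 = 196 patch grid
mask = channelwise_mask(num_channels=5, num_patches=196, mask_ratio=0.75, rng=rng)
```

With a 75% ratio, every channel hides 147 of its 196 patches, but the hidden sets differ across channels, which is what makes the cross-channel signal usable during pretraining.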
— via World Pulse Now AI Editorial System


Continue Reading
A Data-driven Typology of Vision Models from Integrated Representational Metrics
Neutral · Artificial Intelligence
A recent study presents a data-driven typology of vision models, utilizing integrated representational metrics to analyze the differences and similarities among various architectures such as ResNets, ViTs, and ConvNeXt. The research employs representational similarity metrics to assess family separability, revealing that geometry and tuning are key indicators of family-specific signatures in these models.
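A common representational similarity metric used in this kind of cross-architecture analysis is linear Centered Kernel Alignment (CKA). The sketch below is an illustration of the general technique, not the study's exact pipeline: it scores two activation matrices on a similarity scale from 0 to 1, invariant to rotations and rescalings of the feature space.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices.

    X: (n_samples, d1), Y: (n_samples, d2), rows aligned on the same
    inputs. Returns a value in [0, 1]; 1 means the representations are
    identical up to an orthogonal transform and isotropic scaling.
    """
    X = X - X.mean(axis=0)   # center features
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 32))          # activations from model A
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))
score = linear_cka(X, X @ Q)           # same representation, rotated
```

Because CKA ignores rotations of the feature axes, `score` here is 1.0 up to floating-point error; dissimilar model families would score well below that, which is the kind of "family separability" signal the study analyzes.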
On Memory: A comparison of memory mechanisms in world models
Neutral · Artificial Intelligence
Recent research has explored the limitations of transformer-based world models' memory mechanisms, particularly in planning over long horizons. The study introduces a taxonomy of memory augmentation mechanisms, organized around memory encoding and memory injection, and evaluates how effectively they improve recall in state recall tasks.
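The encode-then-inject pattern can be sketched minimally. The names here are hypothetical and not tied to any specific model in the study: past states are encoded as key-value pairs in an external buffer, and an attention read over that buffer is injected (added) into the current hidden state, giving the model access to events beyond its context window.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class ExternalMemory:
    """Minimal key-value memory for a world model (illustrative sketch).

    write() is the encoding step: each past state stores a (key, value)
    pair. inject() is the injection step: an attention read over the
    stored keys is added to the current hidden state.
    """
    def __init__(self, dim):
        self.dim = dim
        self.keys, self.values = [], []

    def write(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def read(self, query):
        K = np.stack(self.keys)                      # (n, dim)
        V = np.stack(self.values)                    # (n, dim)
        attn = softmax(K @ query / np.sqrt(self.dim))
        return attn @ V                              # weighted recall

    def inject(self, hidden, query):
        return hidden + self.read(query)

mem = ExternalMemory(dim=4)
key, value = np.ones(4), np.arange(4.0)
mem.write(key, value)
recalled = mem.inject(np.zeros(4), key)  # recall the stored state
```

Evaluating where to inject (which layer, which timestep) and what to encode is exactly the design space the paper's taxonomy organizes.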