Self Pre-training with Topology- and Spatiality-aware Masked Autoencoders for 3D Medical Image Segmentation

arXiv — cs.CV · Thursday, November 20, 2025 at 5:00:00 AM
  • The study presents a new self pre-training method that uses topology- and spatiality-aware masked autoencoders for 3D medical image segmentation.
  • This development is significant as it enhances the capabilities of Vision Transformers in medical image analysis, potentially leading to improved diagnostic tools and techniques in healthcare.
  • The research aligns with ongoing efforts to optimize Vision Transformers, highlighting the importance of geometric and spatial awareness in machine learning models for medical applications.
— via World Pulse Now AI Editorial System
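The masked-autoencoder pre-training the summary refers to can be sketched at a high level: a large random subset of image patches is hidden, and the model learns by reconstructing them from the visible remainder. The mask ratio and helper names below are illustrative assumptions, not the paper's actual pipeline.

```python
import random

def mask_patches(patch_ids, mask_ratio=0.75, seed=0):
    """Split patch indices into visible and masked sets for MAE pre-training."""
    rng = random.Random(seed)
    ids = patch_ids[:]
    rng.shuffle(ids)
    n_masked = int(len(ids) * mask_ratio)
    # Visible patches go to the encoder; masked ones are reconstruction targets.
    return sorted(ids[n_masked:]), sorted(ids[:n_masked])

visible, masked = mask_patches(list(range(64)))
print(len(visible), len(masked))  # 16 48
```

Topology- and spatiality-aware variants would bias this masking or the reconstruction loss toward geometric structure rather than sampling uniformly.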


Recommended Readings
Learning from the Right Patches: A Two-Stage Wavelet-Driven Masked Autoencoder for Histopathology Representation Learning
Positive · Artificial Intelligence
The paper presents a two-stage wavelet-driven masked autoencoder (WISE-MAE) framework designed for histopathology representation learning. It addresses the challenges of self-supervised learning in digital pathology by improving patch selection through a wavelet-informed strategy. This method enhances the model's ability to capture relevant tissue patterns, thereby aligning more closely with the diagnostic processes of pathologists.
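The wavelet-informed patch selection described above can be illustrated with a minimal sketch: score each candidate patch by the energy of its one-level Haar detail coefficients, then keep the highest-scoring patches. The Haar choice, threshold-free top-k rule, and function names are assumptions for illustration, not WISE-MAE's actual code.

```python
def haar_detail_energy(patch):
    """Sum of squared one-level Haar detail coefficients of a 2D patch."""
    energy = 0.0
    rows, cols = len(patch), len(patch[0])
    for r in range(rows):
        for c in range(0, cols - 1, 2):
            energy += ((patch[r][c] - patch[r][c + 1]) / 2.0) ** 2  # horizontal detail
    for c in range(cols):
        for r in range(0, rows - 1, 2):
            energy += ((patch[r][c] - patch[r + 1][c]) / 2.0) ** 2  # vertical detail
    return energy

def select_informative_patches(patches, k):
    """Return indices of the k patches with the highest detail energy."""
    ranked = sorted(range(len(patches)),
                    key=lambda i: haar_detail_energy(patches[i]), reverse=True)
    return ranked[:k]

flat = [[0.5] * 4 for _ in range(4)]                             # uniform region
textured = [[(r + c) % 2 for c in range(4)] for r in range(4)]   # high-frequency texture
print(select_informative_patches([flat, textured], 1))  # [1]
```

The intuition is that textured tissue regions carry high wavelet detail energy, so selection concentrates the autoencoder's budget on diagnostically relevant patches.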
One Latent Space to Rule All Degradations: Unifying Restoration Knowledge for Image Fusion
Positive · Artificial Intelligence
The article discusses the introduction of LURE, a Learning-driven Unified REpresentation model designed for infrared and visible image fusion. This model addresses the limitations of existing All-in-One Degradation-Aware Fusion Models (ADFMs) by creating a Unified Latent Feature Space (ULFS) that enhances image quality while reducing dependency on complex datasets. LURE aims to improve the performance of multi-modal image fusion by leveraging intrinsic relationships between different modalities.
From Low-Rank Features to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers
Positive · Artificial Intelligence
Feature-map knowledge distillation (KD) is effective for convolutional networks but often fails for Vision Transformers (ViTs). A two-view representation analysis reveals that final-layer representations in ViTs are globally low-rank, suggesting that a compact student model should suffice for feature alignment. However, a token-level Spectral Energy Pattern analysis shows that individual tokens distribute energy across many channels, indicating a mismatch in encoding.
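The token-level analysis mentioned above can be sketched simply: for each token, count how many channels are needed to cover most of its squared energy. A token that spreads energy across many channels resists compact low-rank alignment even when the layer as a whole is low-rank. The 90% threshold and names here are assumptions, not the paper's code.

```python
def channels_for_energy(token, frac=0.9):
    """Smallest number of channels covering `frac` of the token's squared energy."""
    energies = sorted((x * x for x in token), reverse=True)
    total = sum(energies)
    acc, count = 0.0, 0
    for e in energies:
        acc += e
        count += 1
        if acc >= frac * total:
            break
    return count

concentrated = [5.0, 0.1, 0.1, 0.1]   # energy dominated by one channel
spread = [1.0, 1.0, 1.0, 1.0]         # energy spread across all channels
print(channels_for_energy(concentrated), channels_for_energy(spread))  # 1 4
```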
Application of Graph Based Vision Transformers Architectures for Accurate Temperature Prediction in Fiber Specklegram Sensors
Positive · Artificial Intelligence
This study explores the application of transformer-based architectures for predicting temperature variations using Fiber Specklegram Sensors (FSS). The research highlights the challenges posed by the nonlinear nature of specklegram data and demonstrates that Vision Transformers (ViTs) achieved a Mean Absolute Error (MAE) of 1.15, outperforming traditional models like CNNs. The findings underscore the potential of advanced transformer models in enhancing environmental monitoring capabilities.
EBind: a practical approach to space binding
Positive · Artificial Intelligence
EBind is a novel approach to space binding that simplifies the process by utilizing a single encoder per modality and high-quality data. This method allows for the training of state-of-the-art models on a single GPU within hours, significantly reducing the time compared to traditional methods. EBind employs a dataset comprising 6.7 million automated multimodal quintuples, 1 million semi-automated triples, and 3.4 million captioned data items, demonstrating superior performance with a 1.8 billion parameter model.
Vision Transformers with Self-Distilled Registers
Positive · Artificial Intelligence
Vision Transformers (ViTs) have become the leading architecture for visual processing tasks, showcasing remarkable scalability with larger training datasets and model sizes. However, recent findings have revealed the presence of artifact tokens in ViTs that conflict with local semantics, negatively impacting performance in tasks requiring precise localization and structural coherence. This paper introduces register tokens to mitigate this issue, proposing Post Hoc Registers (PH-Reg) as an efficient self-distillation method to integrate these tokens into existing ViTs without the need for retraining.
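The register-token mechanism can be sketched minimally: extra learnable tokens are prepended to the patch sequence before the transformer and discarded at the output, giving attention a place to dump global information instead of corrupting patch tokens. Token counts and names below are illustrative, not PH-Reg's actual implementation.

```python
NUM_REGISTERS, DIM = 4, 8

# Learnable parameters in a real model; fixed placeholders in this sketch.
registers = [[0.0] * DIM for _ in range(NUM_REGISTERS)]

def with_registers(patch_tokens):
    """Prepend register tokens to the patch-token sequence."""
    return registers + patch_tokens

def strip_registers(tokens):
    """Drop register outputs, keeping only patch-token outputs."""
    return tokens[NUM_REGISTERS:]

patches = [[1.0] * DIM for _ in range(16)]   # e.g. a 4x4 patch grid
seq = with_registers(patches)                # the transformer would process this
print(len(seq), len(strip_registers(seq)))   # 20 16
```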
Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers
Positive · Artificial Intelligence
This study explores a novel approach to enhance vision transformers (ViTs) by pretraining them on procedurally-generated data that lacks visual or semantic content. Utilizing simple algorithms, the research aims to instill generic biases in ViTs, allowing them to internalize abstract computational priors. The findings indicate that this warm-up phase, followed by standard image-based training, significantly boosts data efficiency, convergence speed, and overall performance, with notable improvements observed on ImageNet-1k.
Region-Point Joint Representation for Effective Trajectory Similarity Learning
Positive · Artificial Intelligence
Recent advancements in learning-based methods have significantly reduced the computational complexity associated with traditional trajectory similarity computation. However, current state-of-the-art methods do not fully utilize the extensive range of trajectory information for effective similarity modeling. To address this issue, a novel method named RePo has been proposed. This method jointly encodes region-wise and point-wise features to effectively capture both spatial context and detailed moving patterns. The approach involves mapping GPS trajectories to grid sequences and utilizing lightweight…
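The grid-mapping step described above can be sketched as bucketing each GPS point into a cell of a regular lat/lon grid, turning a trajectory into a sequence of cell IDs. The grid origin, cell size, and repeat-merging rule are assumptions for illustration, not RePo's actual preprocessing.

```python
MIN_LAT, MIN_LON = 39.0, 116.0   # hypothetical map origin
CELL_DEG = 0.01                  # cell size in degrees
GRID_COLS = 100                  # cells per grid row

def to_cell(lat, lon):
    """Map a GPS point to a single integer grid-cell ID."""
    row = int((lat - MIN_LAT) / CELL_DEG)
    col = int((lon - MIN_LON) / CELL_DEG)
    return row * GRID_COLS + col

def trajectory_to_grid(points):
    """Convert a GPS trajectory to a grid sequence, merging consecutive repeats."""
    cells = []
    for lat, lon in points:
        cell = to_cell(lat, lon)
        if not cells or cells[-1] != cell:
            cells.append(cell)
    return cells

traj = [(39.005, 116.005), (39.006, 116.006), (39.015, 116.005)]
print(trajectory_to_grid(traj))  # [0, 100]
```

The resulting cell-ID sequence gives the region-wise view; the raw points retained alongside it supply the point-wise features.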