MambaEye: A Size-Agnostic Visual Encoder with Causal Sequential Processing

arXiv — cs.CV · Wednesday, November 26, 2025
  • MambaEye is a novel visual encoder that processes images in a size-agnostic, causally sequential manner. Built on a Mamba2 backbone, it introduces a relative move embedding that lets the model adapt to arbitrary image resolutions and scanning patterns, addressing a long-standing challenge in visual encoding.
  • The development of MambaEye is significant as it represents a step forward in creating a visual encoder that aligns more closely with human vision capabilities, potentially improving applications in computer vision and artificial intelligence.
  • The work reflects a broader effort in the AI community to improve model efficiency and interpretability, particularly at the intersection of State Space Models and Vision Transformers, where recent explainability frameworks and architectural innovations point toward more adaptable, efficient systems.
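To make the first bullet concrete, here is a minimal sketch of the size-agnostic idea: patches are visited in a causal scan order, each step adds an embedding keyed only by the *relative* move from the previous patch (not its absolute position), and a recurrent state is updated sequentially. All names, weights, and the scalar recurrence below are illustrative assumptions; the actual model uses a learned Mamba2 backbone, which this toy linear state update merely stands in for.

```python
import numpy as np

rng = np.random.default_rng(0)

PATCH = 4  # patch side length
D = 8      # embedding dimension

# Toy stand-ins for learned parameters (hypothetical; the paper's encoder
# learns these and uses a Mamba2 backbone rather than this linear update).
W_embed = rng.normal(size=(PATCH * PATCH, D)) * 0.1
W_move = rng.normal(size=(3, 3, D)) * 0.1  # one vector per (dy, dx) step direction
A, B = 0.9, 0.1                            # scalar "state space" parameters

def encode(image: np.ndarray) -> np.ndarray:
    """Causally encode an image whose sides are multiples of PATCH.

    Patches are visited in raster-scan order; each step adds a relative
    move embedding keyed by the sign of the (dy, dx) offset from the
    previous patch, so the encoder never sees absolute positions and the
    same weights apply to any resolution.
    """
    H, W = image.shape
    h = np.zeros(D)  # recurrent state
    prev = (0, 0)
    for r in range(0, H, PATCH):
        for c in range(0, W, PATCH):
            patch = image[r:r + PATCH, c:c + PATCH].reshape(-1)
            x = patch @ W_embed
            dy = int(np.sign(r // PATCH - prev[0])) + 1  # map {-1,0,1} -> {0,1,2}
            dx = int(np.sign(c // PATCH - prev[1])) + 1
            x = x + W_move[dy, dx]                       # relative move embedding
            h = A * h + B * x                            # causal sequential update
            prev = (r // PATCH, c // PATCH)
    return h

# The same encoder runs unchanged on different resolutions:
small = encode(rng.normal(size=(16, 16)))
large = encode(rng.normal(size=(32, 48)))
```

Because position enters only through relative moves, nothing in the loop depends on the image's overall size, which is the property the paper's size-agnostic claim rests on.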
— via World Pulse Now AI Editorial System
