CascadedViT: Cascaded Chunk-FeedForward and Cascaded Group Attention Vision Transformer

arXiv — cs.CV · Tuesday, November 25, 2025, 5:00 AM
  • The CascadedViT (CViT) architecture is a lightweight, compute-efficient Vision Transformer built around the Cascaded-Chunk Feed Forward Network (CCFFN), which improves parameter and FLOP efficiency while maintaining accuracy. On ImageNet-1K, the CViT-XL model achieves 75.5% Top-1 accuracy while reducing FLOPs by 15% and energy consumption by 3.3% compared to EfficientViT-M5 (a hedged sketch of the chunked feed-forward idea follows this list).
  • This development is significant as it addresses the high computational and energy demands of Vision Transformers, making them more viable for deployment on resource-constrained devices such as mobile phones and drones, thereby expanding their usability in real-world applications.
  • The advancements in CViT reflect a broader trend in AI towards optimizing model efficiency without compromising performance. As the demand for lightweight models grows, particularly for mobile and edge computing, innovations like feature-map knowledge distillation and data-free quantization are increasingly relevant, highlighting ongoing efforts to enhance the practicality of Vision Transformers in diverse applications.
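
The summary above does not spell out CCFFN's internals, but the name suggests splitting the feed-forward hidden dimension into chunks that are processed sequentially, each chunk's output cascading into the next, in the spirit of EfficientViT's cascaded group attention. The PyTorch sketch below is a guess at that structure; the class name CascadedChunkFFN, the num_chunks and expansion parameters, and the carry-forward wiring are illustrative assumptions, not the authors' code.

    import torch
    import torch.nn as nn

    class CascadedChunkFFN(nn.Module):
        """Hypothetical sketch of a cascaded-chunk feed-forward block.

        Assumption: channels are split into num_chunks groups, each run
        through a small two-layer MLP, with every chunk's output added to
        the next chunk's input (the cascade) before concatenation. This
        mirrors the cascaded-group-attention pattern but is not the
        authors' verified implementation.
        """

        def __init__(self, dim: int, num_chunks: int = 4, expansion: int = 2):
            super().__init__()
            assert dim % num_chunks == 0, "dim must divide evenly into chunks"
            self.num_chunks = num_chunks
            chunk_dim = dim // num_chunks
            hidden = chunk_dim * expansion
            # One small FFN per chunk; weights scale with dim / num_chunks,
            # which is where the parameter and FLOP savings would come from.
            self.ffns = nn.ModuleList(
                nn.Sequential(
                    nn.Linear(chunk_dim, hidden),
                    nn.GELU(),
                    nn.Linear(hidden, chunk_dim),
                )
                for _ in range(num_chunks)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, tokens, dim)
            chunks = x.chunk(self.num_chunks, dim=-1)
            outputs = []
            carry = 0  # cascaded residual carried from the previous chunk
            for chunk, ffn in zip(chunks, self.ffns):
                out = ffn(chunk + carry)
                outputs.append(out)
                carry = out
            return torch.cat(outputs, dim=-1)

Under these assumptions, with dim=256 and num_chunks=4 each per-chunk MLP is 64→128→64, so the linear weights total roughly a quarter of a monolithic 256→512→256 FFN, which is the kind of saving the reported FLOP reduction points at.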
— via World Pulse Now AI Editorial System

Continue Reading
Cross-modal Proxy Evolving for OOD Detection with Vision-Language Models
Positive · Artificial Intelligence
A new framework named CoEvo has been proposed for zero-shot out-of-distribution (OOD) detection in vision-language models, addressing the challenges posed by the absence of labeled negatives. CoEvo employs a bidirectional adaptation mechanism for both textual and visual proxies, dynamically refining them based on contextual information from test images. This innovation aims to enhance the reliability of OOD detection in open-world applications.
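
The teaser leaves CoEvo's bidirectional proxy adaptation unspecified, so the snippet below only sketches the zero-shot setup such methods build on: score each image by its maximum softmax similarity to class-text proxies from a CLIP-like model, treating a low score as a sign of an OOD input. The function name, temperature value, and tensor shapes are illustrative assumptions, and this is the standard maximum-softmax baseline rather than CoEvo itself.

    import torch
    import torch.nn.functional as F

    def max_proxy_score(image_feat: torch.Tensor,
                        text_proxies: torch.Tensor,
                        temperature: float = 0.01) -> torch.Tensor:
        """Baseline zero-shot OOD score in the setting CoEvo targets.

        image_feat:   (batch, d) image embeddings from a CLIP-like encoder
        text_proxies: (num_classes, d) class-text embeddings
        Returns one confidence per image; low values suggest OOD.
        """
        # Cosine similarity via L2 normalization, then temperature-scaled
        # softmax over classes; the max probability is the ID confidence.
        img = F.normalize(image_feat, dim=-1)
        txt = F.normalize(text_proxies, dim=-1)
        probs = (img @ txt.T / temperature).softmax(dim=-1)
        return probs.max(dim=-1).values

Per the summary, CoEvo's contribution is to keep refining both the textual and visual proxies from test-time context rather than leaving them fixed, as this baseline does.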
DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning
Positive · Artificial Intelligence
The introduction of the Diffusion-Guided Autoencoder (DGAE) marks a significant advancement in latent representation learning, enhancing the decoder's expressiveness and effectively addressing training instability associated with GANs. This model achieves state-of-the-art performance while utilizing a latent space that is twice as compact, thus improving efficiency in image and video generative tasks.
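
The summary only states that DGAE improves decoder expressiveness while avoiding GAN instability, so the following is a guess at the general recipe: an encoder compresses the image to a compact latent, and a conditional denoiser is trained with the standard DDPM noise-prediction loss given that latent, with no adversarial discriminator involved. The encoder and denoiser signatures, the linear noise schedule, and the function name are all assumptions, not the paper's objective.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def dgae_style_loss(encoder: nn.Module, denoiser: nn.Module,
                        x: torch.Tensor, num_steps: int = 1000) -> torch.Tensor:
        """Hedged training loss for a diffusion-guided autoencoder.

        The encoder maps x to a compact latent z; the denoiser predicts the
        injected noise from (noisy x, timestep, z). Minimizing this MSE is
        the usual DDPM objective, used here in place of an adversarial loss.
        """
        z = encoder(x)                                  # compact latent code
        t = torch.randint(0, num_steps, (x.size(0),), device=x.device)
        noise = torch.randn_like(x)
        # Illustrative linear schedule; real models use tuned schedules.
        alpha_bar = 1.0 - (t.float() + 1) / num_steps
        a = alpha_bar.view(-1, *([1] * (x.dim() - 1)))
        x_noisy = a.sqrt() * x + (1 - a).sqrt() * noise
        eps_pred = denoiser(x_noisy, t, z)              # conditioned on z
        return F.mse_loss(eps_pred, noise)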
