Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers

arXiv · cs.CV · Wednesday, November 19, 2025 at 5:00:00 AM
  • The research introduces a method for pretraining vision transformers (ViTs) using procedurally generated data rather than natural images, treating the abstract data as a warm-up stage before conventional training; a rough sketch of the idea appears after this summary.
  • This development is significant because it improves the efficiency and effectiveness of ViTs, which are increasingly used across AI applications, with enhanced performance reported on datasets such as ImageNet.
  • The study reflects ongoing efforts to optimize transformer architectures and highlights the importance of data efficiency in AI training. As the field evolves, understanding the balance between abstract, procedural pretraining and conventional image-based training remains crucial for future innovations in AI and machine learning.
— via World Pulse Now AI Editorial System
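The summary does not specify the procedural generator, so the sketch below is only a hypothetical illustration of the general idea: synthesize abstract, label-free imagery procedurally (here, random sinusoidal gratings, with a generator parameter serving as a proxy label) and use it as a warm-up corpus before any natural images are seen. The generator, labels, and shapes are assumptions, not the paper's actual setup.

```python
import numpy as np

def procedural_image(rng, size=224):
    """Synthesize an abstract image from random sinusoidal gratings.
    Illustrative generator only; not the generator used in the paper."""
    yy, xx = np.mgrid[0:size, 0:size] / size
    img = np.zeros((size, size, 3), dtype=np.float32)
    n_waves = int(rng.integers(2, 6))
    for _ in range(n_waves):
        freq = rng.uniform(2, 20)
        theta = rng.uniform(0, np.pi)
        phase = rng.uniform(0, 2 * np.pi)
        wave = np.sin(2 * np.pi * freq * (xx * np.cos(theta) + yy * np.sin(theta)) + phase)
        img += wave[..., None] * rng.uniform(0.2, 1.0, size=3)  # random per-channel weights
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)    # rescale to [0, 1]
    return img, n_waves - 2   # proxy "class" label derived from the generator parameters

rng = np.random.default_rng(0)
images, labels = zip(*[procedural_image(rng) for _ in range(8)])
batch = np.stack(images)      # (8, 224, 224, 3): a warm-up batch with no natural images
print(batch.shape, labels)
```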


Recommended Readings
Optimality and NP-Hardness of Transformers in Learning Markovian Dynamical Functions
Neutral · Artificial Intelligence
The study investigates the capabilities of transformer architectures in learning Markovian dynamical functions through in-context learning (ICL). It reveals that while transformers can solve unseen tasks based on input-output pairs, the optimization of parameters for a single-layer linear self-attention model is NP-hard. This indicates a significant limitation in representing structured dynamical functions, providing insights into the loss landscape and optimization behaviors of transformers.
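The summary mentions a single-layer linear self-attention model learning from in-context input-output pairs. The numpy sketch below shows one parameterization commonly used in ICL analyses; the exact model and prompt format in the paper may differ, and the weights here are random rather than optimized (finding optimal ones is the NP-hard part).

```python
import numpy as np

def linear_self_attention(Z, W_pv, W_kq):
    """Single-layer *linear* self-attention (no softmax), a form often used in
    theoretical ICL analyses: f(Z) = Z + W_pv @ Z @ (Z.T @ W_kq @ Z) / n,
    where Z has shape (d, n): d-dimensional tokens across n context positions."""
    n = Z.shape[1]
    return Z + W_pv @ Z @ (Z.T @ W_kq @ Z) / n

# Toy in-context regression prompt: columns hold (x_i, y_i) pairs plus a final
# query column (x_query, 0); the model is meant to fill in the missing label.
rng = np.random.default_rng(0)
d_x, n_pairs = 3, 16
w_true = rng.normal(size=d_x)
X = rng.normal(size=(d_x, n_pairs + 1))
y = w_true @ X
y[-1] = 0.0                                   # hide the query's label
Z = np.vstack([X, y[None, :]])                # shape (d_x + 1, n_pairs + 1)

d = d_x + 1
W_pv = rng.normal(scale=0.1, size=(d, d))     # untrained weights: optimizing
W_kq = rng.normal(scale=0.1, size=(d, d))     # them is the hard part
print("prediction at query position:", linear_self_attention(Z, W_pv, W_kq)[-1, -1])
```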
Near-Lossless Model Compression Enables Longer Context Inference in DNA Large Language Models
Positive · Artificial Intelligence
Recent advancements in DNA large language models (LLMs) have led to the introduction of FOCUS, a near-lossless model compression technique. This innovation addresses the challenges of high computational costs and memory requirements during autoregressive decoding, which have previously limited the effectiveness of LLMs in processing ultra-long genomic sequences. By integrating a progressive context-compression module, FOCUS enhances the ability of these models to retain distant information, thereby improving their performance in DNA sequence modeling.
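FOCUS's progressive context-compression module is not detailed in the summary; the sketch below only illustrates the general idea of keeping recent tokens exact while pooling distant context so memory grows sublinearly with genomic sequence length. The window size, pooling factor, and function name are placeholders.

```python
import numpy as np

def compress_context(kv_cache, recent_window=512, pool_factor=4):
    """Illustrative compression of a per-layer KV cache: keep the most recent
    tokens verbatim and average-pool older ones, trading a little fidelity on
    distant positions for a bounded memory footprint.
    kv_cache: array of shape (seq_len, d_model)."""
    seq_len, d_model = kv_cache.shape
    if seq_len <= recent_window:
        return kv_cache
    old, recent = kv_cache[:-recent_window], kv_cache[-recent_window:]
    usable = (len(old) // pool_factor) * pool_factor   # make old segment divisible
    pooled = old[:usable].reshape(-1, pool_factor, d_model).mean(axis=1)
    return np.concatenate([pooled, old[usable:], recent], axis=0)

cache = np.random.default_rng(0).normal(size=(4096, 64))
compressed = compress_context(cache)
print(cache.shape, "->", compressed.shape)   # (4096, 64) -> (1408, 64)
```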
Attention via Synaptic Plasticity is All You Need: A Biologically Inspired Spiking Neuromorphic Transformer
Positive · Artificial Intelligence
The article discusses a new approach to attention mechanisms in artificial intelligence, inspired by biological synaptic plasticity. This method aims to improve energy efficiency in spiking neural networks (SNNs) compared to traditional Transformers, which rely on dot-product similarity. The research highlights the limitations of current spiking attention models and proposes a biologically inspired spiking neuromorphic transformer that could reduce the carbon footprint associated with large language models (LLMs) like GPT.
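The summary contrasts dot-product similarity with plasticity-driven attention in spiking networks. As a rough, hypothetical illustration (not the paper's architecture), one can score query-key compatibility by accumulating a decaying Hebbian-style coincidence trace over binary spike trains instead of computing a real-valued dot product:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_tokens, d = 8, 6, 16          # time steps, tokens, feature dimension

# Binary spike trains standing in for query/key activations of each token.
q_spikes = (rng.random((T, n_tokens, d)) < 0.2).astype(np.float32)
k_spikes = (rng.random((T, n_tokens, d)) < 0.2).astype(np.float32)

# Hebbian-style eligibility trace: coincident spikes strengthen the
# (query, key) association, with exponential decay over time.
decay = 0.9
trace = np.zeros((n_tokens, n_tokens), dtype=np.float32)
for t in range(T):
    coincidence = q_spikes[t] @ k_spikes[t].T      # spike-count co-activation
    trace = decay * trace + coincidence

# The accumulated trace plays the role of attention scores; here it is simply
# row-normalized rather than passed through a softmax over float embeddings.
attn = trace / (trace.sum(axis=1, keepdims=True) + 1e-8)
values = rng.normal(size=(n_tokens, d)).astype(np.float32)
out = attn @ values
print(out.shape)                                    # (6, 16)
```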
GloTok: Global Perspective Tokenizer for Image Reconstruction and Generation
Positive · Artificial Intelligence
The Global Perspective Tokenizer (GloTok) introduces a novel approach to image reconstruction and generation by using global relational information to create a more uniform semantic distribution of tokenized features. This method addresses limitations of existing image tokenization techniques that rely on local semantic supervision. By implementing a codebook-wise histogram relation learning method, GloTok improves the quality of image outputs, demonstrating better generation performance than traditional methods.
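GloTok's codebook-wise histogram relation learning is only named in the summary, so the sketch below is an assumption-laden illustration of the ingredient it builds on: measuring a global usage histogram over the codebook and penalizing deviation from a uniform spread. The KL-to-uniform penalty is illustrative, not GloTok's actual objective.

```python
import numpy as np

def codebook_histogram(token_ids, codebook_size):
    """Global usage histogram over the codebook for a batch of tokenized images."""
    hist = np.bincount(token_ids.ravel(), minlength=codebook_size).astype(np.float64)
    return hist / hist.sum()

def uniformity_penalty(hist):
    """Illustrative penalty: KL divergence from the uniform distribution,
    encouraging a more even semantic spread of codebook usage."""
    k = len(hist)
    uniform = np.full(k, 1.0 / k)
    nz = hist > 0
    return float(np.sum(hist[nz] * np.log(hist[nz] / uniform[nz])))

rng = np.random.default_rng(0)
codebook_size = 1024
token_ids = rng.integers(0, 64, size=(8, 16, 16))   # skewed: only 64 codes used
hist = codebook_histogram(token_ids, codebook_size)
print("KL to uniform:", round(uniformity_penalty(hist), 3))
```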
Vision Transformers with Self-Distilled Registers
Positive · Artificial Intelligence
Vision Transformers (ViTs) have become the leading architecture for visual processing tasks, showcasing remarkable scalability with larger training datasets and model sizes. However, recent findings have revealed artifact tokens in ViTs that conflict with local semantics, hurting performance in tasks that require precise localization and structural coherence. This paper introduces register tokens to mitigate the issue, proposing Post Hoc Registers (PH-Reg), an efficient self-distillation method that integrates these tokens into existing ViTs without the need for retraining.
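Register tokens are extra learnable tokens appended to the patch sequence so that global or artifact-like information has somewhere to go other than the patch tokens. Below is a minimal shape-level sketch of assembling such an input sequence; PH-Reg's self-distillation procedure itself is not shown, and all dimensions are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_patches, d_model, n_registers = 2, 196, 384, 4

patch_tokens = rng.normal(size=(batch, n_patches, d_model))
cls_token = rng.normal(size=(1, 1, d_model))                   # learnable in practice
register_tokens = rng.normal(size=(1, n_registers, d_model))   # learnable in practice

# Sequence fed to the transformer: [CLS] + patches + registers.
# Registers can absorb global/artifact information and are dropped at readout.
seq = np.concatenate(
    [np.broadcast_to(cls_token, (batch, 1, d_model)),
     patch_tokens,
     np.broadcast_to(register_tokens, (batch, n_registers, d_model))],
    axis=1,
)
print(seq.shape)            # (2, 201, 384) = 1 CLS + 196 patches + 4 registers
```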
Benchmark on Drug Target Interaction Modeling from a Drug Structure Perspective
Positive · Artificial Intelligence
The article discusses advancements in predicting drug-target interactions, a critical aspect of drug discovery and design. Recent methods utilizing deep learning technologies, particularly graph neural networks (GNNs) and Transformers, have shown remarkable performance by effectively extracting structural information. However, the benchmarking of these methods varies significantly, affecting algorithmic progress. The authors conducted a comprehensive survey and benchmark to integrate various structure learning algorithms for improved modeling.
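As a rough illustration of the modeling setup such benchmarks compare, a drug-target interaction scorer typically pools a structure-derived drug representation and a sequence-derived target representation and scores their compatibility. The two-tower toy below uses random features and an untrained bilinear head; real methods replace these with trained GNN and Transformer encoders.

```python
import numpy as np

def mean_pool(x):
    """Average token-level features into a single vector."""
    return x.mean(axis=0)

def predict_interaction(drug_atom_feats, target_residue_feats, rng):
    """Toy two-tower DTI scorer: pool atom-level (graph-derived) drug features and
    residue-level (sequence-derived) target features, then score the pair with a
    bilinear form. Purely illustrative; not any benchmarked method."""
    d_drug = mean_pool(drug_atom_feats)         # stands in for a GNN readout
    d_target = mean_pool(target_residue_feats)  # stands in for a Transformer readout
    W = rng.normal(scale=0.1, size=(d_drug.shape[0], d_target.shape[0]))
    return float(d_drug @ W @ d_target)         # higher = stronger predicted binding

rng = np.random.default_rng(0)
drug = rng.normal(size=(24, 32))      # 24 atoms, 32-dim features each
target = rng.normal(size=(300, 64))   # 300 residues, 64-dim features each
print(predict_interaction(drug, target, rng))
```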
Attention Via Convolutional Nearest Neighbors
Positive · Artificial Intelligence
The article introduces Convolutional Nearest Neighbors (ConvNN), a framework that unifies Convolutional Neural Networks (CNNs) and Transformers by viewing convolution and self-attention as neighbor selection and aggregation methods. ConvNN allows for a systematic exploration of the spectrum between these two architectures, serving as a drop-in replacement for convolutional and attention layers. The framework's effectiveness is validated through classification tasks on CIFAR-10 and CIFAR-100 datasets.
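The "neighbor selection and aggregation" framing can be made concrete with a small sketch: convolution-like layers pick a fixed spatial window of neighbors, attention-like layers pick the most similar tokens by feature similarity, and both then aggregate. This illustrates the framing for a 1-D token sequence and is not the ConvNN implementation.

```python
import numpy as np

def aggregate(x, neighbor_idx):
    """Average each token's selected neighbors. x: (n, d); neighbor_idx: (n, k)."""
    return x[neighbor_idx].mean(axis=1)

def spatial_neighbors(n, k=3):
    """Convolution-style selection: a fixed window of nearby positions."""
    offsets = np.arange(k) - k // 2
    return np.clip(np.arange(n)[:, None] + offsets[None, :], 0, n - 1)

def similarity_neighbors(x, k=3):
    """Attention-style selection: the k most similar tokens by dot product."""
    scores = x @ x.T
    return np.argsort(-scores, axis=1)[:, :k]

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 8))                    # 10 tokens, 8-dim features
conv_like = aggregate(x, spatial_neighbors(len(x)))
attn_like = aggregate(x, similarity_neighbors(x))
print(conv_like.shape, attn_like.shape)         # (10, 8) (10, 8)
```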
Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions
Positive · Artificial Intelligence
The paper titled 'Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions' introduces a new method called FIxLIP for explaining the similarity outputs of vision-language models. This approach addresses limitations of existing saliency maps by utilizing the weighted Banzhaf interaction index from game theory, which enhances computational efficiency and flexibility. The study emphasizes the importance of understanding complex cross-modal interactions in language-image pre-training (LIP) models.
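The weighted Banzhaf interaction index attributes a set function's output to pairs of features by averaging the pair's mixed discrete difference over coalitions in which each remaining feature participates independently with probability p. The Monte Carlo sketch below estimates this for a generic toy game; the game definition and weighting are simplified relative to FIxLIP, which explains vision-language similarity scores.

```python
import numpy as np

def weighted_banzhaf_interaction(value_fn, n_features, i, j, p=0.5,
                                 n_samples=2000, seed=0):
    """Monte Carlo estimate of the weighted Banzhaf interaction index for the
    feature pair (i, j): every other feature joins the coalition independently
    with probability p, and we average the mixed discrete difference."""
    rng = np.random.default_rng(seed)
    others = [f for f in range(n_features) if f not in (i, j)]
    total = 0.0
    for _ in range(n_samples):
        mask = np.zeros(n_features, dtype=bool)
        mask[others] = rng.random(len(others)) < p
        with_i = mask.copy()
        with_i[i] = True
        with_j = mask.copy()
        with_j[j] = True
        with_ij = mask.copy()
        with_ij[[i, j]] = True
        total += (value_fn(with_ij) - value_fn(with_i)
                  - value_fn(with_j) + value_fn(mask))
    return total / n_samples

# Toy "game": coalition value is a quadratic form over active features, so the
# estimated interaction of a pair should recover twice its cross term in A.
rng = np.random.default_rng(1)
n = 6
A = rng.normal(size=(n, n))
A = (A + A.T) / 2

def value(mask):
    return float(mask @ A @ mask)

est = weighted_banzhaf_interaction(value, n, 0, 1)
print(round(est, 3), "vs 2*A[0,1] =", round(2 * A[0, 1], 3))
```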