Viper-F1: Fast and Fine-Grained Multimodal Understanding with Cross-Modal State-Space Modulation

arXiv — cs.CV · Wednesday, November 19, 2025 at 5:00:00 AM


Recommended Readings
Algebraformer: A Neural Approach to Linear Systems
Positive · Artificial Intelligence
Algebraformer is a recently proposed Transformer-based architecture that addresses the challenge of solving ill-conditioned linear systems. Traditional numerical methods often require extensive parameter tuning and domain expertise to ensure accuracy. Algebraformer is an end-to-end learned model that efficiently represents matrix and vector inputs and achieves scalable inference with O(n^2) memory complexity. This could significantly improve the reliability and stability of solutions to application-driven linear problems.
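For intuition, here is a minimal sketch of one plausible reading of that setup (not the authors' architecture; all module names and sizes are assumptions): each row of A, concatenated with the matching entry of b, becomes one token, and a small Transformer encoder predicts the corresponding entry of x. Self-attention over the n row-tokens is what gives the quoted O(n^2) memory footprint.

```python
# Hypothetical sketch: tokenize (row of A, entry of b), encode with a Transformer,
# and read one solution entry per token. Not the paper's implementation.
import torch
import torch.nn as nn

class TinyAlgebraSolver(nn.Module):
    def __init__(self, n: int, d_model: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(n + 1, d_model)           # row of A plus entry of b
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)                 # one solution entry per token

    def forward(self, A: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # A: (batch, n, n), b: (batch, n) -> x_hat: (batch, n)
        tokens = torch.cat([A, b.unsqueeze(-1)], dim=-1)
        h = self.encoder(self.embed(tokens))
        return self.head(h).squeeze(-1)

model = TinyAlgebraSolver(n=8)
A, b = torch.randn(2, 8, 8), torch.randn(2, 8)
x_hat = model(A, b)                                       # (2, 8)
```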
Blurred Encoding for Trajectory Representation Learning
Positive · Artificial Intelligence
The article presents a novel approach to trajectory representation learning (TRL) called BLUrred Encoding (BLUE). It addresses a limitation of existing TRL methods, which often lose fine-grained spatial-temporal detail by grouping GPS points into large segments. BLUE instead builds hierarchical patches of varying sizes, preserving detailed travel semantics while still capturing overall travel patterns. The model uses an encoder-decoder structure with a pyramid design to enhance the representation of trajectories.
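A toy sketch of the hierarchical-patch idea follows (the patch sizes and mean-pooling are illustrative assumptions, not the paper's encoder): pool the raw GPS points at several scales so that fine and coarse views of the same trajectory coexist.

```python
# Illustrative only: build multi-scale "blurred" views of a GPS trajectory by
# mean-pooling points at several assumed patch sizes.
import torch

def hierarchical_patches(traj: torch.Tensor, patch_sizes=(4, 16, 64)):
    # traj: (num_points, 2) of (lat, lon); returns one pooled sequence per scale.
    levels = []
    for p in patch_sizes:
        n = traj.shape[0] // p * p                  # drop the ragged tail for simplicity
        pooled = traj[:n].reshape(-1, p, 2).mean(dim=1)
        levels.append(pooled)                        # (num_points // p, 2)
    return levels

traj = torch.randn(256, 2)
for level in hierarchical_patches(traj):
    print(level.shape)                               # progressively coarser views
```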
MoM: Linear Sequence Modeling with Mixture-of-Memories
Positive · Artificial Intelligence
The paper titled 'MoM: Linear Sequence Modeling with Mixture-of-Memories' introduces a new architecture designed to enhance linear sequence modeling methods. Traditional approaches often compress input sequences into a single fixed-size memory state, which can hinder performance in recall-intensive tasks. The Mixture-of-Memories (MoM) architecture addresses this by utilizing multiple independent memory states, improving memory capacity and reducing interference. This framework can be integrated with various memory update mechanisms, leading to superior performance in recall tasks.
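As a rough illustration of the mixture-of-memories idea (not the paper's implementation; the router and dimensions below are assumptions), several independent linear-attention-style memory matrices can be updated in parallel, with a per-token router weighting how strongly each token writes to and reads from each memory.

```python
# Toy sketch of multiple independent memory states with a learned router.
import torch
import torch.nn as nn

class ToyMoM(nn.Module):
    def __init__(self, d: int = 32, n_mem: int = 4):
        super().__init__()
        self.qkv = nn.Linear(d, 3 * d)
        self.router = nn.Linear(d, n_mem)
        self.n_mem, self.d = n_mem, d

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, d); mem: (n_mem, d, d) independent memory matrices
        mem = x.new_zeros(self.n_mem, self.d, self.d)
        outs = []
        for t in range(x.shape[0]):
            q, k, v = self.qkv(x[t]).chunk(3)
            w = torch.softmax(self.router(x[t]), dim=-1)          # (n_mem,) routing weights
            mem = mem + w[:, None, None] * torch.outer(k, v)      # weighted writes
            outs.append((w[:, None, None] * mem).sum(0).T @ q)    # weighted read
        return torch.stack(outs)

y = ToyMoM()(torch.randn(16, 32))   # (16, 32)
```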
MAT-MPNN: A Mobility-Aware Transformer-MPNN Model for Dynamic Spatiotemporal Prediction of HIV Diagnoses in California, Florida, and New England
Positive · Artificial Intelligence
The study introduces the Mobility-Aware Transformer-Message Passing Neural Network (MAT-MPNN) model, designed to enhance the prediction of HIV diagnosis rates across California, Florida, and New England. This model addresses the limitations of traditional Message Passing Neural Networks, which rely on fixed binary adjacency matrices that fail to capture interactions between non-contiguous regions. By integrating a Transformer encoder for temporal features and a Mobility Graph Generator for spatial relationships, MAT-MPNN aims to improve forecasting accuracy in HIV diagnoses.
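A heavily simplified sketch of that general recipe follows (regions-by-time features, a temporal Transformer, and one message-passing step over a learned dense adjacency). The module names, sizes, and the way the mobility graph is produced are assumptions, not the paper's architecture.

```python
# Hypothetical sketch: temporal encoding per region, then message passing over a
# dense, non-binary "mobility" adjacency instead of a fixed binary one.
import torch
import torch.nn as nn

class ToyMobilityMPNN(nn.Module):
    def __init__(self, d: int = 32, n_heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=1)
        self.score = nn.Linear(d, d)          # produces a dense mobility adjacency
        self.msg = nn.Linear(d, d)
        self.out = nn.Linear(d, 1)            # per-region rate prediction

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (regions, time_steps, d) of temporal features per region
        h = self.temporal(x)[:, -1]                           # (regions, d)
        adj = torch.softmax(h @ self.score(h).T, dim=-1)      # dense, non-binary graph
        h = h + adj @ self.msg(h)                             # one message-passing step
        return self.out(h).squeeze(-1)                        # (regions,)

pred = ToyMobilityMPNN()(torch.randn(10, 12, 32))
```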
Unlocking the Forgery Detection Potential of Vanilla MLLMs: A Novel Training-Free Pipeline
Positive · Artificial Intelligence
The article discusses a novel training-free pipeline called Foresee, designed for image forgery detection using vanilla multimodal large language models (MLLMs). As artificial intelligence-generated content technologies advance, traditional image forgery detection methods struggle with generalization and interpretability. Foresee aims to address these challenges by enabling lightweight inference without additional training, showcasing the inherent potential of MLLMs in image forgery analysis.
AdaTok: Adaptive Token Compression with Object-Aware Representations for Efficient Multimodal LLMs
Positive · Artificial Intelligence
AdaTok introduces an object-level token merging strategy for Adaptive Token compression, aimed at improving the efficiency of Multimodal Large Language Models (MLLMs). Traditional patch-level tokenization incurs excessive computational and memory demands and aligns poorly with how humans perceive scenes at the object level. The proposed method reduces token usage to about 10% while retaining nearly 96% of the original model's performance, addressing key challenges in multimodal understanding and reasoning.
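A minimal sketch of object-level merging follows, as one plausible reading of the mechanism (the segmentation source and pooling rule are assumptions): average the visual patch tokens that fall inside each object mask, so the language model sees one token per object rather than one per patch.

```python
# Illustrative sketch of object-level token merging (not AdaTok's actual code).
import torch

def merge_tokens_by_object(patch_tokens: torch.Tensor, object_ids: torch.Tensor) -> torch.Tensor:
    # patch_tokens: (num_patches, d); object_ids: (num_patches,) integer mask labels
    merged = []
    for obj in object_ids.unique():
        merged.append(patch_tokens[object_ids == obj].mean(dim=0))
    return torch.stack(merged)                    # (num_objects, d) << (num_patches, d)

tokens = torch.randn(576, 1024)                   # e.g. a 24x24 patch grid
ids = torch.randint(0, 20, (576,))                # hypothetical object segmentation
print(merge_tokens_by_object(tokens, ids).shape)  # ~20 tokens instead of 576
```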
Region-Wise Correspondence Prediction between Manga Line Art Images
Positive · Artificial Intelligence
Understanding region-wise correspondences between manga line art images is essential for advanced manga processing, aiding tasks like line art colorization and in-between frame generation. This study introduces a novel task of predicting these correspondences without annotations. A Transformer-based framework is proposed, trained on large-scale, automatically generated region correspondences, which enhances feature alignment across images by suppressing noise and reinforcing structural relationships.
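As a small sketch of what region-wise matching can look like in practice (the per-region features and the nearest-neighbour matching rule are assumptions, not the paper's method): compare region features from the two line-art images and pick the most similar counterpart.

```python
# Illustrative region matching by cosine similarity between region features.
import torch
import torch.nn.functional as F

def match_regions(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    # feat_a: (num_regions_a, d), feat_b: (num_regions_b, d)
    sim = F.normalize(feat_a, dim=-1) @ F.normalize(feat_b, dim=-1).T
    return sim.argmax(dim=-1)                     # index of the best region in image B

pairs = match_regions(torch.randn(30, 256), torch.randn(28, 256))
```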
FreeSwim: Revisiting Sliding-Window Attention Mechanisms for Training-Free Ultra-High-Resolution Video Generation
Positive · Artificial Intelligence
The paper titled 'FreeSwim: Revisiting Sliding-Window Attention Mechanisms for Training-Free Ultra-High-Resolution Video Generation' addresses the quadratic time and memory complexity of attention in Transformer-based video generators, which makes end-to-end training for ultra-high-resolution videos costly. The authors propose a training-free method that uses video Diffusion Transformers pretrained at their native scale to generate higher-resolution videos without additional training. Central to this approach is an inward sliding-window attention…
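The summary is cut off above, but the basic ingredient is local attention. Below is a hedged sketch of plain sliding-window attention over a 1-D token sequence; the window size and layout are assumptions, and the paper's "inward" variant and its video-specific details are not reproduced here.

```python
# Generic sliding-window attention: each query may only attend to keys within a
# local window, which bounds the attention cost compared with full attention.
import torch
import torch.nn.functional as F

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where a query may attend to a key (keys within +/- window//2 positions)
    idx = torch.arange(seq_len)
    return (idx[:, None] - idx[None, :]).abs() <= window // 2

q = k = v = torch.randn(1, 4, 128, 32)            # (batch, heads, tokens, head_dim)
mask = local_attention_mask(128, window=16)       # broadcasts over batch and heads
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```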