TRANSPORTER: Transferring Visual Semantics from VLM Manifolds

arXiv — cs.CV · Tuesday, November 25, 2025 at 5:00:00 AM
  • The paper introduces TRANSPORTER, a model-independent approach that enhances video generation by transferring visual semantics from Vision Language Models (VLMs). The method targets the challenge of understanding how VLMs arrive at their predictions, particularly in complex scenes with many objects and actions, and generates videos that reflect changes in captions across diverse attributes and contexts.
  • The approach is significant because it leverages the high visual fidelity of text-to-video (T2V) models to generate videos that align closely with the semantic embeddings of VLMs; a rough illustrative sketch of this idea follows the summary. By making VLM predictions more interpretable, TRANSPORTER could strengthen applications in video understanding and generation, both of which are increasingly relevant to AI-driven content creation.
  • TRANSPORTER also aligns with ongoing efforts to improve spatial reasoning and object-interaction capabilities in VLMs, addressing known limitations in 3D understanding and fine-grained reasoning. The integration of diverse datasets and models such as TRANSPORTER reflects a broader trend toward more robust and accurate interpretation of complex visual information by AI systems.
— via World Pulse Now AI Editorial System
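The summary above does not specify TRANSPORTER's architecture or training objective, so the following Python sketch only illustrates the general idea it describes: encoding two captions with a VLM text encoder, traversing the embedding manifold between them, and conditioning a text-to-video generator on each point. All class names, the slerp schedule, and the dummy modules are assumptions for illustration, not the paper's method.

```python
# Illustrative sketch only: traversing a VLM text-embedding manifold and feeding
# the interpolated embeddings to a (placeholder) text-to-video generator.
# The encoder, generator, and slerp schedule are assumptions, not TRANSPORTER's method.
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    """Spherical interpolation between two embedding vectors."""
    a_n, b_n = a / a.norm(), b / b.norm()
    omega = torch.acos(torch.clamp(a_n @ b_n, -1.0, 1.0))
    if omega.abs() < 1e-6:          # nearly identical directions
        return (1 - t) * a + t * b
    return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

class DummyVLMTextEncoder(torch.nn.Module):
    """Stand-in for a real VLM text encoder (e.g., CLIP); returns a fixed-size embedding."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.dim = dim
    def forward(self, caption: str) -> torch.Tensor:
        g = torch.Generator().manual_seed(hash(caption) % (2**31))
        return torch.randn(self.dim, generator=g)

class DummyT2VGenerator(torch.nn.Module):
    """Stand-in for a text-to-video model conditioned on an embedding."""
    def forward(self, cond: torch.Tensor, frames: int = 8) -> torch.Tensor:
        # Produce a (frames, 3, 64, 64) tensor as a placeholder "video".
        g = torch.Generator().manual_seed(int(cond.sum().abs().item() * 1e3) % (2**31))
        return torch.rand(frames, 3, 64, 64, generator=g)

encoder, generator = DummyVLMTextEncoder(), DummyT2VGenerator()
e_src = encoder("a red car driving on a wet road")
e_tgt = encoder("a blue car driving on a snowy road")

# Sample points along the manifold path and render one clip per point.
for t in (0.0, 0.5, 1.0):
    video = generator(slerp(e_src, e_tgt, t))
    print(f"t={t:.1f} -> video tensor {tuple(video.shape)}")
```

In a real pipeline the dummy modules would be replaced by an actual VLM encoder and a pretrained T2V model; the point of the sketch is only that a path through the VLM embedding space yields a family of related conditioning signals, matching the summary's claim that generated videos track caption changes.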


Continue Reading
Frame-wise Conditioning Adaptation for Fine-Tuning Diffusion Models in Text-to-Video Prediction
Positive · Artificial Intelligence
A new method called Frame-wise Conditioning Adaptation (FCA) has been proposed to enhance text-to-video prediction (TVP) by improving the continuity of generated video frames based on initial frames and descriptive text. This approach addresses limitations in existing models that often rely on text-to-image pre-training, which can lead to disjointed video outputs.
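The blurb gives only the high-level idea of frame-wise conditioning, so the sketch below shows one generic way per-frame conditioning vectors could blend initial-frame features with text features. The module name and the linear blending schedule are assumptions for illustration, not FCA's published design.

```python
# Rough sketch of frame-wise conditioning; the weighting scheme and module names
# are illustrative assumptions, not the FCA method.
import torch
import torch.nn as nn

class FrameWiseConditioner(nn.Module):
    """Produces a per-frame conditioning vector that shifts from the initial-frame
    features toward the text features as the frame index grows."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, init_frame_feat: torch.Tensor, text_feat: torch.Tensor,
                num_frames: int) -> torch.Tensor:
        conds = []
        for i in range(num_frames):
            alpha = i / max(num_frames - 1, 1)        # 0 at the first frame, 1 at the last
            mixed = (1 - alpha) * init_frame_feat + alpha * text_feat
            conds.append(self.proj(torch.cat([mixed, text_feat], dim=-1)))
        return torch.stack(conds, dim=0)              # (num_frames, dim)

dim = 256
conditioner = FrameWiseConditioner(dim)
init_feat = torch.randn(dim)      # features of the given first frame (placeholder)
text_feat = torch.randn(dim)      # text-prompt features (placeholder)
cond = conditioner(init_feat, text_feat, num_frames=16)
print(cond.shape)  # torch.Size([16, 256])
```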
Systematic Reward Gap Optimization for Mitigating VLM Hallucinations
Positive · Artificial Intelligence
A novel framework called Topic-level Preference Rewriting (TPR) has been introduced to systematically optimize reward gaps in Vision Language Models (VLMs), addressing the challenges of hallucinations during data curation. This method focuses on selectively replacing semantic topics within VLM responses to enhance the accuracy of generated outputs.
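As a toy illustration of topic-level rewriting, the snippet below replaces a single hallucinated topic in a response to build a chosen/rejected preference pair. The data format and replacement rule are assumptions for illustration, not the TPR pipeline.

```python
# Toy sketch of building a topic-level preference pair; the format and rewrite
# rule here are illustrative assumptions, not the TPR method.

def rewrite_topic(response: str, topic: str, grounded_text: str) -> str:
    """Replace the sentence that mentions `topic` with a grounded rewrite."""
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    rewritten = [grounded_text if topic in s else s for s in sentences]
    return ". ".join(rewritten) + "."

hallucinated = "A man rides a bike. A dog runs beside him. The sky is purple."
# Suppose the image shows no dog; rewrite only that topic to get the "chosen" response.
chosen = rewrite_topic(hallucinated, topic="dog", grounded_text="No animal is visible")
preference_pair = {"prompt": "Describe the image.",
                   "chosen": chosen,
                   "rejected": hallucinated}
print(preference_pair["chosen"])
# A man rides a bike. No animal is visible. The sky is purple.
```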
IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding
Neutral · Artificial Intelligence
A novel vulnerability in vision-language models (VLMs) has been identified through the introduction of IAG, a method that enables multi-target backdoor attacks on VLM-based visual grounding systems. This technique utilizes dynamically generated, input-aware triggers that are text-guided, allowing for imperceptible manipulation of visual inputs while maintaining normal performance on benign samples.
Now You See It, Now You Don't - Instant Concept Erasure for Safe Text-to-Image and Video Generation
Positive · Artificial Intelligence
Researchers have introduced Instant Concept Erasure (ICE), a novel approach for robust concept removal in text-to-image (T2I) and text-to-video (T2V) models. This method eliminates the need for costly retraining and minimizes inference overhead while addressing vulnerabilities to adversarial attacks. ICE employs a training-free, one-shot weight modification technique that ensures precise and persistent unlearning without collateral damage to surrounding content.
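ICE's exact weight edit is not detailed in this summary. A common, generic way to remove a concept without retraining is to project the concept's embedding direction out of a linear layer's weights, as in the hedged sketch below; this is offered only as an illustration of one-shot weight modification, not as ICE's actual procedure.

```python
# Minimal sketch of training-free concept erasure by projecting a concept
# direction out of a linear layer's weights. This is a generic illustration,
# not the ICE algorithm.
import torch

def erase_concept(weight: torch.Tensor, concept_dir: torch.Tensor) -> torch.Tensor:
    """Remove the component of the mapping along the concept direction.
    weight: (out_dim, in_dim); concept_dir: (in_dim,)."""
    c = concept_dir / concept_dir.norm()
    # W_new = W (I - c c^T): inputs along `c` no longer influence the output.
    return weight - (weight @ c).unsqueeze(1) * c.unsqueeze(0)

torch.manual_seed(0)
W = torch.randn(8, 16)                 # e.g., a cross-attention key projection
concept = torch.randn(16)              # embedding direction of the unwanted concept
W_edited = erase_concept(W, concept)

x = concept / concept.norm()
print((W @ x).norm().item(), (W_edited @ x).norm().item())  # second value is ~0
```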
Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions
Neutral · Artificial Intelligence
Recent research indicates that Vision Language Models (VLMs) often exhibit biases learned during training, particularly when tasked with specific queries about visual properties, such as counting objects in images. A new synthetic benchmark dataset and evaluation framework have been developed to assess how counting performance varies with different image and prompt characteristics.
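As a rough idea of what a synthetic counting benchmark can look like, the snippet below renders simple shapes and pairs each image with a counting prompt and a ground-truth answer. The image size, shapes, and prompt template are arbitrary choices, not the paper's dataset specification.

```python
# Sketch of generating a tiny synthetic counting benchmark (sizes, shapes, and
# prompt templates are arbitrary illustrative choices).
import random
from PIL import Image, ImageDraw

def make_sample(num_objects: int, size: int = 224):
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    for _ in range(num_objects):
        x, y, r = random.randint(20, size - 20), random.randint(20, size - 20), 12
        draw.ellipse([x - r, y - r, x + r, y + r], fill="red")
    prompt = "How many red circles are in the image?"
    return img, prompt, num_objects

random.seed(0)
benchmark = [make_sample(n) for n in range(1, 6) for _ in range(10)]
print(len(benchmark), benchmark[0][1], "answer:", benchmark[0][2])
```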
Spotlight: Identifying and Localizing Video Generation Errors Using VLMs
Positive · Artificial Intelligence
A new task named Spotlight has been introduced to identify and localize video generation errors in text-to-video models (T2V), which can produce high-quality videos but still exhibit nuanced errors. The research generated 600 videos using diverse prompts and three advanced video generators, annotating over 1600 specific errors across various categories such as motion and physics.
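The summary does not describe Spotlight's annotation schema, so the record below is only a plausible shape for a localized video-generation error label; the field names are hypothetical and the category list is drawn from the error types mentioned above.

```python
# Illustrative annotation record for a localized video-generation error; field
# names and categories are assumptions, not Spotlight's released schema.
from dataclasses import dataclass, asdict

ERROR_CATEGORIES = ["motion", "physics", "appearance", "prompt_adherence"]

@dataclass
class VideoErrorAnnotation:
    video_id: str
    category: str          # one of ERROR_CATEGORIES
    start_frame: int       # temporal localization
    end_frame: int
    bbox: tuple            # (x0, y0, x1, y1) spatial localization in pixel coords
    description: str

ann = VideoErrorAnnotation(
    video_id="gen_0042",
    category="physics",
    start_frame=35,
    end_frame=52,
    bbox=(120, 80, 260, 210),
    description="The cup passes through the table instead of resting on it.",
)
print(asdict(ann))
```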
Zero-Reference Joint Low-Light Enhancement and Deblurring via Visual Autoregressive Modeling with VLM-Derived Modulation
Positive · Artificial Intelligence
A new generative framework has been proposed for enhancing low-light images and reducing blur, utilizing visual autoregressive modeling guided by perceptual priors from vision-language models. This approach addresses significant challenges in restoring dark images, which often suffer from low visibility, contrast, noise, and blur.
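How the VLM-derived priors steer the restoration model is not detailed here. The sketch below shows one generic option in which scalar scores, assumed to come from a VLM's perceptual assessment of the input, scale and shift intermediate features in a small enhancement network; every name and design choice in it is an assumption for illustration.

```python
# Conceptual sketch: modulating an enhancement network with scalars assumed to be
# derived from a VLM's perceptual assessment. Not the paper's architecture.
import torch
import torch.nn as nn

class ModulatedEnhancer(nn.Module):
    def __init__(self, channels: int = 16):
        super().__init__()
        self.body = nn.Conv2d(3, channels, 3, padding=1)
        self.to_gamma_beta = nn.Linear(2, 2 * channels)   # from (brightness, blur) scores
        self.head = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, img: torch.Tensor, vlm_scores: torch.Tensor) -> torch.Tensor:
        feat = torch.relu(self.body(img))                          # (B, C, H, W)
        gamma, beta = self.to_gamma_beta(vlm_scores).chunk(2, dim=-1)
        feat = feat * (1 + gamma[:, :, None, None]) + beta[:, :, None, None]
        return torch.sigmoid(self.head(feat))                      # enhanced image in [0, 1]

model = ModulatedEnhancer()
dark_img = torch.rand(1, 3, 64, 64) * 0.2          # simulated low-light input
scores = torch.tensor([[0.15, 0.6]])               # e.g., VLM-rated brightness and blur
out = model(dark_img, scores)
print(out.shape)  # torch.Size([1, 3, 64, 64])
```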
DiffSeg30k: A Multi-Turn Diffusion Editing Benchmark for Localized AIGC Detection
Positive · Artificial Intelligence
The introduction of DiffSeg30k marks a significant advancement in the detection of AI-generated content (AIGC) by providing a dataset of 30,000 diffusion-edited images with pixel-level annotations. This dataset allows for fine-grained detection of localized edits, addressing a gap in existing benchmarks that typically assess entire images without considering localized modifications.
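The released DiffSeg30k file format is not described in this summary; the snippet below only sketches the kind of image and pixel-level mask pair that localized-edit detection implies, with fabricated arrays standing in for real files.

```python
# Hypothetical sample layout for localized-edit detection; the mask encoding is
# an illustrative guess, not the released DiffSeg30k format.
import numpy as np

def load_sample():
    # A real loader would read an edited image and its pixel-level edit mask;
    # here small arrays are fabricated just to show the expected shapes.
    image = np.random.randint(0, 256, size=(256, 256, 3), dtype=np.uint8)
    mask = np.zeros((256, 256), dtype=np.uint8)
    mask[64:128, 64:160] = 1          # 1 marks diffusion-edited pixels
    return image, mask

image, mask = load_sample()
edited_fraction = mask.mean()
print(f"edited pixels: {100 * edited_fraction:.1f}% of the image")
```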