Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction

arXiv — cs.CV · Friday, December 5, 2025 at 5:00:00 AM
  • A new image-captioning method named TOMCap has been introduced that trains on text alone, with no need for aligned image-caption pairs. It combines a pre-trained language-model decoder with CLIP representations, using retrieval augmentation to supply related text during generation while correcting for the modality gap between CLIP's text and image embedding distributions (a hedged sketch of the gap-correction idea appears after this summary).
  • TOMCap is significant because it reduces reliance on curated paired datasets, potentially democratizing access to effective image captioning and enabling broader applications in fields such as accessibility and content creation.
  • The work reflects a growing trend in artificial intelligence toward leveraging pre-trained models and innovative training regimes, echoed in related work on semantic segmentation, continual learning, and adversarial robustness, and underscoring the ongoing convergence of vision and language technologies.
— via World Pulse Now AI Editorial System
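
One common way to make text-only captioning work is to train the decoder on the captions' CLIP text embeddings perturbed with noise, so that at inference the CLIP image embedding (which sits at an offset from the text distribution, the so-called modality gap) can be substituted in. The PyTorch sketch below illustrates that general idea; the noise scale, prefix mapping, and all names are illustrative assumptions, not TOMCap's published procedure.

```python
import torch
import torch.nn as nn

class TextOnlyCaptionPrefix(nn.Module):
    """Maps a CLIP embedding to prefix vectors for a language-model decoder.

    Trained on captions only: the caption's CLIP *text* embedding, perturbed
    with Gaussian noise to bridge the modality gap, stands in for the image
    embedding that will be used at inference time.
    """

    def __init__(self, clip_dim=512, lm_dim=768, prefix_len=10, noise_std=0.1):
        super().__init__()
        self.noise_std = noise_std
        self.prefix_len = prefix_len
        self.proj = nn.Linear(clip_dim, lm_dim * prefix_len)

    def forward(self, clip_emb):
        emb = nn.functional.normalize(clip_emb, dim=-1)
        if self.training:
            # Noise injection: makes the decoder tolerant to the offset
            # between CLIP's text and image embedding distributions.
            emb = emb + self.noise_std * torch.randn_like(emb)
            emb = nn.functional.normalize(emb, dim=-1)
        prefix = self.proj(emb)
        return prefix.view(emb.size(0), self.prefix_len, -1)
```

At training time `clip_emb` holds CLIP text embeddings of the captions; at inference it holds CLIP image embeddings, and the frozen language model decodes conditioned on the projected prefix.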


Continue Reading
CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding
Positive · Artificial Intelligence
A new framework called CAPE has been introduced to enhance Embodied Reference Understanding, which involves predicting the object a person refers to through pointing gestures and language. This approach utilizes a dual-model framework that learns from both head-to-fingertip and wrist-to-fingertip directions, employing a Gaussian ray heatmap representation to improve the model's attention to pointing cues.
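A Gaussian ray heatmap can be pictured as a soft line painted on the image along the pointing direction, with weight decaying as a Gaussian of the perpendicular distance from the ray. The NumPy sketch below is a minimal illustration of that representation; the decay scale `sigma` and the example coordinates are assumptions, not CAPE's settings.

```python
import numpy as np

def gaussian_ray_heatmap(h, w, origin, direction, sigma=8.0):
    """Heatmap that is high near the ray origin + t*direction (t >= 0),
    falling off as a Gaussian of the perpendicular distance to the ray."""
    d = np.asarray(direction, dtype=float)
    d /= np.linalg.norm(d)
    ys, xs = np.mgrid[0:h, 0:w]
    rel = np.stack([xs - origin[0], ys - origin[1]], axis=-1)
    t = rel @ d                       # projection onto the ray direction
    t = np.clip(t, 0, None)           # keep only the forward half-line
    perp = rel - t[..., None] * d     # perpendicular offset from the ray
    dist2 = (perp ** 2).sum(-1)
    return np.exp(-dist2 / (2 * sigma ** 2))

# e.g. a ray from the wrist at (40, 60) toward the fingertip direction (1, 0.25)
hm = gaussian_ray_heatmap(128, 128, origin=(40, 60), direction=(1.0, 0.25))
```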
VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
Positive · Artificial Intelligence
VL-JEPA has been introduced as a vision-language model utilizing a Joint Embedding Predictive Architecture (JEPA), which predicts continuous embeddings of target texts rather than generating tokens autoregressively. This model demonstrates improved performance with 50% fewer trainable parameters compared to traditional token-space models, highlighting its efficiency in processing vision-language tasks.
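Predicting continuous embeddings rather than tokens turns caption generation into regression onto a frozen text encoder's output space, trained with a similarity loss instead of token-level cross-entropy. Below is a minimal sketch of such an objective; the pooling, head sizes, and cosine loss are assumptions rather than VL-JEPA's published design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingPredictor(nn.Module):
    """Predicts the (frozen) text encoder's embedding of the target caption
    from visual features, instead of decoding tokens autoregressively."""

    def __init__(self, vis_dim=1024, txt_dim=768):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(vis_dim, txt_dim), nn.GELU(), nn.Linear(txt_dim, txt_dim)
        )

    def forward(self, vis_feats):          # vis_feats: (B, num_patches, vis_dim)
        return self.head(vis_feats.mean(dim=1))

def jepa_loss(pred, target_emb):
    # Negative cosine similarity to the frozen target-encoder output;
    # detach() applies the stop-gradient to the target branch.
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target_emb.detach(), dim=-1)
    return 1.0 - (pred * target).sum(-1).mean()
```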
Zero-shot Adaptation of Stable Diffusion via Plug-in Hierarchical Degradation Representation for Real-World Super-Resolution
Positive · Artificial Intelligence
A new approach called HD-CLIP has been proposed for Real-World Image Super-Resolution (Real-ISR), which aims to enhance the recovery of high-quality images from low-quality inputs affected by complex real-world degradations. This method decomposes low-quality images into semantic and ordinal degradation embeddings, allowing for better guidance in diffusion models.
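The decomposition can be thought of as two heads on top of a CLIP image embedding: one recovering "what is in the scene" and one estimating an ordered "how degraded" severity, both of which can condition a diffusion prior. The toy sketch below shows one plausible realization; every module name and dimension here is an assumption, not HD-CLIP's actual architecture.

```python
import torch
import torch.nn as nn

class DegradationDecomposer(nn.Module):
    """Splits a CLIP image embedding of a low-quality input into a semantic
    embedding and an ordinal degradation severity used as diffusion guidance."""

    def __init__(self, clip_dim=512, cond_dim=256, n_levels=5):
        super().__init__()
        self.semantic_head = nn.Linear(clip_dim, cond_dim)
        # Ordinal head: cumulative "at least this degraded" logits, a
        # standard way to encode ordered severity levels.
        self.ordinal_head = nn.Linear(clip_dim, n_levels - 1)

    def forward(self, clip_emb):
        sem = self.semantic_head(clip_emb)
        severity = torch.sigmoid(self.ordinal_head(clip_emb)).sum(-1, keepdim=True)
        return sem, severity  # both can condition the diffusion model
```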
Unleashing Degradation-Carrying Features in Symmetric U-Net: Simpler and Stronger Baselines for All-in-One Image Restoration
Positive · Artificial Intelligence
A new study has introduced SymUNet, a symmetric U-Net architecture designed for all-in-one image restoration, effectively handling various degradations such as noise and blur. This approach simplifies the architecture while achieving state-of-the-art performance across benchmark datasets by utilizing well-crafted feature extraction and streamlined cross-scale propagation.
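As a point of reference, a mirror-symmetric U-Net pairs each encoder stage with a decoder stage at matching resolution via skip connections and predicts a residual correction to the degraded input. The compact PyTorch sketch below shows that skeleton; channel widths, depth, and the residual output are generic choices, not the paper's SymUNet configuration.

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinySymUNet(nn.Module):
    """Mirror-symmetric encoder/decoder with skips; one model trained on
    mixed degradations (noise, blur, ...) for all-in-one restoration."""

    def __init__(self, ch=32):
        super().__init__()
        self.enc1, self.enc2 = block(3, ch), block(ch, 2 * ch)
        self.mid = block(2 * ch, 2 * ch)
        self.dec2, self.dec1 = block(4 * ch, ch), block(2 * ch, ch)
        self.out = nn.Conv2d(ch, 3, 1)
        self.down = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, x):                  # x: (B, 3, H, W), H and W divisible by 4
        e1 = self.enc1(x)
        e2 = self.enc2(self.down(e1))
        m = self.mid(self.down(e2))
        d2 = self.dec2(torch.cat([self.up(m), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1))
        return x + self.out(d1)            # residual prediction of the clean image
```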
Beyond Pixels: A Training-Free, Text-to-Text Framework for Remote Sensing Image Retrieval
Positive · Artificial Intelligence
A new framework for remote sensing image retrieval, named TRSLLaVA, has been introduced, which operates without the need for training. This framework utilizes the Remote Sensing Rich Text (RSRT) dataset, providing multiple structured captions per image to enhance semantic retrieval capabilities.
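"Text-to-text" retrieval means both the query and every database image are represented as text (each image through its stored captions), so retrieval reduces to comparing sentence embeddings with no image encoder in the loop at query time. A minimal sketch with the sentence-transformers library follows; the encoder name and toy captions are placeholders, not the paper's RSRT setup.

```python
from sentence_transformers import SentenceTransformer, util

# Each database image is represented by its structured captions (toy data here).
image_captions = {
    "scene_001": ["an airport with two runways", "aerial view of terminals"],
    "scene_002": ["dense residential blocks", "urban area with a river"],
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder

def retrieve(query, top_k=1):
    q = model.encode(query, convert_to_tensor=True)
    scored = []
    for image_id, captions in image_captions.items():
        c = model.encode(captions, convert_to_tensor=True)
        # Score an image by its best-matching caption.
        scored.append((util.cos_sim(q, c).max().item(), image_id))
    return sorted(scored, reverse=True)[:top_k]

print(retrieve("city neighborhood near water"))
```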
Lang2Motion: Bridging Language and Motion through Joint Embedding Spaces
Positive · Artificial Intelligence
Lang2Motion has been introduced as a framework that generates language-guided point trajectories by aligning motion manifolds with joint embedding spaces, achieving significant improvements in text-to-trajectory retrieval and motion accuracy compared to existing video-based methods.
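Aligning motion with language in a joint embedding space is commonly trained with a symmetric CLIP-style contrastive loss between trajectory embeddings and text embeddings of matched pairs. The sketch below shows that objective with a stand-in trajectory encoder; it illustrates the general recipe, not Lang2Motion's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryEncoder(nn.Module):
    """Encodes a batch of 2D point trajectories (B, T, 2) into the joint space."""
    def __init__(self, dim=256):
        super().__init__()
        self.rnn = nn.GRU(2, dim, batch_first=True)

    def forward(self, traj):
        _, h = self.rnn(traj)
        return F.normalize(h[-1], dim=-1)

def clip_style_loss(motion_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE over unit-norm embeddings: matched (motion, text)
    # pairs lie on the diagonal of the similarity matrix.
    logits = motion_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```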
Panoramic Out-of-Distribution Segmentation
Positive · Artificial Intelligence
A new task called Panoramic Out-of-Distribution Segmentation (PanOoS) has been introduced to enhance the understanding of panoramic images, which are crucial for applications like autonomous driving and augmented reality. The proposed solution, named POS, utilizes text-guided prompt distribution learning to address challenges such as pixel distortions and background clutter that hinder current segmentation methods.
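A baseline per-pixel out-of-distribution score, which prompt-based methods like POS build on and refine, is simply low confidence under the known classes, e.g. one minus the maximum softmax probability. The sketch below shows that baseline scoring; the threshold and class count are illustrative assumptions, not POS's scoring rule.

```python
import torch

def ood_score_map(seg_logits):
    """Per-pixel OOD score from segmentation logits of shape (B, C, H, W):
    pixels the model is unsure about (low max softmax) score high."""
    probs = seg_logits.softmax(dim=1)
    msp, _ = probs.max(dim=1)        # maximum softmax probability
    return 1.0 - msp                 # high value = likely out-of-distribution

# Thresholding yields a binary OOD mask over the panorama.
logits = torch.randn(1, 19, 64, 128)   # toy logits, e.g. 19 known classes
mask = ood_score_map(logits) > 0.7
```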
