Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction

arXiv — cs.CV · Friday, December 5, 2025 at 5:00:00 AM
  • A new image-captioning method named TOMCap has been introduced that trains on text alone, with no need for aligned image-caption pairs. It combines a pre-trained language-model decoder with CLIP representations, using retrieval augmentation to supply related text during generation while correcting for the modality gap between CLIP's text and image embedding distributions (a hedged sketch of the gap-correction idea appears after this summary).
  • TOMCap is significant because it reduces reliance on curated paired datasets, potentially democratizing access to effective image captioning and enabling broader applications in fields such as accessibility and content creation.
  • The work reflects a growing trend in artificial intelligence toward leveraging pre-trained models and innovative training regimes, echoed in related work on semantic segmentation, continual learning, and adversarial robustness, and underscoring the ongoing convergence of vision and language technologies.
— via World Pulse Now AI Editorial System
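
One common way to make text-only captioning work is to train the decoder on the captions' CLIP text embeddings perturbed with noise, so that at inference the CLIP image embedding (which sits at an offset from the text distribution, the so-called modality gap) can be substituted in. The PyTorch sketch below illustrates that general idea; the noise scale, prefix mapping, and all names are illustrative assumptions, not TOMCap's published procedure.

```python
import torch
import torch.nn as nn

class TextOnlyCaptionPrefix(nn.Module):
    """Maps a CLIP embedding to prefix vectors for a language-model decoder.

    Trained on captions only: the caption's CLIP *text* embedding, perturbed
    with Gaussian noise to bridge the modality gap, stands in for the image
    embedding that will be used at inference time.
    """

    def __init__(self, clip_dim=512, lm_dim=768, prefix_len=10, noise_std=0.1):
        super().__init__()
        self.noise_std = noise_std
        self.prefix_len = prefix_len
        self.proj = nn.Linear(clip_dim, lm_dim * prefix_len)

    def forward(self, clip_emb):
        emb = nn.functional.normalize(clip_emb, dim=-1)
        if self.training:
            # Noise injection: makes the decoder tolerant to the offset
            # between CLIP's text and image embedding distributions.
            emb = emb + self.noise_std * torch.randn_like(emb)
            emb = nn.functional.normalize(emb, dim=-1)
        prefix = self.proj(emb)
        return prefix.view(emb.size(0), self.prefix_len, -1)
```

At training time `clip_emb` holds CLIP text embeddings of the captions; at inference it holds CLIP image embeddings, and the frozen language model decodes conditioned on the projected prefix.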


Continue Reading
CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding
Positive · Artificial Intelligence
A new framework called CAPE has been introduced to enhance Embodied Reference Understanding, which involves predicting the object a person refers to through pointing gestures and language. This approach utilizes a dual-model framework that learns from both head-to-fingertip and wrist-to-fingertip directions, employing a Gaussian ray heatmap representation to improve the model's attention to pointing cues.
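A Gaussian ray heatmap can be pictured as a soft line painted on the image along the pointing direction, with weight decaying as a Gaussian of the perpendicular distance from the ray. The NumPy sketch below is a minimal illustration of that representation; the decay scale `sigma` and the example coordinates are assumptions, not CAPE's settings.

```python
import numpy as np

def gaussian_ray_heatmap(h, w, origin, direction, sigma=8.0):
    """Heatmap that is high near the ray origin + t*direction (t >= 0),
    falling off as a Gaussian of the perpendicular distance to the ray."""
    d = np.asarray(direction, dtype=float)
    d /= np.linalg.norm(d)
    ys, xs = np.mgrid[0:h, 0:w]
    rel = np.stack([xs - origin[0], ys - origin[1]], axis=-1)
    t = rel @ d                       # projection onto the ray direction
    t = np.clip(t, 0, None)           # keep only the forward half-line
    perp = rel - t[..., None] * d     # perpendicular offset from the ray
    dist2 = (perp ** 2).sum(-1)
    return np.exp(-dist2 / (2 * sigma ** 2))

# e.g. a ray from the wrist at (40, 60) toward the fingertip direction (1, 0.25)
hm = gaussian_ray_heatmap(128, 128, origin=(40, 60), direction=(1.0, 0.25))
```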
VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
Positive · Artificial Intelligence
VL-JEPA has been introduced as a vision-language model utilizing a Joint Embedding Predictive Architecture (JEPA), which predicts continuous embeddings of target texts rather than generating tokens autoregressively. This model demonstrates improved performance with 50% fewer trainable parameters compared to traditional token-space models, highlighting its efficiency in processing vision-language tasks.
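Predicting continuous embeddings rather than tokens turns caption generation into regression onto a frozen text encoder's output space, trained with a similarity loss instead of token-level cross-entropy. Below is a minimal sketch of such an objective; the pooling, head sizes, and cosine loss are assumptions rather than VL-JEPA's published design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingPredictor(nn.Module):
    """Predicts the (frozen) text encoder's embedding of the target caption
    from visual features, instead of decoding tokens autoregressively."""

    def __init__(self, vis_dim=1024, txt_dim=768):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(vis_dim, txt_dim), nn.GELU(), nn.Linear(txt_dim, txt_dim)
        )

    def forward(self, vis_feats):          # vis_feats: (B, num_patches, vis_dim)
        return self.head(vis_feats.mean(dim=1))

def jepa_loss(pred, target_emb):
    # Negative cosine similarity to the frozen target-encoder output;
    # detach() applies the stop-gradient to the target branch.
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target_emb.detach(), dim=-1)
    return 1.0 - (pred * target).sum(-1).mean()
```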
Zero-shot Adaptation of Stable Diffusion via Plug-in Hierarchical Degradation Representation for Real-World Super-Resolution
Positive · Artificial Intelligence
A new approach called HD-CLIP has been proposed for Real-World Image Super-Resolution (Real-ISR), which aims to enhance the recovery of high-quality images from low-quality inputs affected by complex real-world degradations. This method decomposes low-quality images into semantic and ordinal degradation embeddings, allowing for better guidance in diffusion models.
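The decomposition can be thought of as two heads on top of a CLIP image embedding: one recovering "what is in the scene" and one estimating an ordered "how degraded" severity, both of which can condition a diffusion prior. The toy sketch below shows one plausible realization; every module name and dimension here is an assumption, not HD-CLIP's actual architecture.

```python
import torch
import torch.nn as nn

class DegradationDecomposer(nn.Module):
    """Splits a CLIP image embedding of a low-quality input into a semantic
    embedding and an ordinal degradation severity used as diffusion guidance."""

    def __init__(self, clip_dim=512, cond_dim=256, n_levels=5):
        super().__init__()
        self.semantic_head = nn.Linear(clip_dim, cond_dim)
        # Ordinal head: cumulative "at least this degraded" logits, a
        # standard way to encode ordered severity levels.
        self.ordinal_head = nn.Linear(clip_dim, n_levels - 1)

    def forward(self, clip_emb):
        sem = self.semantic_head(clip_emb)
        severity = torch.sigmoid(self.ordinal_head(clip_emb)).sum(-1, keepdim=True)
        return sem, severity  # both can condition the diffusion model
```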
Unleashing Degradation-Carrying Features in Symmetric U-Net: Simpler and Stronger Baselines for All-in-One Image Restoration
Positive · Artificial Intelligence
A new study has introduced SymUNet, a symmetric U-Net architecture designed for all-in-one image restoration, effectively handling various degradations such as noise and blur. This approach simplifies the architecture while achieving state-of-the-art performance across benchmark datasets by utilizing well-crafted feature extraction and streamlined cross-scale propagation.
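As a point of reference, a mirror-symmetric U-Net pairs each encoder stage with a decoder stage at matching resolution via skip connections and predicts a residual correction to the degraded input. The compact PyTorch sketch below shows that skeleton; channel widths, depth, and the residual output are generic choices, not the paper's SymUNet configuration.

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinySymUNet(nn.Module):
    """Mirror-symmetric encoder/decoder with skips; one model trained on
    mixed degradations (noise, blur, ...) for all-in-one restoration."""

    def __init__(self, ch=32):
        super().__init__()
        self.enc1, self.enc2 = block(3, ch), block(ch, 2 * ch)
        self.mid = block(2 * ch, 2 * ch)
        self.dec2, self.dec1 = block(4 * ch, ch), block(2 * ch, ch)
        self.out = nn.Conv2d(ch, 3, 1)
        self.down = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, x):                  # x: (B, 3, H, W), H and W divisible by 4
        e1 = self.enc1(x)
        e2 = self.enc2(self.down(e1))
        m = self.mid(self.down(e2))
        d2 = self.dec2(torch.cat([self.up(m), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1))
        return x + self.out(d1)            # residual prediction of the clean image
```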
Beyond Pixels: A Training-Free, Text-to-Text Framework for Remote Sensing Image Retrieval
Positive · Artificial Intelligence
A new framework for remote sensing image retrieval, named TRSLLaVA, has been introduced, which operates without the need for training. This framework utilizes the Remote Sensing Rich Text (RSRT) dataset, providing multiple structured captions per image to enhance semantic retrieval capabilities.
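"Text-to-text" retrieval means both the query and every database image are represented as text (each image through its stored captions), so retrieval reduces to comparing sentence embeddings with no image encoder in the loop at query time. A minimal sketch with the sentence-transformers library follows; the encoder name and toy captions are placeholders, not the paper's RSRT setup.

```python
from sentence_transformers import SentenceTransformer, util

# Each database image is represented by its structured captions (toy data here).
image_captions = {
    "scene_001": ["an airport with two runways", "aerial view of terminals"],
    "scene_002": ["dense residential blocks", "urban area with a river"],
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder

def retrieve(query, top_k=1):
    q = model.encode(query, convert_to_tensor=True)
    scored = []
    for image_id, captions in image_captions.items():
        c = model.encode(captions, convert_to_tensor=True)
        # Score an image by its best-matching caption.
        scored.append((util.cos_sim(q, c).max().item(), image_id))
    return sorted(scored, reverse=True)[:top_k]

print(retrieve("city neighborhood near water"))
```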
Lang2Motion: Bridging Language and Motion through Joint Embedding Spaces
Positive · Artificial Intelligence
Lang2Motion has been introduced as a framework that generates language-guided point trajectories by aligning motion manifolds with joint embedding spaces, achieving significant improvements in text-to-trajectory retrieval and motion accuracy compared to existing video-based methods.
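Aligning motion with language in a joint embedding space is commonly trained with a symmetric CLIP-style contrastive loss between trajectory embeddings and text embeddings of matched pairs. The sketch below shows that objective with a stand-in trajectory encoder; it illustrates the general recipe, not Lang2Motion's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryEncoder(nn.Module):
    """Encodes a batch of 2D point trajectories (B, T, 2) into the joint space."""
    def __init__(self, dim=256):
        super().__init__()
        self.rnn = nn.GRU(2, dim, batch_first=True)

    def forward(self, traj):
        _, h = self.rnn(traj)
        return F.normalize(h[-1], dim=-1)

def clip_style_loss(motion_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE over unit-norm embeddings: matched (motion, text)
    # pairs lie on the diagonal of the similarity matrix.
    logits = motion_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```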
Panoramic Out-of-Distribution Segmentation
Positive · Artificial Intelligence
A new task called Panoramic Out-of-Distribution Segmentation (PanOoS) has been introduced to enhance the understanding of panoramic images, which are crucial for applications like autonomous driving and augmented reality. The proposed solution, named POS, utilizes text-guided prompt distribution learning to address challenges such as pixel distortions and background clutter that hinder current segmentation methods.
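A baseline per-pixel out-of-distribution score, which prompt-based methods like POS build on and refine, is simply low confidence under the known classes, e.g. one minus the maximum softmax probability. The sketch below shows that baseline scoring; the threshold and class count are illustrative assumptions, not POS's scoring rule.

```python
import torch

def ood_score_map(seg_logits):
    """Per-pixel OOD score from segmentation logits of shape (B, C, H, W):
    pixels the model is unsure about (low max softmax) score high."""
    probs = seg_logits.softmax(dim=1)
    msp, _ = probs.max(dim=1)        # maximum softmax probability
    return 1.0 - msp                 # high value = likely out-of-distribution

# Thresholding yields a binary OOD mask over the panorama.
logits = torch.randn(1, 19, 64, 128)   # toy logits, e.g. 19 known classes
mask = ood_score_map(logits) > 0.7
```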
