VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression

arXiv — cs.CV · Thursday, December 18, 2025 at 5:00:00 AM
  • A new study introduces Vision-Language Models for Image Compression (VLIC), which uses state-of-the-art vision-language models as perceptual judges that score compression results according to human preferences. The research highlights that traditional distortion functions such as MSE correlate poorly with human perception, motivating new approaches to image compression (a minimal sketch of the judging setup follows this summary).
  • VLIC is significant because it integrates human-aligned judgments into the evaluation of compression systems, which could improve the perceived quality of compressed images in applications such as digital media and visual content delivery.
  • The work reflects a broader trend in artificial intelligence toward models that replicate human perceptual judgments. Integrating human preferences into machine learning is becoming essential in fields such as image processing and accessibility, as seen in recent studies evaluating image quality for blind and low-vision users.
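The contrast between pixel-level distortion and a preference-based judge can be made concrete. The sketch below is a hedged illustration, not the paper's implementation: mse is the classic distortion baseline, while query_vlm is a hypothetical placeholder for prompting a vision-language model to pick the reconstruction that looks closer to the reference.

```python
# Minimal sketch, not the paper's implementation: contrast pixel-space MSE
# with a hypothetical VLM "judge" that compares two reconstructions of the
# same reference image. `query_vlm` is a placeholder, not a real API.
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Classic pixel-space distortion; known to correlate poorly with perception."""
    return float(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2))

def query_vlm(reference, candidate_a, candidate_b) -> str:
    """Hypothetical stand-in: show a vision-language model the reference and
    both reconstructions, ask which looks closer to the reference, and
    return 'A' or 'B'. Swap in a concrete multimodal client here."""
    raise NotImplementedError("plug in a real VLM API")

def pick_reconstruction(reference, cand_a, cand_b, use_vlm: bool = False) -> str:
    if not use_vlm:
        # MSE baseline: a tiny spatial shift can score far worse than a heavy
        # blur, even though human viewers usually prefer the shifted image.
        return "A" if mse(reference, cand_a) <= mse(reference, cand_b) else "B"
    # Human-aligned variant: delegate the pairwise comparison to a VLM judge.
    return query_vlm(reference, cand_a, cand_b)
```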
— via World Pulse Now AI Editorial System

Continue Reading
If you can describe it, they can see it: Cross-Modal Learning of Visual Concepts from Textual Descriptions
Positive · Artificial Intelligence
A novel approach called Knowledge Transfer (KT) has been introduced to enhance Vision-Language Models (VLMs) by enabling them to learn new visual concepts solely from textual descriptions. This method aligns visual features with text representations, allowing VLMs to visualize previously unknown concepts without relying on visual examples or external generative models.
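The underlying alignment idea can be illustrated independently of the KT method itself: with a contrastive model such as CLIP, a new visual concept can be matched from a textual description alone. The sketch below is an illustration under assumptions; the model name and example descriptions are placeholders, not details from the paper.

```python
# Minimal sketch of the alignment idea, not the KT method itself: match
# images against textual descriptions of concepts via CLIP embeddings, so a
# new concept is added with a sentence instead of labeled images.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

descriptions = [
    "a small striped wild cat with tufted ears",
    "a large grey animal with a long trunk",
]

def classify(image: Image.Image) -> int:
    """Return the index of the description the image matches best."""
    inputs = processor(text=descriptions, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # image-to-text similarity
    return int(logits.argmax(dim=-1))
```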
Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
Positive · Artificial Intelligence
Vision-language models (VLMs) have been found to struggle with 3D-related tasks, which are essential for applications in robotics and embodied agents. To address this issue, a new framework called SandboxVLM has been introduced, which utilizes abstract bounding boxes to enhance the encoding of geometric structures and physical kinematics, thereby improving spatial intelligence in VLMs.
PerTouch: VLM-Driven Agent for Personalized and Semantic Image Retouching
Positive · Artificial Intelligence
PerTouch is a unified diffusion-based framework for image retouching that enhances visual quality while aligning with user preferences. The method uses parameter maps for fine-grained adjustments and adds mechanisms for improved semantic boundary perception, enabling a more personalized retouching experience.
From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model
Positive · Artificial Intelligence
The introduction of the Temporal Understanding in Autonomous Driving (TAD) benchmark addresses the significant challenge of temporal reasoning in autonomous driving, specifically focusing on ego-centric footage. This benchmark evaluates Vision-Language Models (VLMs) through nearly 6,000 question-answer pairs across seven tasks, highlighting the limitations of current state-of-the-art models in accurately capturing dynamic relationships in driving scenarios.
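A QA-style benchmark of this kind is typically scored by grouping question-answer pairs by task and reporting per-task accuracy. The sketch below is a generic illustration; the data format and exact-match rule are assumptions, not TAD's actual protocol.

```python
# Generic sketch of scoring a QA benchmark; the dict format and the
# string-matching rule are assumptions, not TAD's actual interface.
from collections import defaultdict

def per_task_accuracy(qa_pairs, model_answer):
    """qa_pairs: iterable of dicts with 'task', 'question', 'answer' keys.
    model_answer: callable mapping a question to the model's answer string."""
    correct, total = defaultdict(int), defaultdict(int)
    for qa in qa_pairs:
        total[qa["task"]] += 1
        predicted = model_answer(qa["question"]).strip().lower()
        if predicted == qa["answer"].strip().lower():
            correct[qa["task"]] += 1
    return {task: correct[task] / total[task] for task in total}
```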
