Learning More by Seeing Less: Structure First Learning for Efficient, Transferable, and Human-Aligned Vision

arXiv · cs.CV · Thursday, November 13, 2025 at 5:00:00 AM
Recent advances in computer vision have highlighted the limitations of current recognition systems, which rely heavily on rich visual inputs. In contrast, humans interpret sparse representations, such as line drawings, with ease. The newly proposed structure-first learning paradigm leverages this insight by using line drawings as an initial training modality. This approach has been shown to improve model performance significantly, fostering a stronger shape bias and enhancing data efficiency across classification, detection, and segmentation tasks. Notably, models trained this way exhibit lower intrinsic dimensionality, requiring fewer principal components to capture the variance in their representations, mirroring the efficient representations observed in the human brain. Furthermore, structure-first training enables better distillation into lightweight student models, which outperform counterparts trained on richer, color-supervised data.
— via World Pulse Now AI Editorial System
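The intrinsic-dimensionality claim has a standard operationalization: count how many principal components are needed to explain a fixed fraction of feature variance. The sketch below illustrates that recipe on synthetic activations; the 95% threshold, the feature shapes, and the low-rank stand-in for structure-first features are illustrative assumptions, not the paper's protocol.

```python
# Minimal sketch: intrinsic dimensionality as the number of principal
# components needed to explain a fixed fraction of feature variance.
import numpy as np
from sklearn.decomposition import PCA

def intrinsic_dim(features: np.ndarray, var_threshold: float = 0.95) -> int:
    """Number of principal components needed to explain `var_threshold` variance."""
    cumulative = np.cumsum(PCA().fit(features).explained_variance_ratio_)
    return int(np.searchsorted(cumulative, var_threshold) + 1)

rng = np.random.default_rng(0)
# Stand-ins for penultimate-layer activations on the same image set:
feats_structure_first = rng.normal(size=(1000, 32)) @ rng.normal(size=(32, 512))  # low-rank
feats_baseline = rng.normal(size=(1000, 512))                                     # full-rank
print(intrinsic_dim(feats_structure_first), intrinsic_dim(feats_baseline))  # few vs. many
```

On these stand-ins the low-rank features need roughly 32 components while the full-rank baseline needs hundreds, which is the shape of the contrast the paper reports.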


Recommended Readings
X-VMamba: Explainable Vision Mamba
Positive · Artificial Intelligence
The X-VMamba model introduces a controllability-based interpretability framework for State Space Models (SSMs), particularly the Mamba architecture. This framework aims to clarify how Vision SSMs process spatial information, which has been a challenge due to the absence of transparent mechanisms. The proposed methods include a Jacobian-based approach for any SSM architecture and a Gramian-based method for diagonal SSMs, both designed to enhance understanding of internal state dynamics while maintaining computational efficiency.
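As rough intuition for the Gramian-based method: the controllability Gramian of a diagonal linear SSM x_{t+1} = A x_t + B u_t is itself diagonal with a closed form, so per-state input-energy scores are cheap to compute. The sketch below is a hypothetical toy illustration of that quantity, not X-VMamba's actual implementation; all names and values are assumed.

```python
# Toy controllability score for a diagonal SSM (illustrative only).
import numpy as np

def controllability_scores(a: np.ndarray, b: np.ndarray, horizon: int) -> np.ndarray:
    """Diagonal of the finite-horizon controllability Gramian of
    x_{t+1} = A x_t + B u_t with A = diag(a), B = b:
    W_ii = b_i^2 * (1 - a_i^(2T)) / (1 - a_i^2)."""
    a2 = a ** 2
    series = np.where(np.isclose(a2, 1.0), float(horizon),
                      (1.0 - a2 ** horizon) / (1.0 - a2))
    return b ** 2 * series

a = np.array([0.99, 0.9, 0.5])   # per-state decay rates (assumed |a_i| < 1)
b = np.array([1.0, 1.0, 1.0])    # per-state input couplings
print(controllability_scores(a, b, horizon=64))
# States that decay slowly (a_i near 1) accumulate far more input energy,
# so they dominate how information propagates through the scan.
```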
TEyeD: Over 20 million real-world eye images with Pupil, Eyelid, and Iris 2D and 3D Segmentations, 2D and 3D Landmarks, 3D Eyeball, Gaze Vector, and Eye Movement Types
Positive · Artificial Intelligence
TEyeD is the world's largest unified public dataset of eye images, featuring over 20 million images collected using seven different head-mounted eye trackers, including devices integrated into virtual and augmented reality systems. The dataset encompasses a variety of activities, such as car rides and sports, and includes detailed annotations like 2D and 3D landmarks, semantic segmentation, and gaze vectors. This resource aims to enhance research in computer vision, eye tracking, and gaze estimation.
FairReweighing: Density Estimation-Based Reweighing Framework for Improving Separation in Fair Regression
Positive · Artificial Intelligence
The article presents FairReweighing, a framework that uses density estimation to enhance fairness in regression tasks. AI applications across sectors have raised concerns about transparency and fairness across demographic groups, yet most existing fairness research has focused on binary classification. This study introduces a mutual information-based metric to evaluate separation violations and proposes a pre-processing reweighing algorithm to ensure fairness in regression models, addressing a relatively underexplored area in AI fairness.
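One plausible reading of density-estimation-based reweighing, sketched below: estimate the label density overall and per demographic group with kernel density estimation, then weight each sample by p(y) / p(y | group) so every group presents the model with a matched label distribution. The bandwidth, synthetic data, and weighting rule here are assumptions for illustration, not the paper's algorithm.

```python
# Illustrative density-based reweighing for fair regression (not the
# paper's exact algorithm): weight each sample by p(y) / p(y | group).
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.linear_model import LinearRegression

def fair_weights(y: np.ndarray, group: np.ndarray, bandwidth: float = 0.5) -> np.ndarray:
    y2d = y.reshape(-1, 1)
    log_p_y = KernelDensity(bandwidth=bandwidth).fit(y2d).score_samples(y2d)
    weights = np.empty_like(y, dtype=float)
    for g in np.unique(group):
        mask = group == g
        log_p_y_g = KernelDensity(bandwidth=bandwidth).fit(y2d[mask]).score_samples(y2d[mask])
        weights[mask] = np.exp(log_p_y[mask] - log_p_y_g)
    return weights

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
group = rng.integers(0, 2, size=500)
y = X @ np.array([1.0, -2.0, 0.5]) + group * 1.5 + rng.normal(size=500)

w = fair_weights(y, group)
model = LinearRegression().fit(X, y, sample_weight=w)  # weighted training step
```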
Dynamic Gaussian Scene Reconstruction from Unsynchronized Videos
Positive · Artificial Intelligence
The paper titled 'Dynamic Gaussian Scene Reconstruction from Unsynchronized Videos' presents a novel approach to multi-view video reconstruction, crucial for applications in computer vision, film production, virtual reality, and motion analysis. The authors address the common issue of temporal misalignment in unsynchronized video streams, which can degrade reconstruction quality. They propose a temporal alignment strategy that utilizes a coarse-to-fine alignment module to estimate and compensate for time shifts between cameras, enhancing the overall reconstruction process.
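A minimal stand-in for coarse-to-fine time-shift estimation, assuming per-frame scalar signals: a coarse shift from discrete cross-correlation, then sub-frame refinement via parabolic peak interpolation. The paper's alignment module operates on richer per-frame features; this sketch only illustrates the coarse-to-fine idea.

```python
# Toy coarse-to-fine time-shift estimation between two unsynchronized
# cameras; mean frame brightness stands in for learned per-frame features.
import numpy as np

def estimate_shift(sig_a: np.ndarray, sig_b: np.ndarray) -> float:
    a = (sig_a - sig_a.mean()) / sig_a.std()
    b = (sig_b - sig_b.mean()) / sig_b.std()
    corr = np.correlate(a, b, mode="full")      # coarse: discrete cross-correlation
    i = int(np.argmax(corr))
    k = float(i)
    if 0 < i < len(corr) - 1:                   # fine: parabolic peak interpolation
        y0, y1, y2 = corr[i - 1], corr[i], corr[i + 1]
        k += 0.5 * (y0 - y2) / (y0 - 2 * y1 + y2)
    return k - (len(sig_b) - 1)                 # lag L such that a[n] ~ b[n - L]

t = np.linspace(0, 10, 300)
s = np.sin(t) + 0.1 * np.random.default_rng(2).normal(size=t.size)
print(estimate_shift(s[:-5], s[5:]))            # recovers the ~5-frame offset
```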
D-GAP: Improving Out-of-Domain Robustness via Dataset-Agnostic and Gradient-Guided Augmentation in Amplitude and Pixel Spaces
Positive · Artificial Intelligence
The article presents D-GAP (Dataset-agnostic and Gradient-guided augmentation in Amplitude and Pixel spaces), a novel approach for improving out-of-domain (OOD) robustness in computer vision. Traditional augmentations often fail under varying image conditions; D-GAP instead introduces targeted augmentations in both the Fourier amplitude space and the pixel space. The method counteracts the learning bias of neural networks toward domain-specific frequency components, improving performance across diverse datasets.
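Amplitude-space augmentation generally means perturbing an image's Fourier amplitude spectrum while preserving its phase, which carries structural content. The sketch below shows that generic recipe via amplitude mixing between two images; D-GAP's gradient guidance and dataset-agnostic scheduling are omitted, and the mixing rule here is an assumption.

```python
# Generic amplitude-space augmentation sketch: keep an image's phase
# spectrum (structure) and blend its Fourier amplitude with another
# image's to vary domain-specific appearance.
import numpy as np

def amplitude_mix(img: np.ndarray, ref: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """Blend `img`'s FFT amplitude toward `ref`'s while keeping `img`'s phase."""
    fft_img = np.fft.fft2(img, axes=(0, 1))
    fft_ref = np.fft.fft2(ref, axes=(0, 1))
    mixed_amp = (1.0 - alpha) * np.abs(fft_img) + alpha * np.abs(fft_ref)
    mixed = mixed_amp * np.exp(1j * np.angle(fft_img))
    return np.real(np.fft.ifft2(mixed, axes=(0, 1)))

rng = np.random.default_rng(3)
img, ref = rng.random((64, 64, 3)), rng.random((64, 64, 3))  # toy RGB images in [0, 1]
augmented = amplitude_mix(img, ref)  # same structure, shifted frequency statistics
```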
Benchmarking Retrieval-Augmented Large Language Models in Biomedical NLP: Application, Robustness, and Self-Awareness
Neutral · Artificial Intelligence
The paper titled 'Benchmarking Retrieval-Augmented Large Language Models in Biomedical NLP: Application, Robustness, and Self-Awareness' discusses the capabilities of large language models (LLMs) in biomedical natural language processing (NLP) tasks. It highlights the sensitivity of LLMs to demonstration selection and examines retrieval-augmented LLMs (RALs) as a way to mitigate hallucination. However, RALs' impact on the various biomedical NLP tasks has not been rigorously evaluated, which complicates understanding of their capabilities in this domain.
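For readers unfamiliar with the RAL pattern being benchmarked, a minimal retrieve-then-prompt sketch follows, assuming TF-IDF retrieval and a stub `llm_generate` function; the paper evaluates real retrievers and models rather than this toy pipeline.

```python
# Minimal retrieve-then-prompt sketch of a retrieval-augmented LLM (RAL).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Metformin is a first-line therapy for type 2 diabetes.",
    "BRCA1 mutations raise hereditary breast cancer risk.",
    "ACE inhibitors are commonly prescribed for hypertension.",
]

def retrieve(query: str, k: int = 2) -> list:
    vec = TfidfVectorizer().fit(corpus + [query])
    scores = cosine_similarity(vec.transform([query]), vec.transform(corpus))[0]
    return [corpus[i] for i in scores.argsort()[::-1][:k]]

def llm_generate(prompt: str) -> str:
    # Hypothetical stand-in for an actual model call.
    return "<answer grounded in the retrieved evidence>"

query = "What drug is first-line for type 2 diabetes?"
prompt = "Evidence:\n" + "\n".join(retrieve(query)) + f"\nQuestion: {query}\nAnswer:"
print(llm_generate(prompt))
```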
Large-scale modality-invariant foundation models for brain MRI analysis: Application to lesion segmentation
Neutral · Artificial Intelligence
The article discusses a significant advancement in computer vision, focusing on large-scale modality-invariant foundation models for brain MRI analysis. These models utilize self-supervised learning to leverage extensive unlabeled MRI data, enhancing performance in neuroimaging tasks such as lesion segmentation for stroke and epilepsy. The study highlights the importance of maintaining modality-specific features despite successful cross-modality alignment, and the model's code and checkpoints are publicly available.
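Cross-modality alignment is commonly instantiated as a symmetric contrastive (InfoNCE-style) loss between embeddings of the same scan under different modalities. The sketch below shows that generic recipe with hypothetical encoder outputs; it is not the paper's architecture, and the batch size, embedding width, and temperature are assumptions.

```python
# Generic cross-modality alignment sketch: a symmetric InfoNCE loss pulls
# together embeddings of the same volume seen under two MRI modalities
# and pushes apart embeddings of different volumes.
import torch
import torch.nn.functional as F

def alignment_loss(z_a: torch.Tensor, z_b: torch.Tensor, temp: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch: row i of each modality is the same volume."""
    za, zb = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
    logits = za @ zb.t() / temp                 # pairwise cosine similarities
    targets = torch.arange(za.size(0))          # matching pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Stand-ins for encoder outputs on a batch of, e.g., T1 and FLAIR scans.
z_t1, z_flair = torch.randn(8, 256), torch.randn(8, 256)
print(alignment_loss(z_t1, z_flair))
```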