Scene Graph-Guided Generative AI Framework for Synthesizing and Evaluating Industrial Hazard Scenarios

arXiv — cs.CV · Wednesday, November 19, 2025 at 5:00:00 AM
  • A novel scene graph-guided generative AI framework synthesizes and evaluates industrial hazard scenarios.
  • This is significant because realistic synthetic training data improves vision models for detecting workplace hazards, potentially strengthening safety measures and reducing accidents in industrial environments.
— via World Pulse Now AI Editorial System


Recommended Readings
Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions
Positive · Artificial Intelligence
The paper titled 'Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions' introduces a new method called FIxLIP for explaining the similarity outputs of vision-language models. This approach addresses limitations of existing saliency maps by utilizing the weighted Banzhaf interaction index from game theory, which enhances computational efficiency and flexibility. The study emphasizes the importance of understanding complex cross-modal interactions in language-image pre-training (LIP) models.
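For intuition, here is a minimal Python sketch of the weighted Banzhaf interaction index that FIxLIP builds on; the toy value function below is purely illustrative, standing in for the image-text similarity game the paper defines.

```python
from itertools import combinations

def weighted_banzhaf_interaction(v, n, i, j, p=0.5):
    """Weighted Banzhaf interaction index for players i and j.

    v: value function mapping a frozenset of players to a float
    n: total number of players
    p: coalition-inclusion weight (p = 0.5 recovers the classic Banzhaf index)
    """
    others = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for r in range(len(others) + 1):
        for S in combinations(others, r):
            S = frozenset(S)
            weight = p ** len(S) * (1 - p) ** (len(others) - len(S))
            # Discrete second derivative of v with respect to i and j
            total += weight * (v(S | {i, j}) - v(S | {i}) - v(S | {j}) + v(S))
    return total

# Toy "game": value of a coalition of token/patch features, with a
# synergy between players 0 and 1. FIxLIP instead defines v via the
# vision-language similarity score.
scores = {0: 0.4, 1: 0.3, 2: 0.1}
v = lambda S: sum(scores[k] for k in S) + (0.2 if {0, 1} <= S else 0.0)

print(weighted_banzhaf_interaction(v, n=3, i=0, j=1))  # ~0.2, the synergy
```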
Continual Learning for Image Captioning through Improved Image-Text Alignment
Positive · Artificial Intelligence
Generating accurate and coherent image captions in a continual learning environment poses significant challenges, particularly due to catastrophic forgetting and the evolving nature of visual concepts. This study introduces a multi-loss framework for continual image captioning that leverages semantic guidance through prompt-based continual learning and contrastive alignment. The proposed method, built on a pretrained ViT-GPT-2 backbone, integrates various loss components to enhance image-text alignment without introducing additional parameters.
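As a rough illustration of such a multi-loss objective, the sketch below combines a captioning cross-entropy with a CLIP-style contrastive alignment term; the weighting and tensor shapes are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def captioning_objective(token_logits, token_targets, img_emb, txt_emb,
                         align_weight=0.5, pad_id=0):
    """Caption cross-entropy plus a contrastive alignment term.

    align_weight is a hypothetical knob; the paper combines several
    loss components, and this sketch keeps only the two central ones.
    """
    ce = F.cross_entropy(token_logits.flatten(0, 1), token_targets.flatten(),
                         ignore_index=pad_id)
    return ce + align_weight * contrastive_alignment_loss(img_emb, txt_emb)

# Toy shapes: batch of 4, caption length 12, vocab 50k, embedding dim 512
logits = torch.randn(4, 12, 50000)
targets = torch.randint(1, 50000, (4, 12))
img, txt = torch.randn(4, 512), torch.randn(4, 512)
print(captioning_objective(logits, targets, img, txt))
```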
QwenCLIP: Boosting Medical Vision-Language Pretraining via LLM Embeddings and Prompt Tuning
Positive · Artificial Intelligence
QwenCLIP is a new vision-language framework that enhances medical pretraining by integrating large language model (LLM) embeddings and learnable prompts. Traditional Contrastive Language-Image Pretraining (CLIP) struggles with long radiology reports due to its limited token capacity. By replacing CLIP's text encoder with an LLM-based module, QwenCLIP aims to improve cross-modal alignment and capture comprehensive medical semantics, addressing the limitations of existing domain-specific encoders like PubMedBERT and ClinicalBERT.
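The architectural idea can be sketched as projecting frozen LLM hidden states into the joint embedding space alongside learnable prompt vectors; all dimensions, module names, and pooling choices below are assumptions rather than QwenCLIP's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LLMTextTower(nn.Module):
    """Hypothetical stand-in for an LLM-based text branch.

    A frozen LLM produces long-context report embeddings; a small
    learnable projection maps them into the image encoder's space.
    """
    def __init__(self, llm_dim=4096, joint_dim=512, n_prompts=8):
        super().__init__()
        # Learnable prompt vectors prepended to the report embedding sequence
        self.prompts = nn.Parameter(torch.randn(n_prompts, llm_dim) * 0.02)
        self.proj = nn.Linear(llm_dim, joint_dim)

    def forward(self, llm_hidden_states):          # (B, T, llm_dim)
        B = llm_hidden_states.size(0)
        prompts = self.prompts.unsqueeze(0).expand(B, -1, -1)
        seq = torch.cat([prompts, llm_hidden_states], dim=1)
        pooled = seq.mean(dim=1)                   # simple mean pooling
        return F.normalize(self.proj(pooled), dim=-1)

# A long radiology report no longer hits CLIP's 77-token ceiling:
# the LLM side can consume thousands of tokens before pooling.
tower = LLMTextTower()
fake_llm_states = torch.randn(2, 1500, 4096)       # 1,500 report tokens
print(tower(fake_llm_states).shape)                # torch.Size([2, 512])
```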
Evaluating Large Language Models for Diacritic Restoration in Romanian Texts: A Comparative Study
Positive · Artificial Intelligence
This study evaluates the effectiveness of various large language models (LLMs) in restoring diacritics in Romanian texts, a crucial task for text processing in languages with rich diacritical marks. The models tested include OpenAI's GPT-3.5, GPT-4, Google's Gemini 1.0 Pro, and Meta's Llama family, among others. Results indicate that GPT-4o achieves high accuracy in diacritic restoration, outperforming a neutral baseline, while other models show variability. The findings emphasize the importance of model architecture, training data, and prompt design in enhancing natural language processing to…
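One plausible way to score such restorations is character-level accuracy over diacritic-bearing positions; the metric below is an assumption about the evaluation, not the study's exact protocol.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks (ș -> s, ă -> a) via Unicode decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def diacritic_accuracy(prediction, reference):
    """Accuracy on exactly the positions that carry a diacritic in the
    reference. Assumes prediction and reference are already aligned
    character-for-character; real evaluations also need alignment."""
    slots = [(p, r) for p, r in zip(prediction, reference)
             if strip_diacritics(r) != r]
    if not slots:
        return 1.0
    return sum(p == r for p, r in slots) / len(slots)

reference = "În școală învățăm să scriem corect."
stripped = strip_diacritics(reference)              # what the model receives
restored = "În școală invățăm să scriem corect."    # one miss: 'î' left as 'i'
print(round(diacritic_accuracy(restored, reference), 3))  # 0.875
```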
LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation
Positive · Artificial Intelligence
This study explores the use of Large Language Models (LLMs), specifically GPT-4o, for evaluating short-answer quizzes and project reports in an undergraduate Computational Linguistics course. The research involved approximately 50 students and 14 project teams, comparing LLM-generated scores with human evaluations from teaching assistants. Results indicated a strong correlation between LLM and human scores, achieving up to 0.98 correlation and exact score agreement in 55% of quiz cases, while showing variability in scoring open-ended responses.
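The two reported statistics, Pearson correlation and exact-agreement rate, are easy to reproduce on any pair of score lists; the scores below are hypothetical.

```python
from statistics import correlation  # Python 3.10+

def exact_agreement(a, b):
    """Fraction of items where the two graders give identical scores."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical quiz scores: human TA vs. LLM grader
human = [3, 5, 4, 2, 5, 4, 3, 5, 1, 4]
llm   = [3, 5, 4, 3, 5, 4, 3, 5, 2, 4]

print(f"Pearson r:       {correlation(human, llm):.3f}")
print(f"Exact agreement: {exact_agreement(human, llm):.0%}")
```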
UniSER: A Foundation Model for Unified Soft Effects Removal
Positive · Artificial Intelligence
The paper introduces UniSER, a foundational model designed for the unified removal of soft effects in digital images, such as lens flare, haze, shadows, and reflections. These effects often degrade image aesthetics while leaving underlying pixels visible. Existing solutions typically focus on individual issues, leading to specialized models that lack scalability. In contrast, UniSER leverages the commonality of semi-transparent occlusions to effectively address various soft effect degradations, enhancing image restoration capabilities beyond current generalist models that require detailed prompts.
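That shared commonality can be written as alpha compositing: the observation blends a semi-transparent effect layer over the clean image. The numpy sketch below shows the idealized forward model and its closed-form inverse; UniSER learns the inversion end-to-end rather than assuming a known effect layer and opacity, as this sketch does.

```python
import numpy as np

def composite_soft_effect(clean, effect, alpha):
    """Forward model shared by flare, haze, shadow, and reflection:
    the observation is a per-pixel blend of a semi-transparent
    effect layer over the clean image (all arrays in [0, 1])."""
    return alpha * effect + (1.0 - alpha) * clean

def invert_soft_effect(observed, effect, alpha, eps=1e-6):
    """Given estimates of the effect layer and its opacity, recover
    the clean image by inverting the blend (idealized case)."""
    return (observed - alpha * effect) / np.clip(1.0 - alpha, eps, None)

rng = np.random.default_rng(0)
clean = rng.random((4, 4, 3))
haze = np.full((4, 4, 3), 0.9)           # bright, uniform veil
alpha = np.full((4, 4, 1), 0.3)          # 30% opacity everywhere
observed = composite_soft_effect(clean, haze, alpha)
recovered = invert_soft_effect(observed, haze, alpha)
print(np.allclose(recovered, clean))     # True
```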
CAR-Scenes: Semantic VLM Dataset for Safe Autonomous Driving
Positive · Artificial Intelligence
CAR-Scenes is a frame-level dataset designed for autonomous driving, facilitating the training and evaluation of vision-language models (VLMs) for scene-level understanding. The dataset comprises 5,192 annotated images from sources like Argoverse, Cityscapes, KITTI, and nuScenes, utilizing a comprehensive 28-key category/sub-category knowledge base. The annotations are generated through a GPT-4o-assisted pipeline with human verification, providing detailed attributes and supporting semantic retrieval and risk-aware scenario mining.
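A hypothetical annotation record and attribute filter convey what risk-aware scenario mining over such a knowledge base might look like; the keys below are illustrative and do not reflect the dataset's actual 28-key schema.

```python
# Hypothetical frame annotations in the spirit of CAR-Scenes'
# category/sub-category knowledge base (keys are placeholders).
frames = [
    {"source": "nuScenes", "image_id": "n0001",
     "weather": "rain", "time_of_day": "night",
     "vulnerable_road_users": ["pedestrian"], "risk_level": "high"},
    {"source": "KITTI", "image_id": "k0042",
     "weather": "clear", "time_of_day": "day",
     "vulnerable_road_users": [], "risk_level": "low"},
]

def mine_scenarios(frames, **criteria):
    """Simple attribute filter for risk-aware scenario mining."""
    return [f for f in frames
            if all(f.get(k) == v for k, v in criteria.items())]

for f in mine_scenarios(frames, weather="rain", risk_level="high"):
    print(f["source"], f["image_id"])   # nuScenes n0001
```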
Segmenting Collision Sound Sources in Egocentric Videos
Positive · Artificial Intelligence
The article presents a novel task called Collision Sound Source Segmentation (CS3), which aims to identify and segment the objects responsible for collision sounds in egocentric video footage. This task is challenging due to the nature of collision sounds arising from interactions between two objects, making it difficult to isolate the sound source visually. The proposed method utilizes weakly-supervised audio-conditioned segmentation techniques, leveraging foundation models like CLIP and SAM2, and incorporates egocentric cues to enhance object identification.
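At a high level, audio conditioning can be sketched as projecting a sound embedding into the visual feature space and scoring per-pixel similarity; every shape and module below is an assumption, and the real pipeline additionally builds on CLIP/SAM2 features and egocentric cues.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioConditionedSegHead(nn.Module):
    """Minimal sketch of audio-conditioned segmentation: an audio
    embedding of the collision sound is projected into the visual
    feature space and used as a per-pixel similarity query."""
    def __init__(self, audio_dim=768, visual_dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, visual_dim)

    def forward(self, visual_feats, audio_emb):
        # visual_feats: (B, C, H, W) frame features, e.g. from a frozen
        # image encoder; audio_emb: (B, audio_dim) sound embedding
        query = F.normalize(self.audio_proj(audio_emb), dim=-1)  # (B, C)
        feats = F.normalize(visual_feats, dim=1)
        # Cosine similarity between the sound query and every pixel
        logits = torch.einsum("bc,bchw->bhw", query, feats)
        return logits.sigmoid()           # soft mask over the frame

head = AudioConditionedSegHead()
mask = head(torch.randn(1, 256, 32, 32), torch.randn(1, 768))
print(mask.shape)                         # torch.Size([1, 32, 32])
```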