Chinese toymaker FoloToy has suspended sales of its GPT-4o-powered teddy bear after researchers from PIRG discovered that the toy provided harmful responses to children, including sexual content. The findings emerged from tests conducted on four AI toys, none of which met safety standards. The suspension comes amid growing concerns about the implications of AI technology in children's products and the potential risks associated with unregulated AI interactions.
A recent study evaluates seven advanced large language models (LLMs) on low-resource and morphologically rich languages, specifically Cantonese, Japanese, and Turkish. The research assesses the models on tasks such as open-domain question answering, document summarization, translation, and culturally grounded dialogue. Although LLMs achieve impressive results in high-resource languages, the study notes that their effectiveness in these less-studied languages remains underexplored.
VP-Bench is a newly introduced benchmark designed to evaluate the ability of multimodal large language models (MLLMs) to interpret visual prompts (VPs) in images. This benchmark addresses a significant gap in existing evaluations: no systematic assessment of how well MLLMs recognize VPs had previously been conducted. VP-Bench uses a two-stage evaluation framework, involving 30,000 visualized prompts across eight shapes and 355 attribute combinations, to assess MLLMs' capabilities in VP perception and utilization.
The article discusses a novel adaptive LiDAR scanning framework that enhances 3D object detection by utilizing temporal cues from past observations. Traditional LiDAR sensors often perform redundant scans, leading to inefficiencies in data acquisition and power consumption. The proposed method employs a lightweight predictor network to identify regions of interest, significantly reducing unnecessary data collection and improving overall efficiency.
The article presents Cam4DOcc, a new benchmark for camera-only 4D occupancy forecasting in autonomous driving applications. This benchmark aims to enhance the understanding of how environments change over time, which is vital for safe and reliable autonomous driving. It draws on multiple publicly available datasets, including nuScenes, nuScenes-Occupancy, and Lyft-Level5, to evaluate predictions of the future states of surrounding objects, thereby extending current occupancy estimation techniques that primarily focus on present 3D representations.
A recent study published on arXiv investigates the use of Large Language Models (LLMs), specifically GPT-4o, for grading short-answer quizzes and project reports in an undergraduate Computational Linguistics course. The research involved approximately 50 students and 14 project teams, comparing LLM-generated scores with evaluations from teaching assistants. Results indicated a strong correlation (up to 0.98) with human graders and exact score agreement in 55% of quiz cases, highlighting both the potential and limitations of LLM-based grading systems.
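To make the two reported agreement measures concrete, the following is a minimal sketch of how a correlation coefficient and an exact-agreement rate could be computed for paired LLM and teaching-assistant scores. It assumes Pearson correlation, which the summary does not specify, and the function names and score lists are hypothetical, invented purely for illustration; they are not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists (assumed metric)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one grader gives constant scores")
    return cov / math.sqrt(var_x * var_y)


def exact_agreement(xs, ys):
    """Fraction of items on which both graders assigned the identical score."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two equal-length, non-empty lists")
    return sum(1 for x, y in zip(xs, ys) if x == y) / len(xs)


# Hypothetical quiz scores for illustration only; not taken from the study.
llm_scores = [3, 4, 2, 5, 4, 1, 3, 5]
ta_scores = [3, 4, 3, 5, 3, 1, 2, 5]

r = pearson_r(llm_scores, ta_scores)
agreement = exact_agreement(llm_scores, ta_scores)
print(f"correlation = {r:.2f}, exact agreement = {agreement:.1%}")
```

The two measures answer different questions: correlation can remain high when one grader is consistently a point stricter, whereas exact agreement counts only identical scores. That difference is one plausible reason a study could report both a very high correlation and a much lower exact-agreement rate.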