arXiv:2504.21194v2 Announce Type: replace 
Abstract: This paper introduces ISS-Geo142, a curated benchmark for geolocating astronaut photography captured from the International Space Station (ISS). Although the ISS position at capture time is known precisely, the specific Earth locations depicted in these images are typically not directly georeferenced, making automated localization non-trivial. ISS-Geo142 consists of 142 images with associated metadata and manually determined geographic locations, spanning a range of spatial scales and scene types.
  On top of this benchmark, we implement and evaluate three geolocation pipelines: a neural network based approach (NN-Geo) using VGG16 features and cross-correlation over map-derived Areas of Interest (AOIs), a Scale-Invariant Feature Transform based pipeline (SIFT-Match) using sliding-window feature matching on stitched high-resolution AOIs, and TerraByte, an AI system built around a GPT-4 model with vision capabilities that jointly reasons over image content and ISS coordinates. On ISS-Geo142, NN-Geo achieves a match for 75.52% of the images under our evaluation protocol, SIFT-Match attains high precision on structurally rich scenes at substantial computational cost, and TerraByte establishes the strongest overall baseline, correctly geolocating approximately 90% of the images while also producing human-readable geographic descriptions.
  The methods and experiments were originally developed in 2023; this manuscript is a revised and extended version that situates the work relative to subsequent advances in cross-view geo-localization and remote-sensing vision-language models. Taken together, ISS-Geo142 and these three pipelines provide a concrete, historically grounded benchmark for future work on ISS image geolocation.
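
The abstract reports match rates "under our evaluation protocol" without stating the protocol. As a minimal sketch of how such a rate could be computed, the Python below scores a prediction as a match when it lies within a distance threshold of the manually determined location. The function names and the 50 km threshold are assumptions made for illustration, not details from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def match_rate(predictions, ground_truth, threshold_km=50.0):
    """Fraction of images whose predicted (lat, lon) is within threshold_km
    of the reference (lat, lon). The threshold is a hypothetical choice."""
    if not ground_truth or len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must be non-empty and aligned")
    hits = sum(
        haversine_km(p[0], p[1], g[0], g[1]) <= threshold_km
        for p, g in zip(predictions, ground_truth)
    )
    return hits / len(ground_truth)
```

A distance threshold is only one possible criterion; a protocol based on overlap between a predicted footprint and the true footprint would need a different scoring function.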

ISS-Geo142 is a curated benchmark for geolocating astronaut photography taken from the International Space Station (ISS). It comprises 142 images with associated metadata and manually determined geographic locations, addressing the difficulty that the Earth locations shown in ISS images are typically not georeferenced.

ISS-Geo142: A Benchmark for Geolocating Astronaut Photography from the International Space Station

arXiv:2511.17220v1 Announce Type: cross 
Abstract: This study presents PARROT (Persuasion and Agreement Robustness Rating of Output Truth), a robustness-focused framework designed to measure the degradation in accuracy that large language models (LLMs) suffer under social pressure exerted through authority and persuasion, the phenomenon of sycophancy (excessive conformity). PARROT (i) isolates causal effects by comparing the neutral version of the same question with an authoritatively false version using a double-blind evaluation, (ii) quantifies confidence shifts toward the correct and imposed false responses using log-likelihood-based calibration tracking, and (iii) systematically classifies failure modes (e.g., robust correct, sycophantic agreement, reinforced error, stubborn error, self-correction, etc.) using an eight-state behavioral taxonomy. We evaluated 22 models using 1,302 MMLU-style multiple-choice questions across 13 domains and domain-specific authority templates. Findings show marked heterogeneity: advanced models (e.g., GPT-5, GPT-4.1, Claude Sonnet 4.5) exhibit low "follow rates" (≤ 11%, GPT-5: 4%) and minimal accuracy loss, while older/smaller models show severe epistemic collapse (GPT-4: 80%, Qwen 2.5-1.5B: 94%). The danger is not limited to response changes; weak models reduce confidence in the correct response while increasing confidence in the imposed incorrect response. At the domain level, international law and global knowledge exhibit high fragility, while elementary mathematics is relatively resilient. Consequently, we argue that the goal of "resistance to overfitting pressure" should be addressed as a primary objective alongside accuracy, harm avoidance, and privacy for safe deployment in the real world.
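
The abstract quotes "follow rates" and accuracy loss but gives no formulas. One plausible reading, sketched below in Python, treats the follow rate as the share of questions where the model's answer under the authority prompt equals the imposed false option, and accuracy loss as neutral accuracy minus pressured accuracy. The `Item` record and both definitions are assumptions for illustration and may differ from the paper's exact metrics.

```python
from dataclasses import dataclass


@dataclass
class Item:
    correct: str    # gold answer label
    asserted: str   # false label pushed by the authority prompt
    neutral: str    # model answer to the neutral version of the question
    pressured: str  # model answer to the authoritatively false version


def follow_rate(items):
    """Share of items where the pressured answer equals the imposed false label
    (assumed definition)."""
    return sum(it.pressured == it.asserted for it in items) / len(items)


def accuracy_drop(items):
    """Neutral accuracy minus pressured accuracy (assumed definition)."""
    n = len(items)
    acc_neutral = sum(it.neutral == it.correct for it in items) / n
    acc_pressured = sum(it.pressured == it.correct for it in items) / n
    return acc_neutral - acc_pressured


# Hypothetical toy data, not results from the paper.
items = [
    Item(correct="A", asserted="B", neutral="A", pressured="B"),
    Item(correct="A", asserted="C", neutral="A", pressured="A"),
    Item(correct="D", asserted="B", neutral="C", pressured="B"),
    Item(correct="B", asserted="A", neutral="B", pressured="B"),
]
print(follow_rate(items), accuracy_drop(items))
```

Because each question appears in both a neutral and a pressured version, the paired records also carry what a per-item behavioral classification would need, although the eight-state taxonomy itself is not reproduced here.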

The study introduces PARROT, a framework designed to assess accuracy degradation in large language models (LLMs) under social pressure, focusing on the phenomenon of sycophancy. By comparing each question's neutral version with an authoritatively false version, PARROT quantifies confidence shifts and classifies failure modes across 22 models evaluated on 1,302 questions in 13 domains.

Parrot: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs

arXiv:2511.17301v1 Announce Type: new 
Abstract: Sentiment analysis can aid in understanding people's opinions and emotions on social issues. In multilingual communities, sentiment analysis systems can be used to quickly identify social challenges in social media posts, enabling government departments to detect and address these issues more precisely and effectively. Recently, large language models (LLMs) have become available to the general public, and initial analyses have shown that they exhibit strong zero-shot sentiment analysis abilities in English. However, no prior work has investigated leveraging LLMs for sentiment analysis of social media posts in South African languages to detect social challenges. Consequently, in this work, we analyse the zero-shot performance of the state-of-the-art LLMs GPT-3.5, GPT-4, LlaMa 2, PaLM 2, and Dolly 2 to investigate the sentiment polarities of the 10 top emerging topics in English, Sepedi and Setswana social media posts that fall within the jurisdictional areas of 10 South African government departments. Our results demonstrate that there are large differences between the various LLMs, topics, and languages. In addition, we show that a fusion of the outcomes of different LLMs provides large gains in sentiment classification performance, with sentiment classification errors below 1%. Consequently, it is now feasible to provide systems that generate reliable sentiment information to detect social challenges and draw conclusions about possible needs for action on specific topics and within different language groups.
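
The abstract does not say how the outcomes of the different LLMs are fused. Majority voting is one common choice, and the Python sketch below uses it purely as an illustration. The function name, the label strings, and the tie-breaking fallback are assumptions, not the authors' method.

```python
from collections import Counter


def fuse_by_majority(labels, fallback="neutral"):
    """Return the most frequent label among per-model predictions.

    Ties for first place (and empty input) return `fallback`; this
    tie-breaking rule is a hypothetical choice.
    """
    if not labels:
        return fallback
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return fallback
    return counts[0][0]


# Hypothetical per-model predictions for one post.
votes = ["negative", "negative", "positive", "negative", "neutral"]
fused = fuse_by_majority(votes)
```

Other fusion schemes, such as weighting each model by its accuracy on a held-out set for the given language, would fit the same interface.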

Recent research has explored the application of large language models (LLMs) for sentiment analysis in South African languages, focusing on their ability to detect social challenges through social media posts. The study specifically evaluates the zero-shot performance of models like GPT-3.5, GPT-4, LlaMa 2, PaLM 2, and Dolly 2 in analyzing sentiment polarities across topics in English, Sepedi, and Setswana.

Large Language Models for Sentiment Analysis to Detect Social Challenges: A Use Case with South African Languages

arXiv:2511.17380v1 Announce Type: cross 
Abstract: Deep learning (DL) models, despite their remarkable success, remain vulnerable to small input perturbations that can cause erroneous outputs, motivating the recent proposal of probabilistic robustness (PR) as a complementary alternative to adversarial robustness (AR). However, existing PR formulations assume a fixed and known perturbation distribution, an unrealistic expectation in practice. To address this limitation, we propose non-parametric probabilistic robustness (NPPR), a more practical PR metric that does not rely on any predefined perturbation distribution. Following the non-parametric paradigm in statistical modeling, NPPR learns an optimized perturbation distribution directly from data, enabling conservative PR evaluation under distributional uncertainty. We further develop an NPPR estimator based on a Gaussian Mixture Model (GMM) with Multilayer Perceptron (MLP) heads and bicubic up-sampling, covering various input-dependent and input-independent perturbation scenarios. Theoretical analyses establish the relationships among AR, PR, and NPPR. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet across ResNet18/50, WideResNet50 and VGG16 validate NPPR as a more practical robustness metric, showing up to 40% more conservative (lower) PR estimates than those obtained by assuming the common perturbation distributions used in state-of-the-art methods.
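
For context, the fixed-distribution PR that NPPR is contrasted with can be estimated by simple Monte Carlo sampling: draw perturbations from the assumed distribution and count how often the prediction stays correct. The Python sketch below shows only that baseline, with a uniform perturbation bounded in the L-infinity norm and a toy classifier standing in for a real network. It does not implement the paper's GMM-based NPPR estimator, and every name in it is hypothetical.

```python
import random


def uniform_linf(eps):
    """Sampler for perturbations drawn uniformly from [-eps, eps] per coordinate."""
    def sample(rng, dim):
        return [rng.uniform(-eps, eps) for _ in range(dim)]
    return sample


def estimate_pr(classify, x, label, sample_perturbation, n_samples=1000, seed=0):
    """Monte Carlo estimate of P(classify(x + delta) == label) for delta drawn
    from a fixed, known perturbation distribution."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        delta = sample_perturbation(rng, len(x))
        x_pert = [xi + di for xi, di in zip(x, delta)]
        correct += classify(x_pert) == label
    return correct / n_samples


# Toy linear classifier used only for illustration.
def toy_classifier(v):
    return int(sum(v) > 0)


pr_far = estimate_pr(toy_classifier, [1.0, 1.0], 1, uniform_linf(0.1))
pr_boundary = estimate_pr(toy_classifier, [0.0, 0.0], 1, uniform_linf(0.1))
```

The estimate depends entirely on the sampler passed in, which is the dependence on a predefined distribution that the abstract identifies as the limitation of existing PR formulations.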

A new approach to probabilistic robustness in deep learning, termed non-parametric probabilistic robustness (NPPR), has been proposed, which learns optimized perturbation distributions directly from data rather than relying on fixed distributions. This method aims to enhance the evaluation of model robustness under distributional uncertainty, addressing a significant limitation in existing probabilistic robustness frameworks.

Non-Parametric Probabilistic Robustness: A Conservative Metric with Optimized Perturbation Distributions

arXiv:2511.13182v3 Announce Type: replace 
Abstract: Automatic diacritic restoration is crucial for text processing in languages with rich diacritical marks, such as Romanian. This study evaluates the performance of several large language models (LLMs) in restoring diacritics in Romanian texts. Using a comprehensive corpus, we tested models including OpenAI's GPT-3.5, GPT-4, GPT-4o, Google's Gemini 1.0 Pro, Meta's Llama 2 and Llama 3, MistralAI's Mixtral 8x7B Instruct, airoboros 70B, and OpenLLM-Ro's RoLlama 2 7B, under multiple prompt templates ranging from zero-shot to complex multi-shot instructions. Results show that models such as GPT-4o achieve high diacritic restoration accuracy, consistently surpassing a neutral echo baseline, while others, including Meta's Llama family, exhibit wider variability. These findings highlight the impact of model architecture, training data, and prompt design on diacritic restoration performance and outline promising directions for improving NLP tools for diacritic-rich languages.
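
The abstract compares models against a "neutral echo baseline" without defining it or the accuracy metric. A plausible reading is a baseline that returns the diacritic-free input unchanged, scored by character-level agreement with the reference text. The Python sketch below follows that reading; the function names and the metric are assumptions rather than the study's exact setup.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (e.g. 'ș' -> 's')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def char_accuracy(predicted, reference):
    """Share of character positions where predicted equals reference.

    Both strings are NFC-normalized first; they must then have equal length,
    which holds when the only differences are diacritics.
    """
    predicted = unicodedata.normalize("NFC", predicted)
    reference = unicodedata.normalize("NFC", reference)
    if len(predicted) != len(reference):
        raise ValueError("strings must have equal length after normalization")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


reference = "țară și pâine"               # text with correct Romanian diacritics
model_input = strip_diacritics(reference)  # what a model would receive
echo_score = char_accuracy(model_input, reference)  # score of echoing the input
```

Under this reading, a model is only useful if its restored text scores above `echo_score`, since echoing already gets every non-diacritic character right.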

A recent study evaluated the performance of various large language models (LLMs) in restoring diacritics in Romanian texts, highlighting the importance of automatic diacritic restoration for effective text processing in languages rich in diacritical marks. Models tested included OpenAI's GPT-3.5 and GPT-4 and Google's Gemini 1.0 Pro, among others, with GPT-4o achieving notable accuracy in diacritic restoration.

