arXiv:2511.10871v1 Announce Type: new 
Abstract: LLMs are increasingly employed as judges across a variety of tasks, including those involving everyday social interactions. Yet, it remains unclear whether such LLM-judges can reliably assess tasks that require social or conversational judgment. We investigate how an LLM's conviction is changed when a task is reframed from a direct factual query to a Conversational Judgment Task. Our evaluation framework contrasts the model's performance on direct factual queries with its assessment of a speaker's correctness when the same information is presented within a minimal dialogue, effectively shifting the query from "Is this statement correct?" to "Is this speaker correct?". Furthermore, we apply pressure in the form of a simple rebuttal ("The previous answer is incorrect.") to both conditions. This perturbation allows us to measure how firmly the model maintains its position under conversational pressure. Our findings show that while some models like GPT-4o-mini reveal sycophantic tendencies under social framing tasks, others like Llama-8B-Instruct become overly-critical. We observe an average performance change of 9.24% across all models, demonstrating that even minimal dialogue context can significantly alter model judgment, underscoring conversational framing as a key factor in LLM-based evaluation. The proposed framework offers a reproducible methodology for diagnosing model conviction and contributes to the development of more trustworthy dialogue systems.

تستكشف المقالة تأثير تأطير المهام على قناعة نماذج اللغة الكبيرة (LLMs) في أنظمة الحوار. تتناول كيفية تقييم نماذج LLM للمهام التي تتطلب حكمًا اجتماعيًا، من خلال مقارنة أدائها في الاستفسارات الواقعية مع مهام الحكم المحادثاتي. تكشف الدراسة أن إعادة تأطير المهمة يمكن أن تغير بشكل كبير حكم نموذج LLM، خاصة تحت الضغط المحادثاتي، مما يبرز تعقيدات اتخاذ القرار لدى نماذج LLM في السياقات الاجتماعية.

El artículo investiga el impacto del encuadre de tareas en la convicción de los modelos de lenguaje de gran tamaño (LLMs) en sistemas de diálogo. Explora cómo los LLMs evalúan tareas que requieren juicio social, contrastando su rendimiento en consultas fácticas con tareas de juicio conversacional. El estudio revela que el recuadro de una tarea puede alterar significativamente el juicio de un LLM, especialmente bajo presión conversacional, destacando las complejidades de la toma de decisiones de los LLM en contextos sociales.

L'article examine l'impact du cadrage des tâches sur la conviction des modèles de langage de grande taille (LLMs) dans les systèmes de dialogue. Il explore comment les LLMs évaluent des tâches nécessitant un jugement social, en contrastant leur performance sur des requêtes factuelles avec des tâches de jugement conversationnel. L'étude révèle que le recadrage d'une tâche peut modifier de manière significative le jugement d'un LLM, en particulier sous pression conversationnelle, soulignant les complexités de la prise de décision des LLM dans des contextes sociaux.

The article investigates the impact of task framing on the conviction of large language models (LLMs) in dialogue systems. It explores how LLMs assess tasks requiring social judgment, contrasting their performance on factual queries with conversational judgment tasks. The study reveals that reframing a task can significantly alter an LLM's judgment, particularly under conversational pressure, highlighting the complexities of LLM decision-making in social contexts.

From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems

Was this article worth reading? Share it

LucidQuery AI

MyFramework

Langfuse

Langtail

Usercall

Supametas.AI

Ready to build your own newsroom?