Domain-Grounded Evaluation of LLMs in International Student Knowledge

arXiv — cs.LG · Thursday, November 27, 2025 at 5:00:00 AM
  • A recent study evaluated the reliability of large language models (LLMs) in providing guidance to international students on critical topics such as admissions and visas. The research, based on realistic questions from ApplyBoard's advising workflows, assessed both the accuracy of the information provided and the occurrence of unsupported claims, known as hallucinations.
  • This evaluation is significant as it highlights the potential risks associated with relying on LLMs for high-stakes decision-making in education. Ensuring that these models provide accurate and complete information is crucial for students navigating complex processes like studying abroad.
  • The findings reflect broader concerns regarding the reliability of LLMs across various applications, including their tendency to generate hallucinations and inconsistencies. As LLMs are increasingly integrated into diverse sectors, understanding their limitations and improving their trustworthiness remains a pressing challenge for developers and users alike.
— via World Pulse Now AI Editorial System


Continue Reading
A Systematic Analysis of Large Language Models with RAG-enabled Dynamic Prompting for Medical Error Detection and Correction
Positive · Artificial Intelligence
A systematic analysis has been conducted on large language models (LLMs) utilizing retrieval-augmented dynamic prompting (RDP) for the detection and correction of medical errors. The study evaluated various prompting strategies, including zero-shot and static prompting, using the MEDEC dataset and nine instruction-tuned LLMs, revealing performance metrics such as accuracy and recall in error processing tasks.
Subgoal Graph-Augmented Planning for LLM-Guided Open-World Reinforcement Learning
Positive · Artificial Intelligence
A new framework called Subgoal Graph-Augmented Actor-Critic-Refiner (SGA-ACR) has been proposed to enhance the planning capabilities of large language models (LLMs) in reinforcement learning (RL) by integrating environment-specific subgoal graphs and structured entity knowledge. This addresses the misalignment between abstract planning and executable actions in RL environments.
Visualizing LLM Latent Space Geometry Through Dimensionality Reduction
Positive · Artificial Intelligence
Recent research has visualized the latent space geometry of large language models (LLMs) through dimensionality reduction techniques, specifically Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP). The study focused on Transformer-based models such as GPT-2 and LLaMA, revealing distinct geometric patterns in their latent states, including a separation between attention and MLP outputs across layers.
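The PCA step of such a visualization can be sketched in plain NumPy. Note this is a minimal illustration, not the paper's pipeline: the hidden-state array below is random stand-in data, not actual GPT-2 or LLaMA activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for hidden states collected at one Transformer layer:
# 200 tokens, each a 64-dimensional latent vector.
hidden_states = rng.normal(size=(200, 64))

# PCA via SVD: center the data, then project onto the top-2
# right singular vectors (the principal components).
centered = hidden_states - hidden_states.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
projected_2d = centered @ vt[:2].T  # (200, 2) points, ready to plot

# Fraction of total variance captured by each retained component.
explained = singular_values[:2] ** 2 / (singular_values ** 2).sum()
print(projected_2d.shape, explained)
```

Plotting `projected_2d` (e.g. colored by layer or by sublayer type) is what reveals the kind of geometric separation the study reports; UMAP would replace the SVD projection with a nonlinear embedding.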
How to Correctly Report LLM-as-a-Judge Evaluations
Neutral · Artificial Intelligence
Large language models (LLMs) are increasingly utilized as evaluators, but their judgments can be noisy due to imperfect specificity and sensitivity, leading to biased accuracy estimates. A new framework has been proposed to correct these biases and construct confidence intervals that reflect uncertainty from both test and calibration datasets, enhancing the reliability of LLM evaluations.
Augur: Modeling Covariate Causal Associations in Time Series via Large Language Models
Positive · Artificial Intelligence
Augur, a newly introduced framework for time series forecasting, leverages large language models (LLMs) to identify and exploit directed causal associations among covariates. Its two-stage architecture pairs a teacher LLM, which infers a causal graph, with a student agent that refines that graph for improved forecasting accuracy.
The Journey of a Token: What Really Happens Inside a Transformer
Neutral · Artificial Intelligence
Large language models (LLMs) utilize the transformer architecture, a sophisticated deep neural network that processes input as sequences of token embeddings. This architecture is crucial for enabling LLMs to understand and generate human-like text, making it a cornerstone of modern artificial intelligence applications.
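The journey described, an embedding lookup followed by attention and MLP sublayers joined by residual connections, can be sketched in miniature with NumPy. All dimensions and weights here are illustrative random stand-ins, not any real model's parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, d_model = 50, 16

# Learned lookup table: each token id maps to an embedding vector.
embedding = rng.normal(scale=0.1, size=(vocab, d_model))

# Random stand-ins for learned projection weights.
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(d_model, d_model))
                 for _ in range(3))
w_up = rng.normal(scale=0.1, size=(d_model, 4 * d_model))
w_down = rng.normal(scale=0.1, size=(4 * d_model, d_model))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def transformer_block(x):
    # Attention sublayer: every token attends to every token,
    # and the mixed result is added back via a residual connection.
    scores = (x @ w_q) @ (x @ w_k).T / np.sqrt(d_model)
    x = x + softmax(scores) @ (x @ w_v)
    # MLP sublayer: per-token expand-then-contract, again residual.
    x = x + np.maximum(x @ w_up, 0.0) @ w_down
    return x

token_ids = np.array([3, 14, 7, 7, 42])     # a toy input sequence
hidden = transformer_block(embedding[token_ids])
print(hidden.shape)  # one d_model-dim vector per input token
```

A real model stacks dozens of such blocks (with layer normalization, multiple heads, and positional information) and ends with a projection back to vocabulary logits, but the per-token data flow is the same.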
Can LLMs Faithfully Explain Themselves in Low-Resource Languages? A Case Study on Emotion Detection in Persian
Neutral · Artificial Intelligence
A recent study investigates the ability of large language models (LLMs) to provide faithful self-explanations in low-resource languages, focusing on emotion detection in Persian. The research compares model-generated explanations with those from human annotators, revealing discrepancies in faithfulness despite strong classification performance. Two prompting strategies were tested to assess their impact on explanation reliability.
Improved LLM Agents for Financial Document Question Answering
Positive · Artificial Intelligence
Recent advancements in large language models (LLMs) have led to the development of improved critic and calculator agents designed for financial document question answering. This research highlights the limitations of traditional critic agents when oracle labels are unavailable, demonstrating a significant performance drop in such scenarios. The new agents not only improve accuracy but also enable safer interactions between the critic and calculator components.