Adapting Large Language Models to Low-Resource Tibetan: A Two-Stage Continual and Supervised Fine-Tuning Study
Positive | Artificial Intelligence
- A study adapted the Qwen2.5-3B large language model to Tibetan through a two-stage process of Continual Pretraining (CPT) on Tibetan text followed by Supervised Fine-Tuning (SFT). The approach addresses data scarcity and cross-lingual drift, yielding significant gains in translation quality and lower perplexity (an illustrative sketch of the recipe follows this list).
- The work matters for extending the capabilities of large language models to low-resource languages such as Tibetan, which have historically been underrepresented in natural language processing. The successful adaptation could pave the way for better linguistic resources and tools for Tibetan speakers.
- The findings reflect a broader trend in artificial intelligence toward adapting language models to diverse linguistic contexts, aligning with ongoing research on multilingual models and effective strategies for low-resource languages, and underscoring the importance of linguistic diversity in AI development.
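
A minimal sketch of such a two-stage recipe is shown below, using Hugging Face Transformers. The base checkpoint name follows the public Qwen2.5-3B release; the toolkit choice, the dataset files (`tibetan_corpus.txt`, `tibetan_sft.jsonl`), the hyperparameters, and the prompt formatting are illustrative assumptions, not the study's actual configuration.

```python
# Illustrative two-stage adaptation: CPT on raw Tibetan text, then SFT on
# instruction/translation pairs. File names and hyperparameters are placeholders.
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)
from datasets import load_dataset

MODEL_NAME = "Qwen/Qwen2.5-3B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

# --- Stage 1: Continual Pretraining (CPT) on a monolingual Tibetan corpus ---
cpt_data = load_dataset("text", data_files={"train": "tibetan_corpus.txt"})["train"]
cpt_data = cpt_data.map(tokenize, batched=True, remove_columns=["text"])

cpt_args = TrainingArguments(
    output_dir="qwen2.5-3b-tibetan-cpt",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    learning_rate=1e-5,
    bf16=True,
)
Trainer(
    model=model,
    args=cpt_args,
    train_dataset=cpt_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()

# --- Stage 2: Supervised Fine-Tuning (SFT) on prompt/response pairs ---
# "tibetan_sft.jsonl" is assumed to hold {"prompt": ..., "response": ...} rows.
def format_pair(example):
    return {"text": example["prompt"] + "\n" + example["response"]}

sft_data = load_dataset("json", data_files={"train": "tibetan_sft.jsonl"})["train"]
sft_data = sft_data.map(format_pair)
sft_data = sft_data.map(tokenize, batched=True, remove_columns=sft_data.column_names)

sft_args = TrainingArguments(
    output_dir="qwen2.5-3b-tibetan-sft",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=2e-5,
    bf16=True,
)
Trainer(
    model=model,
    args=sft_args,
    train_dataset=sft_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

The key idea is that Stage 1 continues the plain causal language modeling objective on Tibetan text to shift the model's distribution toward the target language, while Stage 2 specializes it for instructions and translation. A full run would typically also mask prompt tokens in the SFT loss and track perplexity on a held-out Tibetan set.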
— via World Pulse Now AI Editorial System
