KyrgyzBERT: A Compact, Efficient Language Model for Kyrgyz NLP

arXiv — cs.CL · Wednesday, November 26, 2025 at 5:00:00 AM
  • KyrgyzBERT has been introduced as the first publicly available monolingual BERT-based language model designed specifically for Kyrgyz, featuring 35.9 million parameters and a custom tokenizer. The model addresses the scarcity of foundational NLP tools for Kyrgyz, a low-resource language. For evaluation, the authors also created a sentiment analysis benchmark, kyrgyz-sst2; fine-tuned on this dataset, KyrgyzBERT reaches a competitive F1-score of 0.8280 (a fine-tuning sketch follows this list).
  • The development of KyrgyzBERT is significant because it gives researchers and developers foundational tools for advancing natural language processing in Kyrgyz, with potential gains in machine translation, sentiment analysis, and other NLP tasks. By releasing the model and its associated data publicly, the authors encourage further research and innovation in this underrepresented language.
  • This advancement also highlights ongoing challenges in machine translation and language processing for low-resource languages. The use of asymmetrical Byte Pair Encoding in related studies points to a need for tailored approaches that account for the morphological structure of languages like Kyrgyz, underscoring the value of specialized tools for translation and understanding across diverse linguistic contexts (a tokenizer-training sketch also follows this list).
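
For readers who want a concrete picture of the evaluation setup, here is a minimal sketch of fine-tuning a compact monolingual BERT encoder on a binary sentiment benchmark with Hugging Face Transformers. The checkpoint and dataset identifiers ("kyrgyzbert-base", "kyrgyz-sst2"), the column name "sentence", and the hyperparameters are illustrative assumptions, not values released or reported by the authors.

```python
# Minimal fine-tuning sketch; all identifiers and hyperparameters are placeholders.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "kyrgyzbert-base"           # placeholder checkpoint id, not the published one
dataset = load_dataset("kyrgyz-sst2")  # placeholder dataset id, not the published one

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

def tokenize(batch):
    # Pad/truncate each sentence to a fixed length so examples can be batched.
    return tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    # The benchmark result is reported as an F1-score, so evaluate with binary F1.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds)}

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="kyrgyzbert-sst2",
        num_train_epochs=3,
        per_device_train_batch_size=32,
        learning_rate=2e-5,
    ),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())  # prints the F1 on the validation split
```

Because the encoder has only about 35.9 million parameters, a fine-tuning run of this kind fits comfortably on a single consumer GPU, which is part of what makes a compact monolingual model practical for low-resource settings.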
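
The custom tokenizer and the Byte Pair Encoding discussion above come down to training a subword vocabulary on Kyrgyz text. Below is a minimal sketch using the Hugging Face `tokenizers` library; the corpus file, vocabulary size, and special tokens are illustrative assumptions and do not reflect the authors' actual configuration.

```python
# Illustrative BPE tokenizer training; corpus path and vocab size are assumptions.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=30_000,  # placeholder; a morphologically rich language may favor a different size
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

tokenizer.train(files=["kyrgyz_corpus.txt"], trainer=trainer)  # placeholder corpus file
tokenizer.save("kyrgyz-bpe.json")

# Inspect how an inflected Kyrgyz word is segmented into subword units.
print(tokenizer.encode("мектептерибизде").tokens)
```

A vocabulary learned directly from Kyrgyz text tends to split inflected word forms into reusable stems and suffixes, which is the kind of tailored segmentation the bullet above argues for.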
— via World Pulse Now AI Editorial System
