Training Language Models to Use Prolog as a Tool

arXiv — cs.CL · Tuesday, December 9, 2025
  • Researchers have fine-tuned a language model, Qwen2.5-3B-Instruct, to use Prolog as a tool for verifiable computation. Trained with Group Relative Policy Optimization (GRPO), the model shows improved performance on reasoning tasks, achieving zero-shot MMLU results comparable to those of larger models (a sketch of a GRPO-style verification reward appears after this list).
  • The significance of this work lies in its potential to make AI systems more reliable: by delegating computation to a formally checkable tool, it addresses the common failure mode of language models producing plausible but incorrect outputs, improving their utility in critical applications.
  • This advancement reflects a growing trend in AI research toward integrating formal logic and reasoning capabilities into language models, underscoring the importance of reliable tool use. It also aligns with broader efforts to train models for complex tasks and to improve their reasoning abilities.
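
The summary does not spell out the paper's reward design, but the core loop it implies — execute model-generated Prolog and score the result — can be sketched. Below is a minimal, hypothetical reward function in Python that writes a candidate Prolog program to a file, runs it through SWI-Prolog (assuming `swipl` is on the PATH), and returns a binary score suitable for GRPO. The function name `prolog_reward`, the `solve/1` predicate, and the binary reward scheme are illustrative assumptions, not the paper's implementation.

```python
import os
import subprocess
import tempfile

def prolog_reward(program: str, query: str, expected: str,
                  timeout: float = 5.0) -> float:
    """Hypothetical verifiable reward: 1.0 if the model-generated Prolog
    program, when queried, prints the expected answer; 0.0 otherwise."""
    # Write the candidate program to a temporary .pl file.
    with tempfile.NamedTemporaryFile("w", suffix=".pl", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        # Consult the file, run the query, print its binding, then halt.
        # If the query fails or raises, swipl exits with a non-zero status.
        goal = f"consult('{path}'), {query}, halt"
        result = subprocess.run(
            ["swipl", "-q", "-g", goal],
            capture_output=True, text=True, timeout=timeout,
        )
        ok = result.returncode == 0 and expected in result.stdout
        return 1.0 if ok else 0.0
    except subprocess.TimeoutExpired:
        return 0.0  # non-terminating programs earn zero reward
    finally:
        os.unlink(path)

# Example: a correct program for "what is 6 * 7?" scores 1.0.
program = "solve(X) :- X is 6 * 7."
print(prolog_reward(program, "solve(X), write(X)", "42"))
```

In a GRPO setup, a scalar reward like this would score each sampled completion in a group, with per-sample advantages computed relative to the group's mean reward; the actual prompt format and any partial-credit shaping used in the paper may differ.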
— via World Pulse Now AI Editorial System
