arXiv:2511.03407v2 Announce Type: replace 
Abstract: Small language models (SLMs) have shown promise for relation extraction (RE) when extracting RDF triples guided by SHACL shapes focused on common datatype properties. This paper investigates how SLMs handle both datatype and object properties for complete RDF graph extraction. We show that the key bottleneck is the long-tail distribution of rare properties. To address this issue, we evaluate several strategies: stratified sampling, weighted loss, dataset scaling, and template-based synthetic data augmentation. We show that the best strategy for performing equally well across unbalanced target properties is to build a training set in which the number of occurrences of each property exceeds a given threshold. To enable reproducibility, we have publicly released our datasets, experimental results, and code. Our findings offer practical guidance for training shape-aware SLMs and highlight promising directions for future work in semantic RE.
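
The abstract does not say how the thresholded training set is actually built, so the following is only a minimal sketch of one possible approach: a greedy pass that keeps an example whenever at least one of its properties is still below the threshold. The function name `select_min_count`, the `properties` field, and the sample property names are all hypothetical and not taken from the paper's released code.

```python
from collections import Counter


def select_min_count(examples, threshold):
    """Greedily keep examples until each property occurs at least `threshold` times.

    Hypothetical helper: `examples` is a list of dicts with a "properties" list.
    Returns the selected examples, the per-property counts in the selection,
    and the properties that still fall short of the threshold.
    """
    # Overall frequency of each property in the candidate pool.
    freq = Counter(p for ex in examples for p in ex["properties"])
    # Visit examples that contain the rarest properties first.
    ordered = sorted(
        examples,
        key=lambda ex: min((freq[p] for p in ex["properties"]), default=float("inf")),
    )
    counts = Counter()
    selected = []
    for ex in ordered:
        # Keep the example only if it helps an under-represented property.
        if any(counts[p] < threshold for p in ex["properties"]):
            selected.append(ex)
            counts.update(ex["properties"])
    # Properties the pool cannot lift to the threshold (candidates for augmentation).
    short = {p: counts[p] for p in freq if counts[p] < threshold}
    return selected, counts, short


# Illustrative data only: "name" is frequent, "spouse" is rare.
examples = [
    {"text": "a", "properties": ["name"]},
    {"text": "b", "properties": ["name"]},
    {"text": "c", "properties": ["name", "spouse"]},
    {"text": "d", "properties": ["name"]},
    {"text": "e", "properties": ["spouse"]},
]
selected, counts, short = select_min_count(examples, threshold=2)
```

In this sketch, any property reported in `short` cannot reach the threshold from the available pool alone, which is where a strategy such as the template-based synthetic augmentation mentioned in the abstract could plausibly come in.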

Small language models (SLMs) have shown potential for relation extraction (RE), specifically for extracting RDF triples guided by SHACL shapes, though earlier results focused on common datatype properties. A recent study identifies the long-tail distribution of rare properties as the key bottleneck when extracting both datatype and object properties for a complete RDF graph, and evaluates several strategies to address it.

Overcoming the Generalization Limits of SLM Finetuning for Shape-Based Extraction of Datatype and Object Properties

arXiv:2511.18364v1 Announce Type: cross 
Abstract: Building high-quality knowledge graphs (KGs) from diverse sources requires combining methods for information extraction, data transformation, ontology mapping, entity matching, and data fusion. Numerous methods and tools exist for each of these tasks, but support for combining them into reproducible and effective end-to-end pipelines is still lacking. We present a new framework, KGpipe, for defining and executing integration pipelines that can combine existing tools or LLM (Large Language Model) functionality. To evaluate different pipelines and the resulting KGs, we propose a benchmark for integrating heterogeneous data of different formats (RDF, JSON, text) into a seed KG. We demonstrate the flexibility of KGpipe by running and comparatively evaluating several pipelines that integrate sources of the same or different formats, using selected performance and quality metrics.
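
To make the idea of composing integration steps into a pipeline concrete, here is a rough sketch under stated assumptions. It does not use or reproduce KGpipe's actual API, which the abstract does not describe; the `Pipeline` class, the `add_source` and `map_predicate` helpers, and the `ex:`/`src:` identifiers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]
Step = Callable[[Set[Triple]], Set[Triple]]


@dataclass
class Pipeline:
    """Hypothetical pipeline: an ordered list of steps applied to a seed KG."""
    name: str
    steps: List[Step] = field(default_factory=list)

    def run(self, seed: Set[Triple]) -> Set[Triple]:
        kg = set(seed)  # copy so the seed KG is left untouched
        for step in self.steps:
            kg = step(kg)
        return kg


def add_source(triples: Iterable[Triple]) -> Step:
    """Step that merges triples from a new source into the KG."""
    source = set(triples)
    return lambda kg: kg | source


def map_predicate(old: str, new: str) -> Step:
    """Step that rewrites one source predicate to the target ontology's predicate."""
    return lambda kg: {(s, new if p == old else p, o) for s, p, o in kg}


seed = {("ex:A", "ex:name", "Alice")}
source = [("ex:A", "src:label", "Alice"), ("ex:B", "src:label", "Bob")]

pipeline = Pipeline(
    name="rdf-source-demo",
    steps=[add_source(source), map_predicate("src:label", "ex:name")],
)
result = pipeline.run(seed)
```

Treating each task as a function from KG to KG keeps steps interchangeable, so a tool-backed step and an LLM-backed step could in principle be swapped and the resulting KGs compared.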

KGpipe has been introduced as a framework for generating and evaluating pipelines that integrate diverse data sources into knowledge graphs (KGs). This framework addresses the existing gap in combining various methods for information extraction, data transformation, and entity matching into effective end-to-end solutions.
