arXiv:2511.08537v1 Announce Type: new 
Abstract: This report presents a detailed methodology for constructing a high-quality Semantic Role Labeling (SRL) dataset from the Wall Street Journal (WSJ) portion of the OntoNotes 5.0 corpus and adapting it for Opinion Role Labeling (ORL) tasks. Leveraging the PropBank annotation framework, we implement a reproducible extraction pipeline that aligns predicate-argument structures with surface text, converts syntactic tree pointers to coherent spans, and applies rigorous cleaning to ensure semantic fidelity. The resulting dataset comprises 97,169 predicate-argument instances with clearly defined Agent (ARG0), Predicate (REL), and Patient (ARG1) roles, mapped to ORL's Holder, Expression, and Target schema. We provide a detailed account of our extraction algorithms, discontinuous argument handling, annotation corrections, and statistical analysis of the resulting dataset. This work offers a reusable resource for researchers aiming to leverage SRL for enhancing ORL, especially in low-resource opinion mining scenarios.
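The role correspondence stated in the abstract (ARG0 to Holder, REL to Expression, ARG1 to Target) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `Span` class, the `srl_to_orl` and `render` helpers, the token-index span format, and the example sentence are all hypothetical, chosen only to show the relabeling step.

```python
from dataclasses import dataclass

# Role correspondence taken from the abstract:
# Agent (ARG0) -> Holder, Predicate (REL) -> Expression, Patient (ARG1) -> Target.
SRL_TO_ORL = {"ARG0": "Holder", "REL": "Expression", "ARG1": "Target"}


@dataclass(frozen=True)
class Span:
    """Hypothetical span record: a role label over tokens[start:end]."""
    label: str
    start: int  # inclusive token index
    end: int    # exclusive token index


def srl_to_orl(spans):
    """Relabel SRL spans with ORL roles; spans with unmapped labels are dropped."""
    return [
        Span(SRL_TO_ORL[s.label], s.start, s.end)
        for s in spans
        if s.label in SRL_TO_ORL
    ]


def render(tokens, spans):
    """Return a dict from role label to the surface text of its span."""
    return {s.label: " ".join(tokens[s.start:s.end]) for s in spans}


# Invented example sentence and annotation (not drawn from OntoNotes).
tokens = "The analysts praised the new policy yesterday".split()
srl = [
    Span("ARG0", 0, 2),
    Span("REL", 2, 3),
    Span("ARG1", 3, 6),
    Span("ARGM-TMP", 6, 7),  # modifier role, outside the three-role mapping
]
orl = srl_to_orl(srl)
print(render(tokens, orl))
```

In this sketch, roles other than ARG0, REL, and ARG1 are simply discarded. How the report itself treats modifier roles and discontinuous arguments is not specified in the abstract.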

The report develops a methodology for constructing a high-quality Semantic Role Labeling (SRL) dataset from the Wall Street Journal portion of the OntoNotes 5.0 corpus and adapting it for Opinion Role Labeling (ORL) tasks. The dataset comprises 97,169 predicate-argument instances and is intended to support opinion mining, especially in low-resource scenarios. It is offered as a reusable resource for researchers who want to leverage SRL to improve ORL.

From Semantic Roles to Opinion Roles: SRL Data Extraction for Multi-Task and Transfer Learning in Low-Resource ORL
