IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian
IndoPref is a new preference dataset intended to improve how the Indonesian language is represented in AI research, particularly for large language models (LLMs). Although Indonesian has over 200 million speakers, it has been largely overlooked in preference-based datasets, which are often derived from English translations and so lack cultural authenticity. IndoPref addresses this gap with a fully human-authored dataset of 522 prompts and 4,099 pairwise preferences, all annotated in Indonesian. The prompts span 10 categories, which lets researchers compare the strengths and weaknesses of different LLMs. Inter-annotator agreement, measured with Krippendorff's alpha, is strong, which supports the dataset's reliability. Because the annotations focus on the naturalness and quality of LLM-generated text, IndoPref supports more careful evaluation of models in Indonesian and could help lead to more culturally relevant AI applications.
— via World Pulse Now AI Editorial System
