Ground Truth Generation for Multilingual Historical NLP using LLMs

arXiv — cs.CL, Wednesday, November 19, 2025 at 5:00:00 AM
  • The research focuses on employing large language models to create ground truth data for multilingual historical NLP.
  • This development is crucial as it demonstrates that even small amounts of synthetic data can substantially enhance NLP tools, particularly for under-resourced languages; a minimal sketch of such a generation loop follows below.
— via World Pulse Now AI Editorial System
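
The summary above only gestures at the workflow, but the core idea, prompting an LLM to emit structured annotations that are then treated as synthetic ground truth, can be sketched. The snippet below is a minimal, hypothetical illustration assuming the OpenAI chat API as the backend, named-entity annotation as the target task, and an invented prompt and model name; the article excerpt does not specify any of these.

```python
# Hypothetical sketch: generating silver-standard NER annotations for
# historical text with an LLM. Backend, model, prompt, and JSON output
# contract are all assumptions made for illustration.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_prompt(sentence: str) -> str:
    # Prompt wording is invented, not taken from the paper.
    return (
        "You are an expert annotator of historical texts. Mark every "
        "person (PER), place (LOC), and organisation (ORG) in the sentence "
        "below. Reply with a JSON array of objects shaped like "
        '{"text": "...", "label": "..."} and nothing else.\n\n'
        "Sentence: " + sentence
    )

def annotate(sentence: str) -> list[dict]:
    """Ask the LLM for one sentence's entity annotations."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; the article names none
        messages=[{"role": "user", "content": build_prompt(sentence)}],
        temperature=0.0,  # deterministic output aids reproducibility
    )
    return json.loads(response.choices[0].message.content)

# Invented example sentence in an early-modern register.
print(annotate("Item, paid to Master Coverdale for his passage to Antwerp."))
```

Output produced this way is silver-standard at best and would normally be spot-checked by a human before being used as ground truth.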


Continue Reading
STAGE: A Benchmark for Knowledge Graph Construction, Question Answering, and In-Script Role-Playing over Movie Screenplays
Neutral · Artificial Intelligence
The introduction of STAGE (Screenplay Text, Agents, Graphs and Evaluation) marks a significant advance in narrative understanding. The benchmark evaluates knowledge graph construction, scene-level event summarization, long-context screenplay question answering, and in-script character role-playing across 150 films in English and Chinese.
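
The summary does not say how STAGE scores any of its four tasks, so the following is only a generic illustration of how knowledge-graph-construction benchmarks are commonly scored: exact-match precision, recall, and F1 over predicted (subject, relation, object) triples. The toy triples are invented.

```python
# Generic triple-level scoring, shown for orientation only; STAGE's actual
# metrics and data format are not described in the summary above.

def triple_f1(predicted: set, gold: set) -> tuple:
    """Exact-match precision, recall, and F1 between two sets of triples."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    return precision, recall, (2 * precision * recall / denom if denom else 0.0)

gold = {("Rick", "owns", "Rick's Cafe"), ("Ilsa", "married_to", "Victor")}
predicted = {("Rick", "owns", "Rick's Cafe"), ("Ilsa", "loves", "Rick")}
print(triple_f1(predicted, gold))  # (0.5, 0.5, 0.5)
```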
It's All About the Confidence: An Unsupervised Approach for Multilingual Historical Entity Linking using Large Language Models
Positive · Artificial Intelligence
A new approach called MHEL-LLaMo has been introduced for multilingual historical entity linking, utilizing a combination of a Small Language Model (SLM) and a Large Language Model (LLM). This unsupervised ensemble method addresses challenges in processing historical texts, such as linguistic variation and noisy inputs, by leveraging a multilingual bi-encoder for candidate retrieval and an instruction-tuned LLM for predictions.
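
The two-stage design described above, retrieval with a multilingual bi-encoder followed by an LLM prediction, maps onto a short pipeline sketch. The encoder name, the toy knowledge base, and the llm_choose() stub below are illustrative assumptions, not details of MHEL-LLaMo.

```python
# Hedged sketch of retrieve-then-predict entity linking. Stage 1 shortlists
# knowledge-base candidates with a multilingual bi-encoder; stage 2 would
# hand them to an instruction-tuned LLM (stubbed out here).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

KB = {  # toy knowledge base: entity id -> short description
    "Q90": "Paris, capital city of France",
    "Q167646": "Paris, prince of Troy in Greek mythology",
    "Q830149": "Paris, city in Texas, United States",
}
kb_ids = list(KB)
kb_embeddings = encoder.encode([KB[i] for i in kb_ids], convert_to_tensor=True)

def retrieve(mention_in_context: str, k: int = 2) -> list:
    """Return the k entity ids whose descriptions best match the context."""
    query = encoder.encode(mention_in_context, convert_to_tensor=True)
    scores = util.cos_sim(query, kb_embeddings)[0]
    return [kb_ids[i] for i in scores.topk(k).indices.tolist()]

def llm_choose(mention: str, candidates: list) -> str:
    """Hypothetical stand-in for the instruction-tuned LLM whose prediction
    (and confidence) the unsupervised ensemble would use."""
    prompt = f"Mention: {mention}\nCandidates: {[KB[c] for c in candidates]}"
    _ = prompt  # a real system sends this to the LLM; here we take rank 1
    return candidates[0]

mention = "The treaty was signed at Paris in the year of our Lord 1763."
print(llm_choose(mention, retrieve(mention)))
```

Noisy historical spellings are exactly where dense retrieval helps: the bi-encoder matches on meaning in context rather than on exact surface forms.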
How Order-Sensitive Are LLMs? OrderProbe for Deterministic Structural Reconstruction
Neutral · Artificial Intelligence
A recent study introduced OrderProbe, a deterministic benchmark designed to evaluate the structural reconstruction capabilities of large language models (LLMs) using fixed four-character expressions in Chinese, Japanese, and Korean. This sidesteps a weakness of sentence-level restoration from scrambled inputs, which often lacks a unique solution; a fixed idiom admits exactly one correct order.
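
Because each expression has exactly one canonical order, scoring reduces to exact match over scrambled inputs. A minimal harness in that spirit might look like the sketch below; the idiom list and the restore() stub are illustrative assumptions (Chinese examples only, though the benchmark also covers Japanese and Korean).

```python
# Illustrative OrderProbe-style harness: scramble a four-character
# expression, ask a model to restore it, and score by exact match.
import random

IDIOMS = ["画蛇添足", "四面楚歌", "一石二鸟"]  # well-known chengyu, for illustration

def scramble(idiom: str, rng: random.Random) -> str:
    """Return a permutation of the four characters that differs from the input."""
    chars = list(idiom)
    while True:
        rng.shuffle(chars)
        candidate = "".join(chars)
        if candidate != idiom:
            return candidate

def restore(scrambled: str) -> str:
    """Hypothetical stand-in for the LLM under test."""
    return scrambled  # a real harness would prompt the model here

rng = random.Random(0)
correct = sum(restore(scramble(idiom, rng)) == idiom for idiom in IDIOMS)
print(f"exact-match accuracy: {correct}/{len(IDIOMS)}")
```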
Analyzing Bias in False Refusal Behavior of Large Language Models for Hate Speech Detoxification
Neutral · Artificial Intelligence
A recent study analyzed the false refusal behavior of large language models (LLMs) in the context of hate speech detoxification, revealing that these models disproportionately refuse tasks involving higher semantic toxicity and specific target groups, particularly in English datasets.
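
Measuring that behavior comes down to counting refusals per annotated target group. The sketch below assumes a keyword-based refusal detector, a stubbed model call, and a toy dataset; none of these reflect the study's actual setup.

```python
# Hedged sketch: tally false-refusal rates of a detoxification model by
# target group. Refusal markers, the detoxify() stub, and the toy dataset
# are illustrative assumptions.
from collections import Counter

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")

def is_refusal(reply: str) -> bool:
    reply = reply.lower()
    return any(marker in reply for marker in REFUSAL_MARKERS)

def detoxify(sentence: str) -> str:
    """Hypothetical stand-in for the LLM being probed."""
    return "I cannot help with that."  # degenerate model that always refuses

dataset = [  # (sentence to rewrite, annotated target group), placeholders only
    ("<toxic sentence about group A>", "group A"),
    ("<toxic sentence about group B>", "group B"),
]

refusals, totals = Counter(), Counter()
for sentence, group in dataset:
    totals[group] += 1
    refusals[group] += is_refusal(detoxify(sentence))

for group in totals:
    print(f"{group}: refusal rate {refusals[group] / totals[group]:.0%}")
```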
