FineFreq: A Multilingual Character Frequency Dataset from Web-Scale Text

arXiv — cs.CL · Thursday, December 11, 2025 at 5:00:00 AM
  • FineFreq has been introduced as a large-scale multilingual character frequency dataset, derived from the FineWeb and FineWeb2 corpora, encompassing over 1,900 languages and covering the period from 2013 to 2025. The dataset provides frequency counts for 96 trillion characters processed from 57 TB of compressed text, along with detailed per-character statistics and metadata (a minimal counting sketch appears after this summary).
  • This dataset is significant as it allows for fine-grained temporal analysis of character usage across multiple languages, preserving natural multilingual features such as cross-script borrowings and emojis, which can enhance linguistic research and applications in AI.
  • The development of FineFreq aligns with ongoing advancements in language processing technologies, emphasizing the importance of high-quality datasets for training language models. Innovations like the Length-MAX tokenizer and model-based extraction methods highlight the industry's focus on improving efficiency and accuracy in text representation and processing.
— via World Pulse Now AI Editorial System
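
To make the counting step concrete, here is a minimal Python sketch of per-character frequency counting with basic Unicode metadata, in the spirit of what FineFreq aggregates at corpus scale. The tiny in-memory corpus, field names, and output format are illustrative assumptions, not the dataset's actual schema or pipeline.

```python
# Minimal per-character frequency counting with basic Unicode metadata.
# The in-memory corpus, field names, and output format are illustrative
# assumptions, not FineFreq's actual schema or processing pipeline.
from collections import Counter
import json
import unicodedata

def char_frequencies(lines):
    """Count every Unicode character, keeping emoji and cross-script characters."""
    counts = Counter()
    for line in lines:
        counts.update(line)
    return counts

# A real run would stream documents from a corpus; two sentences stand in here.
corpus = ["Character frequency statistics 📊", "les fréquences de caractères"]
counts = char_frequencies(corpus)

# Emit per-character records with counts and basic Unicode metadata.
records = [
    {
        "char": ch,
        "count": n,
        "codepoint": f"U+{ord(ch):04X}",
        "category": unicodedata.category(ch),
        "name": unicodedata.name(ch, "UNKNOWN"),
    }
    for ch, n in counts.most_common()
]
print(json.dumps(records[:5], ensure_ascii=False, indent=2))
```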


Continue Reading
Interpreto: An Explainability Library for Transformers
Positive · Artificial Intelligence
Interpreto has been launched as a Python library for explaining HuggingFace text models, including BERT and various large language models (LLMs). The library offers two main types of explanations, attributions and concept-based explanations, making it a useful tool for data scientists who need to explain model decisions.
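To illustrate what an attribution explanation is, the sketch below computes generic gradient-times-input scores for a HuggingFace BERT classifier. It does not use Interpreto's actual API; the model choice and scoring scheme are assumptions, and the untrained classification head here only demonstrates the mechanics.

```python
# Generic gradient-times-input token attribution for a HuggingFace BERT
# classifier -- an illustration of the attribution idea only, NOT Interpreto's
# API. Model name and scoring choice are assumptions; the classification head
# is untrained here, so only the mechanics matter.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")

# Embed the tokens explicitly so gradients can flow back to the embeddings.
embeds = model.get_input_embeddings()(inputs["input_ids"]).detach()
embeds.requires_grad_(True)

logits = model(inputs_embeds=embeds, attention_mask=inputs["attention_mask"]).logits
logits[0].max().backward()  # backprop the top-class score

# Gradient x input, summed over the hidden dimension, yields one score per token.
scores = (embeds.grad * embeds).sum(dim=-1).squeeze(0)
for tok, s in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), scores):
    print(f"{tok:>12s} {s.item():+.4f}")
```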
Enhancing Reliability across Short and Long-Form QA via Reinforcement Learning
Positive · Artificial Intelligence
A new reinforcement learning (RL) framework has been introduced to enhance the reliability of large language models (LLMs) in both short- and long-form question answering. The approach targets hallucinations, which lead to inaccurate responses, mitigating both intrinsic and extrinsic hallucinations through purpose-built training sets and reward mechanisms.
Microsoft Tests Copilot-Powered Tool to Modernize JavaScript/TypeScript in VS Code
Positive · Artificial Intelligence
Microsoft has previewed a new tool in VS Code Insiders that leverages GitHub Copilot to modernize JavaScript and TypeScript applications by upgrading npm dependencies and addressing breaking changes. This initiative aims to enhance the development experience for programmers using these languages.
RAVES-Calib: Robust, Accurate and Versatile Extrinsic Self Calibration Using Optimal Geometric Features
Positive · Artificial Intelligence
A new LiDAR-camera calibration toolkit named RAVES-Calib has been introduced, enabling robust and accurate extrinsic self-calibration from a single pair of a LiDAR point cloud and a camera image in targetless environments. The method improves calibration accuracy by adaptively weighting feature costs according to their distribution, and has been validated through extensive experiments across various sensors.
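As a rough, hedged illustration of weighting feature costs by their distribution (the general idea mentioned above, not RAVES-Calib's actual formulation), the toy sketch below down-weights reprojection residuals that fall in the tail of the residual distribution. The camera intrinsics, synthetic points, and MAD-based Huber weights are all assumptions.

```python
# Toy illustration of distribution-aware weighting of feature costs, not
# RAVES-Calib's actual formulation: intrinsics, synthetic points, noise model,
# and the MAD-based Huber weights below are all assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Pinhole intrinsics and synthetic LiDAR feature points in the camera frame.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
points_cam = rng.uniform([-1.0, -1.0, 2.0], [1.0, 1.0, 6.0], size=(200, 3))

# Project to pixels and simulate observations: small noise plus a few outliers.
proj = (K @ points_cam.T).T
pixels_true = proj[:, :2] / proj[:, 2:3]
pixels_obs = pixels_true + rng.normal(0.0, 1.0, pixels_true.shape)
pixels_obs[:10] += rng.normal(0.0, 30.0, (10, 2))

# Reprojection residuals for each LiDAR-image feature correspondence.
residuals = np.linalg.norm(pixels_obs - pixels_true, axis=1)

# Adaptive weights from the residual distribution: costs far in the tail are
# down-weighted (Huber-style weights with a robust MAD scale estimate).
mad = np.median(np.abs(residuals - np.median(residuals)))
k = 1.345 * max(1.4826 * mad, 1e-6)
weights = np.where(residuals <= k, 1.0, k / residuals)

print(f"unweighted cost: {np.sum(residuals**2):10.1f}")
print(f"weighted cost:   {np.sum(weights * residuals**2):10.1f}")
```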
Empowering smart app development with SolidGPT: an edge-cloud hybrid AI agent framework
Positive · Artificial Intelligence
SolidGPT, an open-source edge-cloud hybrid AI agent framework, has been introduced to enhance mobile and software development workflows by integrating Large Language Models (LLMs) while addressing semantic awareness, developer productivity, and data privacy concerns. The tool allows developers to interactively query their codebases and automate project workflows, significantly improving efficiency.
AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models
Neutral · Artificial Intelligence
AraLingBench has been introduced as a human-annotated benchmark aimed at evaluating the Arabic linguistic capabilities of large language models (LLMs), covering grammar, morphology, spelling, reading comprehension, and syntax through 150 expert-designed questions. The evaluation of 35 Arabic and bilingual LLMs indicates a disparity between high performance on knowledge-based benchmarks and true linguistic understanding, with many models relying on memorization rather than comprehension.
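As a hedged sketch of how per-category scoring on such a benchmark can be organized, the snippet below computes accuracy per linguistic category for a multiple-choice item format. The item schema and the answer_question() callable are assumptions for illustration, not AraLingBench's actual data format or evaluation harness.

```python
# A generic per-category accuracy harness for a multiple-choice benchmark;
# the item schema and the answer_question() callable are assumptions for
# illustration, not AraLingBench's actual format or evaluation code.
from collections import defaultdict

def score_by_category(items, answer_question):
    """items: dicts with 'category', 'question', 'choices', and gold 'answer' index."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        pred = answer_question(item["question"], item["choices"])  # model under test
        total[item["category"]] += 1
        correct[item["category"]] += int(pred == item["answer"])
    return {cat: correct[cat] / total[cat] for cat in total}

# Toy usage with a placeholder "model" that always picks the first choice.
items = [
    {"category": "grammar", "question": "...", "choices": ["A", "B", "C", "D"], "answer": 0},
    {"category": "morphology", "question": "...", "choices": ["A", "B", "C", "D"], "answer": 2},
]
print(score_by_category(items, lambda question, choices: 0))
```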
OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Languages and Modalities
Positive · Artificial Intelligence
OMNIGUARD presents a novel approach to AI safety moderation, improving the detection of harmful prompts across languages and modalities and addressing the vulnerability of large language models (LLMs) to misuse. The method improves classification accuracy by 11.57% over existing baselines, marking a significant advancement in AI safety moderation.
Guiding WaveMamba with Frequency Maps for Image Debanding
Positive · Artificial Intelligence
A new method for image debanding has been proposed, utilizing the Wavelet State Space Model and frequency masking maps to effectively reduce banding artifacts in images, particularly in smooth areas like skies. This technique has shown promising results in suppressing banding compared to existing methods, achieving a DBI value of 0.082 on the BAND-2k dataset.