LLM generation novelty through the lens of semantic similarity

arXiv — cs.LG | Wednesday, January 14, 2026 at 5:00:00 AM
  • A recent study introduces a framework for evaluating generation novelty in large language models (LLMs) by framing it as a semantic retrieval problem over pre-training data. This reframing enables efficient analysis of pre-training corpora and addresses a limitation of existing evaluations, which typically rely on lexical overlap and therefore miss paraphrased reuse. Applied to the SmolLM model family, the framework reveals that these models reuse longer sequences from their pre-training data than previously reported (a minimal illustrative sketch of retrieval-based novelty scoring appears after this list).
  • This development is significant for Hugging Face and the broader AI community as it enhances the understanding of LLMs' generalization capabilities. By improving the measurement of generation novelty, the framework can lead to better model training and evaluation practices, ultimately contributing to advancements in AI applications.
  • The introduction of this framework aligns with ongoing efforts to refine evaluation metrics for generative models. As the field evolves, there is growing emphasis on methodologies that can accurately assess model performance across tasks such as interactive story generation and long-context reasoning, underscoring the need for standardized evaluation approaches.
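
The sketch below illustrates the general idea of scoring novelty via semantic retrieval rather than lexical overlap: embed a generated span, retrieve its nearest neighbour in an embedded pre-training corpus, and treat low maximum similarity as evidence of novelty. The embedding model ("all-MiniLM-L6-v2"), the toy corpus, and the similarity threshold are illustrative assumptions, not the paper's actual setup or the SmolLM pre-training data.

```python
# Minimal sketch: generation novelty framed as semantic retrieval.
# A generated span counts as "novel" if its nearest neighbour in the
# pre-training corpus falls below a similarity threshold.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Stand-in for the pre-training corpus. In practice this would be an
# approximate-nearest-neighbour index over billions of passages.
pretraining_passages = [
    "The quick brown fox jumps over the lazy dog.",
    "Large language models are trained on web-scale text corpora.",
    "Paris is the capital of France.",
]
corpus_emb = model.encode(pretraining_passages, normalize_embeddings=True)


def novelty_score(generated_span: str, threshold: float = 0.8) -> tuple[float, bool]:
    """Return (max similarity to the corpus, whether the span counts as novel)."""
    query_emb = model.encode([generated_span], normalize_embeddings=True)
    # Cosine similarity reduces to a dot product on normalised embeddings.
    max_sim = float(np.max(corpus_emb @ query_emb.T))
    return max_sim, max_sim < threshold


sim, is_novel = novelty_score("LLMs learn from enormous collections of internet text.")
print(f"max similarity to corpus: {sim:.3f}, novel: {is_novel}")
```

Unlike exact n-gram matching, this retrieval-based check also flags paraphrased reuse, which is why a semantic framing can surface longer effective matches than lexical-overlap metrics report.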
— via World Pulse Now AI Editorial System
