On the generalization of language models from in-context learning and finetuning: a controlled study

arXiv — cs.LG · Wednesday, November 12, 2025 at 5:00:00 AM
The study published on arXiv investigates how large language models generalize, highlighting their impressive capabilities alongside significant limitations when new information is introduced through fine-tuning. Fine-tuned models can fail to generalize to simple reversals of the relations they were trained on, or to basic logical deductions from that information, which severely limits their reasoning. In contrast, in-context learning (ICL) exhibits different inductive biases and, in many cases, more flexible generalization. The researchers constructed novel datasets and exposed pretrained models to controlled subsets of the information through either ICL or fine-tuning, so that the two routes could be compared directly. Their findings indicate that ICL generalizes several kinds of inferences, such as reversals and deductions, more effectively than fine-tuning, motivating further work on closing this gap to improve the reasoning capabilities of language models.
— via World Pulse Now AI Editorial System
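As an illustration of the contrast the study draws, the sketch below builds a toy relational-reversal probe and shows how the same fact would be exposed under the two conditions: placed in the prompt for ICL, or turned into a training example for fine-tuning. The fact, query, and helper functions are hypothetical stand-ins, not the paper's datasets or evaluation code.

```python
# Minimal sketch (hypothetical data and helpers, not the paper's actual setup):
# construct a relational-reversal probe and expose the fact either as an
# in-context demonstration or as a fine-tuning example.

FACT = "Tom Verner is the father of Anna Verner."   # direction shown to the model
REVERSAL_QUERY = "Who is Anna Verner's father?"     # held-out reversed direction
EXPECTED_ANSWER = "Tom Verner"

def build_icl_prompt(fact: str, query: str) -> str:
    """ICL condition: the fact is placed in the context window at test time."""
    return f"Fact: {fact}\nQuestion: {query}\nAnswer:"

def build_finetune_example(fact: str) -> dict:
    """Fine-tuning condition: the fact becomes a training example; the reversal
    is only ever seen as a test query, never as a training target."""
    return {"text": fact}

def score_answer(model_output: str, expected: str) -> bool:
    """Exact-match scoring of a model completion against the expected entity."""
    return expected.lower() in model_output.lower()

if __name__ == "__main__":
    print(build_icl_prompt(FACT, REVERSAL_QUERY))
    print(build_finetune_example(FACT))
```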


Recommended Readings
Understanding InfoNCE: Transition Probability Matrix Induced Feature Clustering
Positive · Artificial Intelligence
The article discusses InfoNCE, a key objective in contrastive learning, which is vital for unsupervised representation learning in various domains such as vision, language, and graphs. The authors introduce a transition probability matrix to model data augmentation dynamics and propose a new loss function, Scaled Convergence InfoNCE (SC-InfoNCE), which allows for flexible control over feature similarity alignment. This work aims to enhance the theoretical understanding of InfoNCE and its practical applications in machine learning.
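For reference, the sketch below implements the standard InfoNCE objective that the paper builds on: each query's matching key is the positive, and the rest of the batch serves as negatives. It is a baseline illustration only; the paper's SC-InfoNCE scaling and transition-probability-matrix analysis are not reproduced here, and the temperature value is an arbitrary choice.

```python
import torch
import torch.nn.functional as F

def info_nce(queries: torch.Tensor, keys: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Standard InfoNCE: the i-th query's positive is the i-th key; all other
    keys in the batch act as negatives."""
    q = F.normalize(queries, dim=-1)
    k = F.normalize(keys, dim=-1)
    logits = q @ k.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)    # positives on the diagonal
    return F.cross_entropy(logits, targets)

# usage with random embeddings standing in for two augmented views
q = torch.randn(8, 128)
k = torch.randn(8, 128)
loss = info_nce(q, k)
```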
Optimal Self-Consistency for Efficient Reasoning with Large Language Models
Positive · Artificial Intelligence
The paper titled 'Optimal Self-Consistency for Efficient Reasoning with Large Language Models' presents a comprehensive analysis of self-consistency (SC) as a technique for enhancing performance in chain-of-thought reasoning. SC involves generating multiple responses from a large language model (LLM) and selecting the most frequent answer. The study addresses the high costs associated with SC when applied at scale and introduces Blend-ASC, a novel variant aimed at improving sample efficiency and scaling behavior.
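A minimal sketch of vanilla self-consistency follows: sample several completions, extract each final answer, and take the majority vote. The `extract_final_answer` helper and its "Answer:" convention are assumptions for illustration, and Blend-ASC's sample-allocation strategy is not shown since the summary does not specify it.

```python
from collections import Counter

def extract_final_answer(completion: str) -> str:
    """Hypothetical extraction: assume completions end with 'Answer: <value>'."""
    return completion.rsplit("Answer:", 1)[-1].strip()

def self_consistency_answer(samples: list[str]) -> str:
    """Vanilla self-consistency: take the most frequent final answer across
    several sampled chain-of-thought completions."""
    answers = [extract_final_answer(s) for s in samples]
    return Counter(answers).most_common(1)[0][0]

# usage with toy completions standing in for sampled LLM outputs
samples = [
    "...so the total is 12. Answer: 12",
    "...therefore Answer: 11",
    "...which gives Answer: 12",
]
print(self_consistency_answer(samples))  # -> "12"
```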
On the Entropy Calibration of Language Models
Neutral · Artificial Intelligence
The study on entropy calibration of language models investigates whether the entropy of a model's text generation aligns with its log loss on human text. Previous findings indicate that models often exhibit miscalibration, where entropy increases and text quality declines with longer generations. This paper explores whether scaling can improve miscalibration and if calibration can be achieved without trade-offs, focusing on the relationship between dataset size and miscalibration behavior.
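The sketch below makes the two quantities being compared concrete: the average entropy of the model's per-step predictive distributions versus its average log loss on the human-written tokens. Calibration, in this sense, means the two match on average. This is a toy NumPy illustration of the definitions, not the paper's measurement pipeline.

```python
import numpy as np

def entropy_and_log_loss(step_probs: np.ndarray, human_token_ids: np.ndarray):
    """step_probs: (T, V) predictive distributions; human_token_ids: (T,) actual tokens.
    Returns (mean entropy of the model's distributions, mean log loss on human text)."""
    eps = 1e-12
    p = np.clip(step_probs, eps, 1.0)
    entropy = -(p * np.log(p)).sum(axis=-1).mean()        # H = -sum_v p_v log p_v
    log_loss = -np.log(p[np.arange(len(human_token_ids)), human_token_ids]).mean()
    return entropy, log_loss

# usage: 5 steps over a toy 4-token vocabulary; a uniform model is perfectly
# calibrated here, with entropy == log loss == log 4
probs = np.full((5, 4), 0.25)
human = np.array([0, 1, 2, 3, 0])
print(entropy_and_log_loss(probs, human))
```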
Learn to Select: Exploring Label Distribution Divergence for In-Context Demonstration Selection in Text Classification
Positive · Artificial Intelligence
The article discusses a novel approach to in-context learning (ICL) for text classification, emphasizing the importance of selecting appropriate demonstrations. Traditional methods often prioritize semantic similarity, neglecting label distribution alignment, which can impact performance. The proposed method, TopK + Label Distribution Divergence (L2D), utilizes a fine-tuned BERT-like small language model to generate label distributions and assess their divergence. This dual focus aims to enhance the effectiveness of demonstration selection in large language models (LLMs).
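A rough sketch of the two-stage idea follows, under the assumption that selection works by top-k semantic retrieval followed by re-ranking with a KL-style divergence between label distributions; the summary does not pin down the exact divergence or ranking rule, and the label distributions are assumed to come from the fine-tuned small model mentioned above.

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def select_demonstrations(query_emb, demo_embs, demo_label_dists, query_label_dist, k=8, m=4):
    """Two-stage selection: (1) retrieve the top-k demonstrations by cosine
    similarity, (2) re-rank them by how little their label distribution
    diverges from the query's predicted label distribution, keeping m."""
    sims = demo_embs @ query_emb / (
        np.linalg.norm(demo_embs, axis=1) * np.linalg.norm(query_emb) + 1e-12)
    topk = np.argsort(-sims)[:k]
    divergences = [kl_divergence(query_label_dist, demo_label_dists[i]) for i in topk]
    return [int(topk[j]) for j in np.argsort(divergences)[:m]]

# usage with random stand-ins for embeddings and label distributions
rng = np.random.default_rng(0)
demos = rng.normal(size=(100, 384))           # demonstration pool embeddings
dists = rng.dirichlet(np.ones(3), size=100)   # per-demonstration label distributions
picked = select_demonstrations(rng.normal(size=384), demos, dists, np.array([0.6, 0.3, 0.1]))
```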
Studies with impossible languages falsify LMs as models of human language
Neutral · Artificial Intelligence
A study published on arXiv examines what the learning behavior of infants and language models (LMs) on attested versus impossible languages reveals. Earlier work suggested that, like infants, LMs find attested languages easier to learn than languages with unnatural structures. This study, however, finds that LMs can learn many impossible languages as effectively as attested ones, and that where LMs do struggle, the complexity of those languages, rather than their impossibility, is the cause. The authors conclude that LMs lack the human inductive biases essential for language acquisition, undermining their use as models of human language learning.
Thinker: Training LLMs in Hierarchical Thinking for Deep Search via Multi-Turn Interaction
Positive · Artificial Intelligence
The article presents Thinker, a hierarchical thinking model designed to enhance the reasoning capabilities of large language models (LLMs) through multi-turn interactions. Unlike previous methods that relied on end-to-end reinforcement learning without supervision, Thinker allows for a more structured reasoning process by breaking down complex problems into manageable sub-problems. Each sub-problem is represented in both natural language and logical functions, improving the coherence and rigor of the reasoning process.
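To make the dual representation concrete, the sketch below defines a hypothetical sub-problem node that carries both a natural-language statement and a logical-function form, with children for further decomposition. Thinker's actual schema and training procedure are not described in the summary, so this is purely illustrative.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class SubProblem:
    """One node in a hierarchical decomposition: the sub-problem stated in
    natural language plus a logical-function form, with optional children."""
    statement: str                                   # natural-language description
    logical_form: str                                # e.g. "Advisor(Author(paper))"
    children: list[SubProblem] = field(default_factory=list)
    answer: str | None = None

# usage: a toy multi-hop question decomposed into two sub-problems
root = SubProblem(
    statement="Who advised the first author of the 2017 attention paper?",
    logical_form="Advisor(Author(Paper('Attention Is All You Need')))",
    children=[
        SubProblem("Find the paper's first author.", "Author(Paper(...))"),
        SubProblem("Find that author's doctoral advisor.", "Advisor(person)"),
    ],
)
```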
Are language models rational? The case of coherence norms and belief revision
Neutral · Artificial Intelligence
The paper titled 'Are language models rational? The case of coherence norms and belief revision' explores the application of rationality norms, specifically coherence norms, to language models. It distinguishes between logical coherence norms and those related to the strength of belief. The authors introduce the Minimal Assent Connection (MAC), a new framework for understanding credence in language models based on internal token probabilities. The findings suggest that while some language models adhere to these rational norms, others do not, raising important questions about AI behavior and safety.
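As a toy illustration of reading credence off internal token probabilities, the sketch below renormalizes the probability mass a model places on "Yes" versus "No" when asked whether it assents to a statement. This is not the paper's Minimal Assent Connection definition, only a simple probability-based stand-in.

```python
import math

def assent_credence(token_logprobs: dict[str, float]) -> float:
    """Toy credence: given next-token log probabilities for the answer to
    'Do you assent to this statement?', renormalize over 'Yes' and 'No'.
    A stand-in illustration, not the MAC framework itself."""
    p_yes = math.exp(token_logprobs.get("Yes", float("-inf")))
    p_no = math.exp(token_logprobs.get("No", float("-inf")))
    return p_yes / (p_yes + p_no)

# usage with made-up log probabilities for the two answer tokens
print(assent_credence({"Yes": -0.22, "No": -1.61}))  # ~0.80
```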
Identifying and Analyzing Performance-Critical Tokens in Large Language Models
Neutral · Artificial Intelligence
The paper titled 'Identifying and Analyzing Performance-Critical Tokens in Large Language Models' explores how large language models (LLMs) rely on in-context learning (ICL) for few-shot learning. It categorizes the tokens in ICL prompts into content, stopword, and template tokens in order to identify those that most affect LLM performance. The study finds that template and stopword tokens influence performance more than the informative content tokens do, challenging the assumption, carried over from how humans read, that the informative words are the ones that matter most.
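The sketch below illustrates the three-way token taxonomy the summary describes, splitting a prompt into template, stopword, and content tokens so each class could be ablated separately. The token lists and whitespace splitting are simplifications, not the paper's tokenization or ablation procedure.

```python
# Toy token taxonomy: classify each whitespace-separated token of an ICL prompt
# as a template token, a stopword, or a content token.
STOPWORDS = {"the", "a", "an", "is", "of", "to", "and", "in"}
TEMPLATE_TOKENS = {"Review:", "Sentiment:", "Question:", "Answer:"}

def categorize(prompt: str) -> dict[str, list[str]]:
    buckets = {"template": [], "stopword": [], "content": []}
    for tok in prompt.split():
        if tok in TEMPLATE_TOKENS:
            buckets["template"].append(tok)
        elif tok.lower() in STOPWORDS:
            buckets["stopword"].append(tok)
        else:
            buckets["content"].append(tok)
    return buckets

# usage on a toy sentiment-classification demonstration
print(categorize("Review: the movie is wonderful Sentiment: positive"))
```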