Influence Functions for Efficient Data Selection in Reasoning

arXiv (cs.LG) · Tuesday, December 2, 2025 at 5:00:00 AM
  • A recent study proposes influence functions as a method for efficient data selection in reasoning tasks, specifically for fine-tuning large language models (LLMs) on chain-of-thought (CoT) data. The approach aims to define data quality by each example's estimated effect on model performance rather than by heuristics such as problem difficulty or trace length. In math reasoning tasks, influence-based pruning is reported to outperform existing selection methods.
  • The work addresses the problem of identifying high-quality training data for LLMs, which can yield better performance from smaller datasets. Influence functions estimate how individual examples affect model accuracy, giving researchers a more principled basis for data selection.
  • The work fits into ongoing efforts to improve reasoning in LLMs, alongside studies of adaptive reasoning lengths and multimodal reasoning. Together these reflect a growing recognition that data quality and selection methods matter for model performance.
— via World Pulse Now AI Editorial System
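
To make the idea of influence-based pruning concrete, here is a minimal sketch on a toy problem. It is not the study's method: the summary above does not say how the authors compute influence for LLMs, so this uses the classic first-order influence-function formula on a small logistic regression, where the Hessian can be inverted exactly. All names (`make_toy_data`, `fit_logreg`, `helpfulness`, `prune`), the 20% label-noise setup, and the 80% keep fraction are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_toy_data(seed=0, n_train=400, n_val=400, dim=5, noise_rate=0.2):
    """Toy binary task; some training labels are flipped to mimic low-quality data."""
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=dim)
    X_tr = rng.normal(size=(n_train, dim))
    X_val = rng.normal(size=(n_val, dim))
    y_clean = (X_tr @ w_true > 0).astype(float)
    y_val = (X_val @ w_true > 0).astype(float)
    flipped = rng.random(n_train) < noise_rate
    y_tr = np.where(flipped, 1.0 - y_clean, y_clean)
    return X_tr, y_tr, X_val, y_val, flipped

def hessian(X, w, l2):
    """Hessian of the mean L2-regularized logistic loss at w."""
    p = sigmoid(X @ w)
    return (X.T * (p * (1.0 - p))) @ X / len(X) + l2 * np.eye(X.shape[1])

def fit_logreg(X, y, l2=1e-2, steps=25):
    """Fit L2-regularized logistic regression with Newton's method."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(X) + l2 * w
        w -= np.linalg.solve(hessian(X, w, l2), grad)
    return w

def helpfulness(X_tr, y_tr, X_val, y_val, w, l2=1e-2):
    """Per-example score g_val^T H^{-1} g_i.

    This is the negated first-order influence of upweighting example i on the
    validation loss, so a higher score means the example is estimated to help.
    """
    g_tr = X_tr * (sigmoid(X_tr @ w) - y_tr)[:, None]           # per-example loss gradients
    g_val = X_val.T @ (sigmoid(X_val @ w) - y_val) / len(X_val)  # mean validation gradient
    return g_tr @ np.linalg.solve(hessian(X_tr, w, l2), g_val)

def prune(scores, keep_fraction=0.8):
    """Return indices of the highest-scoring examples."""
    k = int(round(keep_fraction * len(scores)))
    return np.argsort(-scores)[:k]

# Demo: score the training set, keep the top 80%, and compare noise levels.
X_tr, y_tr, X_val, y_val, flipped = make_toy_data()
w = fit_logreg(X_tr, y_tr)
scores = helpfulness(X_tr, y_tr, X_val, y_val, w)
keep = prune(scores, keep_fraction=0.8)
print("flipped share before pruning:", flipped.mean())
print("flipped share after pruning: ", flipped[keep].mean())
```

In this toy setting the score depends only on how an example is estimated to change validation loss, not on a hand-picked proxy such as difficulty or length. For LLM-scale models the exact Hessian solve used here is not feasible and would have to be replaced by some approximation, which this sketch does not attempt.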

Continue Reading
LLMs choose friends and colleagues like people, researchers find
Positive · Artificial Intelligence
Researchers have found that large language models (LLMs) make decisions about networking and friendship in ways that closely resemble human behavior, both in synthetic simulations and real-world contexts. This suggests that LLMs can replicate social decision-making processes similar to those of people.
AI’s Wrong Answers Are Bad. Its Wrong Reasoning Is Worse
Negative · Artificial Intelligence
Recent studies reveal that while AI, particularly generative AI, has improved in accuracy, its flawed reasoning processes pose significant risks in critical sectors such as healthcare, law, and education. These findings highlight the need for a deeper understanding of AI's decision-making mechanisms.
An Interdisciplinary and Cross-Task Review on Missing Data Imputation
Neutral · Artificial Intelligence
A comprehensive review on missing data imputation highlights the challenges posed by incomplete datasets across various fields, including healthcare and e-commerce. The study synthesizes decades of research, categorizing imputation methods from classical techniques to modern machine learning approaches, emphasizing the need for a unified framework to address missingness mechanisms and imputation goals.
Adaptive Margin RLHF via Preference over Preferences
Positive · Artificial Intelligence
A new approach in reinforcement learning from human feedback (RLHF) has been proposed, focusing on adaptive margin optimization through modeling preferences over preferences. This method aims to enhance generalization and robustness in classification tasks by addressing the limitations of existing margin-based optimization techniques, which often overlook the varying strengths of preferences.
Emergent Riemannian geometry over learning discrete computations on continuous manifolds
Neutral · Artificial Intelligence
A recent study has revealed insights into how neural networks learn to perform discrete computations on continuous data manifolds, specifically through the lens of Riemannian geometry. The research indicates that as neural networks learn, they develop a representational geometry that allows for the discretization of continuous input features and the execution of logical operations on these features.
Challenges of Heterogeneity in Big Data: A Comparative Study of Classification in Large-Scale Structured and Unstructured Domains
Neutral · Artificial Intelligence
A recent study investigates the challenges posed by heterogeneity in Big Data, focusing on classification strategies in both structured (Epsilon) and unstructured (Rest-Mex, IMDB) domains. Utilizing evolutionary and Bayesian optimization methods, the research highlights a 'complexity paradox' where simpler models often outperform complex ones in specific contexts.
Stay Unique, Stay Efficient: Preserving Model Personality in Multi-Task Merging
Positive · Artificial Intelligence
A new framework called Decomposition, Thresholding, and Scaling (DTS) has been proposed to enhance model merging for multi-task capabilities while preserving task-specific information. This method utilizes singular value decomposition to retain essential singular values and vectors, minimizing storage overhead and improving performance compared to traditional merging techniques.
From Topology to Retrieval: Decoding Embedding Spaces with Unified Signatures
Positive · Artificial Intelligence
A comprehensive analysis of text embedding models has been conducted, revealing the organization of embeddings in space and their impact on model interpretability and downstream task performance. The study introduces Unified Topological Signatures (UTS), a framework that characterizes embedding spaces and predicts model-specific properties, linking topological structure to document retrievability.