Evaluating LLMs' Reasoning Over Ordered Procedural Steps

arXiv — cs.LG · Tuesday, November 18, 2025 at 5:00:00 AM
  • The study investigates how large language models (LLMs) reconstruct ordered procedural sequences, using food recipes — where correct step order is essential to success — as the test domain. The evaluation draws on a curated dataset and several ordering metrics to assess model performance under different conditions.
  • The research matters because it pinpoints a concrete limitation of LLM reasoning: performance degrades as sequence length increases, a key concern for applications that demand precise procedural understanding.
  • The findings feed into ongoing discussions about the reliability and adaptability of LLMs on reasoning tasks, underscoring broader concerns about their behavior in complex scenarios and the need for frameworks that strengthen their reasoning capabilities.
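The bullets above mention ordering metrics. A minimal sketch of one standard choice for comparing a predicted step order against the gold order is Kendall's tau over pairwise step orderings; the recipe step names below are hypothetical illustrations, not items from the paper's dataset:

```python
from itertools import combinations

def kendall_tau(gold, pred):
    """Kendall's tau between a gold step order and a predicted order.

    Both arguments are permutations of the same step identifiers.
    tau = 1.0 means identical ordering, -1.0 means fully reversed.
    """
    pos = {step: i for i, step in enumerate(pred)}
    concordant = discordant = 0
    for a, b in combinations(gold, 2):
        # gold places a before b; check whether pred agrees
        if pos[a] < pos[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# A 4-step recipe with two adjacent steps swapped by the model:
gold = ["preheat", "mix", "bake", "cool"]
pred = ["mix", "preheat", "bake", "cool"]
print(kendall_tau(gold, pred))  # 5 of 6 pairs agree -> (5-1)/6 ≈ 0.667
```

One swap flips only a single pairwise ordering, so tau stays high; longer sequences admit many more discordant pairs, which is one way the length-degradation effect described above becomes measurable.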
— via World Pulse Now AI Editorial System


Recommended Readings
Codebook-Centric Deep Hashing: End-to-End Joint Learning of Semantic Hash Centers and Neural Hash Function
Positive · Artificial Intelligence
The article presents a novel approach to deep hashing called Center-Reassigned Hashing (CRH), which enhances traditional methods by dynamically reassigning hash centers from a preset codebook. This end-to-end framework optimizes the hash function while avoiding the inefficiencies of local similarity optimization and the complexities of two-stage methods. By adapting hash centers to data distribution without explicit optimization phases, CRH aims to improve performance and streamline the learning process in semantic hashing.
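As an illustrative sketch of the center-reassignment idea — not CRH's actual training objective — one can picture each class being reassigned to the row of a preset binary codebook nearest to its current code in Hamming distance. All names and data here are hypothetical:

```python
import random

def hamming(a, b):
    """Hamming distance between two equal-length binary codes."""
    return sum(x != y for x, y in zip(a, b))

def reassign_centers(class_codes, codebook):
    """For each class, pick the index of the codebook row with minimum
    Hamming distance to that class's current binary code."""
    return [min(range(len(codebook)), key=lambda j: hamming(code, codebook[j]))
            for code in class_codes]

random.seed(0)
n_bits = 16
codebook = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(8)]
class_codes = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(3)]
print(reassign_centers(class_codes, codebook))
```

In the end-to-end framework the summary describes, this kind of reassignment would happen dynamically as the hash function trains, rather than as a separate optimization stage.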
Linear time small coresets for k-mean clustering of segments with applications
Positive · Artificial Intelligence
This study addresses the k-means clustering problem for a set of segments in Euclidean space, focusing on finding k centers that minimize the total distance from each point along a segment to a center. The research introduces the first coreset construction that effectively handles arbitrary input segments, allowing for efficient computation in various contexts. The findings have implications for applications such as real-time video tracking and clustering in high-dimensional spaces.
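The segment-to-center cost the summary describes integrates point-to-center distance along each segment. A naive sampling-based sketch of that cost in 2-D (the coreset construction exists precisely to avoid this kind of brute-force computation over every segment):

```python
import math

def segment_cost(p, q, center, n_samples=1000):
    """Approximate the integral of Euclidean distance from points along
    the segment p->q to a center, via a midpoint Riemann sum."""
    total = 0.0
    for i in range(n_samples):
        t = (i + 0.5) / n_samples
        x = p[0] + t * (q[0] - p[0])
        y = p[1] + t * (q[1] - p[1])
        total += math.hypot(x - center[0], y - center[1])
    # mean distance along the segment, scaled by segment length
    length = math.hypot(q[0] - p[0], q[1] - p[1])
    return (total / n_samples) * length

# Center at the midpoint of a unit segment: mean distance is 0.25.
print(segment_cost((0.0, 0.0), (1.0, 0.0), (0.5, 0.0)))
```

A coreset replaces the full segment set with a small weighted subset whose k-means cost approximates the original for every choice of centers, which is what makes real-time applications such as video tracking tractable.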
QA-Noun: Representing Nominal Semantics via Natural Language Question-Answer Pairs
Positive · Artificial Intelligence
The paper titled 'QA-Noun: Representing Nominal Semantics via Natural Language Question-Answer Pairs' introduces a new framework called QA-Noun, aimed at capturing noun-centered semantic relations. This framework uses nine question templates to address both explicit and implicit roles of nouns, producing interpretable question-answer pairs that complement existing verbal QA-SRL methods. The authors provide a dataset of over 2,000 annotated noun mentions and a trained model that integrates with QA-SRL, achieving broad coverage of noun arguments and revealing additional contextual relations.
Nearest Neighbor Projection Removal Adversarial Training
Positive · Artificial Intelligence
Deep neural networks have shown remarkable capabilities in image classification but are susceptible to adversarial examples. Traditional adversarial training improves robustness but often overlooks inter-class feature overlap, which contributes to vulnerability. This study introduces a new adversarial training framework that reduces inter-class proximity by projecting out dependencies from both adversarial and clean samples in the feature space. The method enhances feature separability and theoretically lowers the Lipschitz constant of neural networks, improving generalization.
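A toy sketch of the projection idea: remove from a feature vector its component along the direction toward a nearby feature of another class, leaving only the orthogonal part. The 2-D example and variable names are hypothetical and simplify the paper's actual procedure:

```python
import math

def project_out(feature, direction):
    """Remove from `feature` its component along `direction`
    (here, a vector toward a nearest other-class feature)."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    coeff = sum(f * u for f, u in zip(feature, unit))
    return [f - coeff * u for f, u in zip(feature, unit)]

f = [3.0, 4.0]
nearest_other = [1.0, 0.0]  # hypothetical nearest feature from another class
g = project_out(f, nearest_other)
print(g)  # x-component removed -> [0.0, 4.0]
```

The result is orthogonal to the removed direction, which is the geometric sense in which inter-class proximity — and with it one source of adversarial vulnerability — is reduced.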
Silenced Biases: The Dark Side LLMs Learned to Refuse
Negative · Artificial Intelligence
Safety-aligned large language models (LLMs) are increasingly used in sensitive applications where fairness is crucial. Evaluating their fairness is complex, often relying on standard question-answer methods that misinterpret refusal responses as indicators of fairness. This paper introduces the concept of silenced biases, which are unfair preferences hidden within the models' latent space, masked by safety-alignment. Previous methods have limitations, prompting the need for new approaches to uncover these biases effectively.
On the Entropy Calibration of Language Models
Neutral · Artificial Intelligence
The paper examines entropy calibration in language models, focusing on whether their entropy aligns with log loss on human text. Previous studies indicated that as text generation lengthens, entropy increases while text quality declines, highlighting a fundamental issue in autoregressive models. The authors investigate whether miscalibration can improve with scale and if calibration without tradeoffs is theoretically feasible, analyzing the scaling behavior concerning dataset size and power law exponents.
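Entropy calibration can be illustrated on a toy next-token distribution: a model is entropy-calibrated when its entropy matches its expected log loss on the text it is scored against, and the two coincide exactly when that text is drawn from the model's own distribution. A minimal sketch:

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def log_loss(p, observed_index):
    """Negative log-likelihood of the observed token under p."""
    return -math.log(p[observed_index])

# Toy next-token distribution over a 3-word vocabulary.
p = [0.7, 0.2, 0.1]
h = entropy(p)
# Expected log loss if the text were drawn from p itself:
expected_ll = sum(pi * log_loss(p, i) for i, pi in enumerate(p))
print(h, expected_ll)  # equal: the model is calibrated to its own samples
```

The miscalibration the paper studies is the gap between these two quantities on human text, which the summary notes tends to widen as generation lengthens.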
Larger Datasets Can Be Repeated More: A Theoretical Analysis of Multi-Epoch Scaling in Linear Regression
Neutral · Artificial Intelligence
This paper presents a theoretical analysis of data scaling laws in linear regression, particularly focusing on the effects of training on limited datasets over multiple epochs. It investigates how much larger a dataset must be to achieve the same performance as training on a smaller dataset for multiple epochs. The study introduces the concept of the effective reuse rate, which quantifies the necessary dataset growth for one-pass training to match the test loss of multi-epoch training.
On the Limitations of Language Targeted Pruning: Investigating the Calibration Language Impact in Multilingual LLM Pruning
Neutral · Artificial Intelligence
Recent advancements in large language model (LLM) pruning have demonstrated state-of-the-art compression results without the need for post-training or retraining, while still maintaining high predictive performance. However, prior research predominantly focused on English text for calibration, overlooking the multilingual capabilities of modern LLMs. This paper presents a comprehensive empirical study analyzing the effects of different calibration languages on pruning multilingual models, revealing significant insights into performance and internal representation changes.
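A minimal sketch of why the calibration language matters, using a Wanda-style score |w| · ‖x‖ as a stand-in for the calibration-based pruning criteria studied: the activation norms come from the calibration text, so a different language can change which weights survive. The weight values and norms below are hypothetical:

```python
def prune_row(weights, activation_norms, sparsity=0.5):
    """Score each weight by |w| * ||x|| (input-channel activation norm
    measured on calibration text) and zero the lowest-scored fraction."""
    scores = [abs(w) * a for w, a in zip(weights, activation_norms)]
    k = int(len(weights) * sparsity)
    threshold = sorted(scores)[k - 1] if k else float("-inf")
    return [0.0 if s <= threshold else w for w, s in zip(weights, scores)]

weights = [0.9, -0.1, 0.4, -0.8]
norms_en = [1.0, 1.0, 1.0, 1.0]   # hypothetical English-calibration norms
norms_de = [0.2, 4.0, 1.5, 0.6]   # hypothetical German-calibration norms
print(prune_row(weights, norms_en))  # -> [0.9, 0.0, 0.0, -0.8]
print(prune_row(weights, norms_de))  # -> [0.0, 0.0, 0.4, -0.8]
```

Identical weights, identical sparsity — but the surviving mask differs with the calibration language, which is the effect the paper measures at the scale of full multilingual models.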