LLMs’ impact on science: Booming publications, stagnating quality

Ars Technica — Thursday, December 18, 2025 at 8:54:36 PM
  • Recent studies indicate that the rise of large language models (LLMs) has led to an increase in the number of published research papers, yet the quality of these publications remains stagnant. Researchers are increasingly relying on LLMs for their work, which raises concerns about the depth and rigor of scientific inquiry.
  • This trend is troubling for the academic community, as the proliferation of low-quality research could undermine the credibility of scientific literature. The reliance on LLMs may result in a superficial understanding of complex topics, impacting the overall advancement of knowledge.
  • The issue is compounded by findings that LLMs trained on low-quality data, such as superficial tweets, exhibit poor performance on critical benchmarks. Additionally, their struggles in sensitive applications like mental health care highlight the limitations of current models, suggesting a need for more robust training methodologies and ethical considerations in their deployment.
— via World Pulse Now AI Editorial System

Continue Reading
3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model
Positive | Artificial Intelligence
The introduction of 3DLLM-Mem marks a significant advancement in the capabilities of Large Language Models (LLMs) by integrating long-term spatial-temporal memory for enhanced reasoning in dynamic 3D environments. This model is evaluated using the 3DMem-Bench, which includes over 26,000 trajectories and 2,892 tasks designed to test memory utilization in complex scenarios.
INFORM-CT: INtegrating LLMs and VLMs FOR Incidental Findings Management in Abdominal CT
Positive | Artificial Intelligence
A novel framework named INFORM-CT has been proposed to enhance the management of incidental findings in abdominal CT scans by integrating large language models (LLMs) and vision-language models (VLMs). This approach automates the detection, classification, and reporting processes, significantly improving efficiency compared to traditional manual inspections by radiologists.
Adversarial versification in Portuguese as a jailbreak operator in LLMs
Neutral | Artificial Intelligence
Recent research indicates that versifying prompts is an effective adversarial mechanism against aligned large language models (LLMs): poetic instructions lead to significantly more safety failures than prose. The study reports that manually crafted poems achieve an attack success rate of approximately 62%, while automated versions reach about 43%, with some models exceeding 90% in single-turn interactions.
Evaluating Metrics for Safety with LLM-as-Judges
Neutral | Artificial Intelligence
Large Language Models (LLMs) are being increasingly integrated into critical information processes, such as patient care and nuclear facility operations, raising concerns about their reliability and safety. The paper discusses the need for robust evaluation metrics to ensure LLMs can safely replace human roles in these contexts.
SoMe: A Realistic Benchmark for LLM-based Social Media Agents
Neutral | Artificial Intelligence
A new benchmark called SoMe has been introduced to evaluate large language model (LLM)-based social media agents, addressing the need for comprehensive assessment of their capabilities in understanding media content and user behavior. SoMe includes 8 tasks, over 9 million posts, and nearly 7,000 user profiles, making it a significant resource for researchers and developers in the field of AI and social media.
Revisiting the Reliability of Language Models in Instruction-Following
Neutral | Artificial Intelligence
Recent research highlights the limitations of advanced large language models (LLMs) in reliably following nuanced instructions, despite achieving high accuracy on benchmarks like IFEval. The study introduces a new metric, reliable@k, and reveals that performance can drop by up to 61.8% with subtle prompt variations.
TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation
Positive | Artificial Intelligence
The TaP framework has been introduced to automate and scale the generation of preference datasets for large language models (LLMs), addressing the challenges of resource-intensive dataset construction and the predominance of English datasets. This framework is based on a structured taxonomy that ensures diversity and comprehensive coverage in dataset composition.
LATTE: Learning Aligned Transactions and Textual Embeddings for Bank Clients
Positive | Artificial Intelligence
The paper presents LATTE, a novel contrastive learning framework designed to optimize the processing of historical communication sequences for bank clients by aligning raw event embeddings with semantic embeddings from large language models (LLMs). This approach significantly reduces computational costs and input sizes compared to traditional methods, making it more practical for real-world financial applications.
