GRAFT: GRaPH and Table Reasoning for Textual Alignment -- A Benchmark for Structured Instruction Following and Visual Reasoning

arXiv cs.LG • Wednesday, December 3, 2025 at 5:00:00 AM

Positive • Artificial Intelligence

GRAFT is a structured multimodal benchmark that evaluates how well large language models (LLMs) follow instructions, reason visually, and align text with visual data. The dataset consists of programmatically generated charts and tables, each paired with multi-step analytical questions that must be answered from the image alone. Responses are required in structured formats such as JSON or YAML, so that both the reasoning and adherence to the output format can be evaluated precisely.
The benchmark is significant because it provides a unified framework for assessing LLM capabilities in multimodal settings, where demand is growing for models that can integrate visual and textual information. By establishing a clear benchmark, GRAFT aims to support more reliable and better-performing LLMs on complex reasoning tasks.
GRAFT reflects a broader trend in AI research toward improving reasoning across modalities. It aligns with ongoing efforts to improve LLM performance on tasks such as spatial reasoning and abstract thinking, as seen in recent work like SpatialGeo and AbstRaL. Together, these efforts underline the importance of equipping LLMs to handle diverse and intricate reasoning challenges.
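
The structured-output evaluation described above can be illustrated with a minimal sketch. It assumes a hypothetical response schema with an `answer` field and a `reasoning_steps` list; these field names, the `score_response` helper, and the numeric tolerance are illustrative assumptions and are not taken from GRAFT itself. The idea is only to show how format adherence and answer correctness can be checked separately when a model must reply in JSON.

```python
import json

# Hypothetical schema: these field names are illustrative, not GRAFT's own.
REQUIRED_FIELDS = {"answer": (int, float, str), "reasoning_steps": list}


def score_response(raw_response: str, expected_answer, tolerance: float = 1e-6) -> dict:
    """Score a model reply on format adherence and answer correctness."""
    result = {"valid_json": False, "schema_ok": False, "answer_correct": False}

    # Step 1: the reply must parse as JSON at all.
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        return result
    result["valid_json"] = True

    # Step 2: the reply must be an object with the required, correctly typed fields.
    if not isinstance(parsed, dict):
        return result
    schema_ok = all(
        key in parsed and isinstance(parsed[key], types)
        for key, types in REQUIRED_FIELDS.items()
    )
    result["schema_ok"] = schema_ok
    if not schema_ok:
        return result

    # Step 3: compare the answer, allowing a small tolerance for numbers.
    answer = parsed["answer"]
    if isinstance(answer, (int, float)) and isinstance(expected_answer, (int, float)):
        result["answer_correct"] = abs(answer - expected_answer) <= tolerance
    else:
        result["answer_correct"] = answer == expected_answer
    return result


if __name__ == "__main__":
    reply = '{"answer": 42.0, "reasoning_steps": ["read bar heights", "sum values"]}'
    print(score_response(reply, 42))
```

Keeping the three checks separate means a reply that reasons correctly but breaks the required format is scored differently from one that is well-formed but wrong, which is the distinction a structured-output benchmark is meant to expose.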

— via World Pulse Now AI Editorial System
