KernelBand: Boosting LLM-based Kernel Optimization with a Hierarchical and Hardware-aware Multi-armed Bandit

arXiv — cs.LG · Tuesday, November 25, 2025 at 5:00:00 AM
  • KernelBand is a newly introduced framework that boosts LLM-based kernel optimization by framing it as a hierarchical multi-armed bandit problem. It lets LLM agents navigate the optimization space more effectively, using hardware profiling information and runtime-behavior clustering to guide the selection and application of optimization strategies; a minimal sketch of such a bandit loop follows this summary.
  • KernelBand's significance lies in its potential to reduce the training and inference costs of LLMs, which have become increasingly complex and resource-intensive. By improving kernel optimization, the framework aims to make advanced LLM capabilities more efficient and accessible to developers and researchers.
  • This development reflects a broader trend in artificial intelligence where optimizing model performance and resource utilization is critical. As LLMs continue to evolve, the integration of adaptive strategies and hardware awareness in optimization processes is becoming essential, paralleling advancements in related areas such as reinforcement learning, pruning techniques, and adaptive tool recommendations.
— via World Pulse Now AI Editorial System
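
To make the hierarchical bandit idea concrete, here is a minimal sketch, assuming a simple UCB1 policy at both levels: an upper bandit picks a cluster of optimization strategies (grouped by runtime behavior), a lower bandit picks a concrete strategy inside that cluster, and the reward is a profiling-derived speedup. The cluster names, strategy names, and the measure_speedup helper are illustrative placeholders, not details from the paper.

    import math
    import random

    class UCB1:
        """UCB1 bandit over a fixed set of arms."""
        def __init__(self, arms):
            self.arms = list(arms)
            self.counts = {a: 0 for a in self.arms}
            self.values = {a: 0.0 for a in self.arms}
            self.total = 0

        def select(self):
            # Play each arm once before applying the UCB rule.
            for a in self.arms:
                if self.counts[a] == 0:
                    return a
            return max(
                self.arms,
                key=lambda a: self.values[a]
                + math.sqrt(2 * math.log(self.total) / self.counts[a]),
            )

        def update(self, arm, reward):
            self.counts[arm] += 1
            self.total += 1
            self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

    # Hypothetical strategy clusters, e.g. grouped by observed runtime behavior.
    CLUSTERS = {
        "memory": ["vectorize_loads", "shared_memory_tiling"],
        "compute": ["loop_unroll", "instruction_reorder"],
        "parallelism": ["block_resize", "warp_specialization"],
    }

    def measure_speedup(strategy):
        # Placeholder for hardware profiling: would compile the candidate
        # kernel, run it, and return its speedup over the current best.
        return random.uniform(0.0, 1.0)

    upper = UCB1(CLUSTERS.keys())                      # picks a cluster
    lower = {c: UCB1(s) for c, s in CLUSTERS.items()}  # picks a strategy inside it

    for step in range(50):
        cluster = upper.select()
        strategy = lower[cluster].select()
        reward = measure_speedup(strategy)             # profiling-derived reward
        lower[cluster].update(strategy, reward)
        upper.update(cluster, reward)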


Continue Reading
QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation
Positive · Artificial Intelligence
The QiMeng-Kernel framework introduces a Macro-Thinking Micro-Coding paradigm aimed at enhancing the generation of high-performance GPU kernels for AI and scientific computing. This approach addresses the challenges of correctness and efficiency in existing LLM-based methods by decoupling optimization strategies from implementation details, thereby improving both aspects significantly.
Automating Deception: Scalable Multi-Turn LLM Jailbreaks
Neutral · Artificial Intelligence
A recent study has introduced an automated pipeline for generating large-scale, psychologically-grounded multi-turn jailbreak datasets for Large Language Models (LLMs). This approach leverages psychological principles like Foot-in-the-Door (FITD) to create a benchmark of 1,500 scenarios, revealing significant vulnerabilities in models, particularly those in the GPT family, when subjected to multi-turn conversational attacks.
Mosaic Pruning: A Hierarchical Framework for Generalizable Pruning of Mixture-of-Experts Models
Positive · Artificial Intelligence
A new framework called Mosaic Pruning (MoP) has been introduced to enhance the generalizability of Sparse Mixture-of-Experts (SMoE) models, addressing the limitations of existing pruning methods that often lead to performance degradation across different domains. MoP employs a structured 'cluster-then-select' process to create a comprehensive set of experts, significantly reducing the static memory overhead associated with loading all experts during inference.
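
The "cluster-then-select" idea can be illustrated with a short, generic sketch: cluster per-expert signatures (for example, flattened weights or averaged activation statistics) with k-means and keep the expert closest to each centroid. The use of k-means and the signature construction are assumptions for illustration, not details from the MoP paper.

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_then_select(expert_signatures, n_keep):
        """Generic 'cluster-then-select' pruning sketch.

        expert_signatures: (num_experts, feature_dim) array, e.g. flattened
        expert weights or averaged activation statistics.
        n_keep: number of experts to retain.
        Returns indices of the retained experts (the one closest to each
        cluster centroid).
        """
        km = KMeans(n_clusters=n_keep, n_init=10, random_state=0)
        labels = km.fit_predict(expert_signatures)
        keep = []
        for c in range(n_keep):
            members = np.where(labels == c)[0]
            dists = np.linalg.norm(
                expert_signatures[members] - km.cluster_centers_[c], axis=1
            )
            keep.append(int(members[np.argmin(dists)]))
        return sorted(keep)

    # Toy usage: 16 experts described by 8-dimensional signatures, keep 4.
    rng = np.random.default_rng(0)
    signatures = rng.normal(size=(16, 8))
    print(cluster_then_select(signatures, n_keep=4))
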
Jailbreaking and Mitigation of Vulnerabilities in Large Language Models
Positive · Artificial Intelligence
Recent research has highlighted significant vulnerabilities in Large Language Models (LLMs), particularly concerning prompt injection and jailbreaking attacks. This review categorizes various attack methods and evaluates defense strategies, including prompt filtering and self-regulation, to mitigate these risks.
Understanding and Optimizing Multi-Stage AI Inference Pipelines
Positive · Artificial Intelligence
The introduction of HERMES, a Heterogeneous Multi-stage LLM inference Execution Simulator, marks a significant advancement in optimizing inference pipelines for Large Language Models (LLMs). This tool addresses the limitations of existing simulators by accurately modeling diverse request stages, including Retrieval Augmented Generation (RAG) and key-value cache retrieval, across complex hardware architectures.
Profile-LLM: Dynamic Profile Optimization for Realistic Personality Expression in LLMs
Positive · Artificial Intelligence
A new framework called PersonaPulse has been introduced to optimize prompts for Large Language Models (LLMs), enhancing their ability to express realistic personality traits. This approach iteratively refines role-play prompts while using a situational response benchmark for evaluation, demonstrating improved performance over previous methods based on psychological personality descriptions.
Mixture of Attention Spans: Optimizing LLM Inference Efficiency with Heterogeneous Sliding-Window Lengths
Positive · Artificial Intelligence
A new framework called Mixture of Attention Spans (MoA) has been proposed to enhance the efficiency of Large Language Models (LLMs) by optimizing inference through heterogeneous sliding-window lengths. This approach addresses the limitations of existing methods that use a uniform window length, which fails to capture the diverse attention patterns in LLMs, particularly in long-context scenarios.
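
The heterogeneous-span idea can be illustrated by building per-head sliding-window attention masks with different window lengths; the concrete window sizes below are illustrative choices, not values from the MoA paper.

    import numpy as np

    def sliding_window_mask(seq_len, window):
        """Boolean causal mask letting each query attend to the most recent
        `window` tokens (including itself)."""
        idx = np.arange(seq_len)
        causal = idx[None, :] <= idx[:, None]           # no attention to the future
        recent = idx[:, None] - idx[None, :] < window   # within the window
        return causal & recent

    # Heterogeneous spans: each head gets its own window length.
    head_windows = [16, 64, 256, 1024]
    seq_len = 2048
    masks = np.stack([sliding_window_mask(seq_len, w) for w in head_windows])

    # Fraction of full causal attention kept per head: shorter windows mean
    # fewer key/value entries to read at inference time.
    full = sliding_window_mask(seq_len, seq_len).sum()
    print([round(m.sum() / full, 3) for m in masks])
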
A Systematic Study of Compression Ordering for Large Language Models
Positive · Artificial Intelligence
A systematic study has been conducted on compression ordering for large language models (LLMs), specifically focusing on the Qwen2.5 3B model. The research evaluates various compression techniques such as knowledge distillation, structured pruning, and low-bit quantization, analyzing their performance both independently and in combination. The findings indicate that quantization offers the highest standalone compression, while the sequence of techniques significantly impacts the final model quality.
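
A compression-ordering study can be organized as a small harness that applies the same steps in every permutation and scores each result. The stand-in steps and placeholder metric below are illustrative only; a real study would run actual distillation, pruning, and quantization routines on a model and evaluate it on held-out data, which is where ordering effects appear.

    from itertools import permutations

    # Stand-in compression steps; a real study would call actual
    # distillation, pruning, and quantization implementations.
    def distill(model):  return {**model, "distilled": True}
    def prune(model):    return {**model, "pruned": True}
    def quantize(model): return {**model, "bits": 4}

    def evaluate(model):
        # Placeholder quality metric (order-independent here); in practice,
        # perplexity or task accuracy of the compressed model.
        score = 1.0
        if model.get("distilled"): score *= 0.98
        if model.get("pruned"):    score *= 0.95
        if model.get("bits") == 4: score *= 0.97
        return score

    STEPS = {"distill": distill, "prune": prune, "quantize": quantize}
    base = {"name": "toy-llm"}

    for order in permutations(STEPS):
        m = dict(base)
        for step in order:
            m = STEPS[step](m)
        print(" -> ".join(order), "quality:", round(evaluate(m), 4))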