ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism

arXiv — cs.LG • Wednesday, November 12, 2025 at 5:00:00 AM

The paper 'ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism' addresses the efficient serving of multimodal large language models (MLLMs). These models integrate data types such as images, video, and audio, which increases inference overhead and complicates processing pipelines. The proposed Elastic Multimodal Parallelism (EMP) dynamically allocates resources based on request type and inference stage: it separates requests into independent modality groups and decouples inference stages, so that each can scale adaptively and resources are better utilized. ElasticMM is reported to reduce time-to-first-token (TTFT) latency by up to 4.2x and to achieve 3.2x to 4.5x higher throughput compared to state-of-the-art serving syst…

— via World Pulse Now AI Editorial System
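
The allocation idea in the summary above (group requests by modality and inference stage, then scale each group separately) can be illustrated with a small sketch. This is not ElasticMM's scheduler: the `Request` type, the modality and stage labels, and the proportional-to-queue-length policy are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    modality: str  # hypothetical label, e.g. "text" or "image"
    stage: str     # hypothetical label, e.g. "encode", "prefill", "decode"


def group_by_modality_and_stage(requests):
    """Bucket requests by (modality, stage) so each bucket can scale on its own."""
    groups = defaultdict(list)
    for req in requests:
        groups[(req.modality, req.stage)].append(req)
    return dict(groups)


def allocate_gpus(groups, total_gpus):
    """Split a fixed GPU pool across buckets (assumed policy, not the paper's).

    Every non-empty bucket gets one GPU; each remaining GPU goes to the
    bucket that currently has the most queued requests per GPU.
    """
    if not groups:
        return {}
    if total_gpus < len(groups):
        raise ValueError("need at least one GPU per non-empty group")
    allocation = {key: 1 for key in groups}
    for _ in range(total_gpus - len(groups)):
        busiest = max(groups, key=lambda k: len(groups[k]) / allocation[k])
        allocation[busiest] += 1
    return allocation
```

Calling `allocate_gpus(group_by_modality_and_stage(requests), total_gpus)` again as the queue changes gives a crude form of elastic rebalancing; a real system would also have to account for the cost of moving model state between GPUs.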

Recommended Readings

arXiv — cs.CV • 3 days ago

AUVIC: Adversarial Unlearning of Visual Concepts for Multi-modal Large Language Models

Positive • Artificial Intelligence

The paper introduces AUVIC, a novel framework for adversarial unlearning of visual concepts in Multi-modal Large Language Models (MLLMs). This framework addresses data privacy concerns by enabling the removal of sensitive visual content without the need for extensive retraining. AUVIC utilizes adversarial perturbations to isolate target concepts while maintaining model performance on related entities. The study also presents VCUBench, a benchmark for evaluating the effectiveness of visual concept unlearning.

arXiv — cs.CV • 3 days ago

VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models

Positive • Artificial Intelligence

VP-Bench is a newly introduced benchmark designed to evaluate the ability of multimodal large language models (MLLMs) to interpret visual prompts (VPs) in images. This benchmark addresses a significant gap in existing evaluations, as no systematic assessment of MLLMs' effectiveness in recognizing VPs has been conducted. VP-Bench utilizes a two-stage evaluation framework, involving 30,000 visualized prompts across eight shapes and 355 attribute combinations, to assess MLLMs' capabilities in VP perception and utilization.

arXiv — cs.CL • 3 days ago

CyPortQA: Benchmarking Multimodal Large Language Models for Cyclone Preparedness in Port Operation

Neutral • Artificial Intelligence

The article discusses CyPortQA, a new multimodal benchmark designed to enhance cyclone preparedness in U.S. port operations. As tropical cyclones become more intense and forecasts less certain, U.S. ports face increased supply-chain risks. CyPortQA integrates diverse forecast products, including wind maps and advisories, to provide actionable guidance. It compiles 2,917 real-world disruption scenarios from 2015 to 2023, covering 145 principal U.S. ports and 90 named storms, aiming to improve the accuracy and reliability of multimodal large language models (MLLMs) in this context.

arXiv — cs.CV • 3 days ago

Hindsight Distillation Reasoning with Knowledge Encouragement Preference for Knowledge-based Visual Question Answering

Positive • Artificial Intelligence

The article presents a new framework called Hindsight Distilled Reasoning (HinD) with Knowledge Encouragement Preference Optimization (KEPO) aimed at enhancing Knowledge-based Visual Question Answering (KBVQA). This framework addresses the limitations of existing methods that rely on implicit reasoning in multimodal large language models (MLLMs). By prompting a 7B-size MLLM to complete reasoning processes, the framework aims to improve the integration of external knowledge in visual question answering tasks.

arXiv — cs.CV • 3 days ago

MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating Multimodal Large Language Models Understanding of Complex Image

Positive • Artificial Intelligence

MOSABench is a newly introduced evaluation dataset aimed at addressing the lack of standardized benchmarks for multi-object sentiment analysis in multimodal large language models (MLLMs). It comprises approximately 1,000 images featuring multiple objects, requiring MLLMs to evaluate the sentiment of each object independently. Key features of MOSABench include distance-based target annotation and an improved scoring mechanism, highlighting current limitations in MLLMs' performance in this complex task.

arXiv — cs.CL • 3 days ago

DomainCQA: Crafting Knowledge-Intensive QA from Domain-Specific Charts

Positive • Artificial Intelligence

DomainCQA is a proposed framework aimed at enhancing Chart Question Answering (CQA) by focusing on both visual comprehension and knowledge-intensive reasoning. Current benchmarks primarily assess superficial parsing of chart data, neglecting deeper scientific reasoning. The framework has been applied to astronomy, resulting in AstroChart, which includes 1,690 QA pairs across 482 charts. This benchmark reveals significant weaknesses in fine-grained perception, numerical reasoning, and domain knowledge integration among 21 Multimodal Large Language Models (MLLMs).

arXiv — cs.CV • 3 days ago

AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning

Neutral • Artificial Intelligence

AirCopBench is a new benchmark introduced to evaluate Multimodal Large Language Models (MLLMs) in multi-drone collaborative perception tasks. It addresses the lack of comprehensive evaluation tools for multi-agent systems, which outperform single-agent setups in terms of coverage and robustness. The benchmark includes over 14,600 questions across various task dimensions, such as Scene Understanding and Object Understanding, designed to assess performance under challenging conditions.
