Comparative Analysis of Large Language Model Inference Serving Systems: A Performance Study of vLLM and HuggingFace TGI

arXiv — cs.LG · Tuesday, November 25, 2025 at 5:00:00 AM
  • A recent study evaluates two open-source Large Language Model (LLM) serving frameworks, vLLM and HuggingFace Text Generation Inference (TGI), measuring throughput, latency, and resource utilization when deploying LLaMA-2 models. The findings indicate that vLLM can achieve up to 24 times higher throughput than TGI under high-concurrency workloads, while TGI delivers lower tail latencies for single-user interactions (a minimal load-test sketch follows this summary).
  • The analysis offers practical guidance for developers and organizations optimizing LLM deployment in production, mapping each framework's strengths and weaknesses to specific use cases and workloads.
  • These results reflect a broader trend in AI toward optimizing performance and resource efficiency in LLM serving systems. Related work on data-driven LLM-adapter serving pursues the same goal, maximizing throughput while avoiding request starvation, both of which are vital to user experience and operational efficiency.
— via World Pulse Now AI Editorial System
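
For readers who want to reproduce this kind of comparison, here is a minimal load-test sketch in Python. It is not the paper's benchmark harness: the endpoint URL, model name, prompt, and concurrency levels are placeholder assumptions, and it targets an OpenAI-compatible /v1/completions route (vLLM serves one natively; recent TGI releases expose a compatible API as well). It reports aggregate throughput plus median and p99 latency at each concurrency level, covering both regimes the study contrasts: single-user tail latency and high-concurrency throughput.

```python
import asyncio
import statistics
import time

import httpx

# Placeholder endpoint and request body; point these at the server under test.
URL = "http://localhost:8000/v1/completions"
PAYLOAD = {
    "model": "meta-llama/Llama-2-7b-hf",
    "prompt": "Explain paged attention in one paragraph.",
    "max_tokens": 128,
}

async def one_request(client: httpx.AsyncClient) -> float:
    """Send one completion request and return its end-to-end latency."""
    start = time.perf_counter()
    resp = await client.post(URL, json=PAYLOAD, timeout=120.0)
    resp.raise_for_status()
    return time.perf_counter() - start

async def run(concurrency: int, total: int) -> None:
    sem = asyncio.Semaphore(concurrency)
    async with httpx.AsyncClient() as client:
        async def bounded() -> float:
            async with sem:
                return await one_request(client)

        t0 = time.perf_counter()
        latencies = sorted(await asyncio.gather(*(bounded() for _ in range(total))))
        wall = time.perf_counter() - t0

    p99 = latencies[int(0.99 * (len(latencies) - 1))]
    print(f"concurrency={concurrency:3d}  throughput={total / wall:6.2f} req/s  "
          f"p50={statistics.median(latencies):.3f}s  p99={p99:.3f}s")

if __name__ == "__main__":
    # Sweep from a single-user regime to a high-concurrency regime.
    for c in (1, 8, 64):
        asyncio.run(run(concurrency=c, total=128))
```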

Continue Reading
Binary BPE: A Family of Cross-Platform Tokenizers for Binary Analysis
Positive · Artificial Intelligence
A new family of cross-platform tokenizers for binary analysis, named Binary BPE, has been introduced to address the limitations of byte-level tokenization in sequence models. These tokenizers, trained on a diverse corpus of binaries from various platforms including Linux, Windows, macOS, and Android, offer vocabularies ranging from 4K to 64K tokens, enhancing the efficiency of binary analysis.
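
As an illustration of the general approach, not the paper's released tooling, the sketch below trains a small BPE vocabulary over raw binaries with the Hugging Face tokenizers library. The file paths and the 16K vocabulary size are placeholder assumptions; the latin-1 round-trip is one lossless way to feed raw bytes through a text tokenizer pipeline.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

def binary_corpus(paths):
    """Yield raw bytes of each binary as latin-1 text, a lossless
    byte-to-codepoint mapping that text tokenizers can consume."""
    for path in paths:
        with open(path, "rb") as f:
            yield f.read().decode("latin-1")

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# 16K sits inside the paper's reported 4K-64K vocabulary range.
trainer = trainers.BpeTrainer(vocab_size=16384, special_tokens=["<pad>", "<eos>"])
tokenizer.train_from_iterator(binary_corpus(["/usr/bin/ls"]), trainer=trainer)

ids = tokenizer.encode(open("/usr/bin/ls", "rb").read().decode("latin-1")).ids
print(f"{len(ids)} tokens")  # far fewer than one token per byte
```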
Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch
Positive · Artificial Intelligence
A new study has introduced a framework for deterministic inference across varying tensor parallel sizes, addressing the issue of training-inference mismatch in large language models (LLMs). This mismatch arises from non-deterministic behaviors in existing LLM serving frameworks, particularly in reinforcement learning settings where different configurations can yield inconsistent outputs.
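
The toy example below (ours, not the paper's) illustrates the numerical root of such mismatches: floating-point addition is not associative, so an all-reduce that groups partial sums differently at different tensor-parallel sizes can produce slightly different results for the same input, which is enough to make greedy decoding diverge over long generations.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1 << 20).astype(np.float32)

def sharded_sum(values: np.ndarray, tp: int) -> np.float32:
    """Sum `values` the way a tensor-parallel all-reduce would:
    each of `tp` ranks sums its own shard, then the partial sums
    are combined. The grouping, and hence rounding, depends on tp."""
    partials = [np.float32(shard.sum(dtype=np.float32))
                for shard in np.array_split(values, tp)]
    total = np.float32(0.0)
    for p in partials:
        total = np.float32(total + p)
    return total

for tp in (1, 2, 4, 8):
    # The printed sums typically differ in the last bits across tp sizes.
    print(f"tp={tp}: sum={sharded_sum(x, tp):.10f}")
```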