arXiv:2511.07423v1 Announce Type: cross 
Abstract: Large Language Models (LLMs) are becoming key components of mobile operating systems, powering smart applications such as interactive chatbots and personal assistants. While they bring enhanced intelligence to mobile devices, their deployment faces performance challenges, notably degraded generation quality and prolonged latency. Prior work has mainly relied on cloud offloading or on-device Small Language Models (SLMs). However, the former is usually limited by the communication bottleneck, and the latter sacrifices generation quality due to resource constraints. To mitigate these limitations, this paper proposes Synera, a device-cloud synergistic LLM serving system built on an efficient SLM-LLM synergistic mechanism. Through empirical studies of LLMs' unique computing characteristics, Synera identifies a set of underexplored optimization opportunities in device-cloud synergistic LLM inference, including offloading decisions, pipeline stalls, and batching bottlenecks. To translate them into improved performance, Synera introduces tailored designs for communication-efficient selective offloading, stall-free parallel inference, and scalable cloud batching. Extensive evaluations on real-world testbeds show that Synera achieves 1.20-5.47x better generation quality than competitive baselines with on-par latency. Compared with existing cloud serving, Synera achieves 8.2-16.5% lower cloud serving cost on various benchmarks.
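
The abstract names selective offloading as one of Synera's designs but does not say how the offloading decision is made. As a rough illustration of the general idea only, the sketch below uses a common heuristic: keep a token on the device when the SLM is confident, and defer to the cloud LLM otherwise. The function names, the use of top-token probability as the confidence signal, and the 0.6 threshold are all assumptions made for this example, not details taken from the paper.

```python
import math
from typing import List


def token_confidence(logits: List[float]) -> float:
    """Return the softmax probability of the most likely token."""
    peak = max(logits)
    # Subtract the peak before exponentiating for numerical stability.
    exps = [math.exp(x - peak) for x in logits]
    return max(exps) / sum(exps)


def should_offload(logits: List[float], threshold: float = 0.6) -> bool:
    """Hypothetical rule: offload when the on-device SLM is unsure.

    `threshold` is an illustrative value, not one reported for Synera.
    """
    return token_confidence(logits) < threshold


if __name__ == "__main__":
    confident_step = [5.0, 0.0, 0.0]   # one token clearly dominates
    uncertain_step = [1.0, 1.0, 1.0]   # uniform distribution over tokens
    print(should_offload(confident_step))
    print(should_offload(uncertain_step))
```

A rule of this shape sends only the uncertain steps across the network, which is the kind of trade-off between communication cost and generation quality that the abstract describes.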

Synera is a proposed device-cloud synergistic LLM serving system for mobile operating systems. It targets degraded generation quality and prolonged latency through communication-efficient selective offloading, stall-free parallel inference, and scalable cloud batching. In evaluations on real-world testbeds, Synera achieves 1.20-5.47x better generation quality than competitive baselines at on-par latency, and 8.2-16.5% lower cloud serving cost than existing cloud serving.

Synera: Synergistic LLM Serving across Device and Cloud at Scale
