xLLM Technical Report


TL;DR

This paper proposes xLLM, an intelligent and efficient large language model inference framework. Through an innovative service-engine decoupled architecture, intelligent scheduling, and system-level collaborative optimization, xLLM is designed for high-performance, large-scale enterprise serving, addressing core challenges such as mixed workloads, low resource utilization, and poor hardware adaptability.

Key Definitions

This paper introduces or deeply applies several core concepts, motivated by the severe challenges that current mainstream large language model inference frameworks face in enterprise serving scenarios.

The xLLM framework proposed in this paper aims to systematically address these service-level and engine-level challenges, enabling efficient, intelligent, and reliable enterprise LLM inference services.

Method

The core of the xLLM framework is a service-engine decoupled architecture: xLLM-Service is responsible for intelligent scheduling and resource management, while xLLM-Engine is responsible for efficiently executing the inference computations.
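The division of labor described above can be sketched as two components behind a narrow interface: a service layer that queues and batches requests, and an engine layer that only executes them. This is a minimal illustrative sketch; the class and method names (`SchedulerService`, `InferenceEngine`, `submit`, `step`) are assumptions for exposition, not xLLM's actual API.

```python
# Hypothetical sketch of a service-engine split: scheduling decisions live
# entirely in the service; the engine only runs whatever batch it receives.
# All names here are illustrative, not xLLM's real interfaces.
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    request_id: str
    prompt: str


class InferenceEngine:
    """Engine side: executes inference for the batch it is handed."""

    def execute(self, batch: List[Request]) -> List[str]:
        # Placeholder for the real forward pass / decoding loop.
        return [f"output-for-{r.request_id}" for r in batch]


class SchedulerService:
    """Service side: admits requests, forms batches, dispatches to engines."""

    def __init__(self, engines: List[InferenceEngine], max_batch: int = 8):
        self.engines = engines
        self.max_batch = max_batch
        self.queue: List[Request] = []

    def submit(self, req: Request) -> None:
        self.queue.append(req)

    def step(self) -> List[str]:
        # Pop up to max_batch requests and dispatch them; because the engine
        # is behind this one call, scheduling policy can evolve independently.
        batch, self.queue = self.queue[: self.max_batch], self.queue[self.max_batch :]
        if not batch:
            return []
        engine = self.engines[0]  # trivial placement, just for the sketch
        return engine.execute(batch)
```

The key design point the sketch captures is that the engine exposes only an execution call, so the service can change batching or placement policy without touching engine code.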

[Figure illustration]

xLLM-Service

xLLM-Service is designed to achieve efficient, elastic, and highly available request scheduling and resource management. Its workflow is shown in the figure below and mainly includes request preprocessing, intelligent scheduling, and the resource layer.

[Figure illustration]

Its main innovations include:

Elastic Instance Pools

Instances in the cluster are divided into three elastic logical pools: the Prefill pool, the Decode pool, and the Encode pool designed for multimodal workloads. The instances themselves are stateless and can flexibly switch between different roles (such as handling Prefill or Decode tasks) according to the type of request being processed, without physical migration or restart, enabling dynamic resource scheduling.
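Because instances are stateless, moving capacity between the Prefill, Decode, and Encode pools reduces to relabeling an instance's role. The following is a minimal sketch of that idea under stated assumptions; the `Role` enum values mirror the three pools named above, but the class and method names are hypothetical, not xLLM's real API.

```python
# Illustrative sketch of elastic logical pools: switching a stateless
# instance's role is a label change, with no migration or restart.
# Names are assumptions for exposition only.
from enum import Enum, auto
from typing import Iterable, List


class Role(Enum):
    PREFILL = auto()
    DECODE = auto()
    ENCODE = auto()  # pool for multimodal encode workloads


class Instance:
    """A stateless worker: its role is just a label."""

    def __init__(self, instance_id: str, role: Role):
        self.instance_id = instance_id
        self.role = role

    def switch_role(self, new_role: Role) -> None:
        self.role = new_role  # instant: no state to move, nothing to restart


class ElasticPools:
    """Views the cluster as three logical pools derived from instance roles."""

    def __init__(self, instances: Iterable[Instance]):
        self.instances = list(instances)

    def pool(self, role: Role) -> List[Instance]:
        return [i for i in self.instances if i.role is role]

    def rebalance(self, instance_id: str, new_role: Role) -> None:
        # Move one instance between logical pools by relabeling it.
        for inst in self.instances:
            if inst.instance_id == instance_id:
                inst.switch_role(new_role)
                return
        raise KeyError(instance_id)
```

In this model a pool is not a fixed set of machines but a query over instance roles, which is what makes the pools "elastic": capacity shifts as fast as a label can be flipped.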

Intelligent Scheduling Policies

The scheduling layer includes three core policies, each targeting a different serving scenario.
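The concrete policies are not reproduced here, but to make the shape of such a policy concrete, the sketch below shows one plausible (and deliberately simple) stand-in: dispatching each request to the instance with the shortest outstanding queue. This is purely illustrative and is not claimed to be one of xLLM's actual policies.

```python
# Hypothetical load-aware dispatch policy, shown only to illustrate what a
# scheduling policy looks like: pick the least-loaded instance's queue.
from typing import Dict, List


def dispatch_shortest_queue(queues: Dict[str, List[str]], request_id: str) -> str:
    """Append request_id to the least-loaded instance's queue and return
    that instance's id. Ties break on dict iteration order."""
    target = min(queues, key=lambda inst: len(queues[inst]))
    queues[target].append(request_id)
    return target
```

A real scheduler would weigh richer signals than queue length (e.g. request type or expected output length), but the interface is the same: map an incoming request to an instance.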

Other Key Designs

xLLM-Engine

xLLM-Engine is responsible for executing the actual inference computations, extracting maximum performance from the hardware through coordinated system-level and algorithm-level optimizations.

System-level Optimizations

Algorithm-level Optimizations

Experimental Conclusions