QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration
The deployment of mixture-of-experts (MoE) large language models (LLMs) presents significant challenges due to their high memory demands. These challenges become even more pronounced in multi-tenant environments, where shared resources must accommodate multiple models, limiting the effectiveness of conventional virtualization techniques. This paper addresses the problem of efficiently serving multiple fine-tuned MoE-LLMs on a single GPU. We propose a serving system that employs similarity-based expert consolidation to reduce the overall memory footprint by sharing similar experts across models. To preserve output quality, we introduce runtime partial reconfiguration, which dynamically replaces non-expert layers when processing requests from different models. As a result, our approach achieves competitive output quality and throughput comparable to serving a single model, with a negligible increase in time-to-first-token (TTFT). Experiments on a server with a single NVIDIA A100 GPU (80 GB) using Mixtral-8x7B models demonstrate an 85% average reduction in turnaround time compared to NVIDIA’s Multi-Instance GPU (MIG).
Introduction. Incorporating the Mixture-of-Experts concept (Jacobs et al., 1991; Jordan & Jacobs, 1994) into transformer-based language models (He et al., 2024) has enabled an expansion in model parameters, improving the quality of generated outputs across various language tasks (Shazeer et al., 2017; 2018; Fedus et al., 2022; Du et al., 2022; Kim et al., 2021; Zuo et al., 2021; Lin et al., 2021a; Rajbhandari et al., 2022). Rather than using a single block (e.g., a feed-forward layer), MoE-based models employ a gating system and multiple parallel instances of the block (known as experts) with identical structures. During the forward pass, the gating system assigns a weight to each expert based on the current input, and the one or more highest-weighted experts are selected to produce the block’s output. This allows the model to scale in size while adding negligible computational overhead (when only one expert is selected). However, such scaling comes with its own drawbacks.
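The gating mechanism described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names (softmax, moe_forward), the linear gate, the toy experts, and the renormalization of the selected weights are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_weights, experts, top_k=2):
    """Route input x through the top_k experts chosen by a linear gate.

    x            : input vector (list of floats)
    gate_weights : one weight row per expert (hypothetical linear gate)
    experts      : callables with identical input/output shapes
    Returns (output vector, indices of the selected experts).
    """
    # One gate logit per expert: dot product of its gate row with x.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(logits)

    # Keep only the top_k experts with the largest gate probability.
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Assumption: renormalize the selected probabilities so they sum to 1.
    norm = sum(probs[i] for i in top)

    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)  # only the selected experts are evaluated
        weight = probs[i] / norm
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, top


# Toy experts: each one scales its input by a different constant.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]

output, selected = moe_forward([1.0, 0.0], gate_weights, experts, top_k=2)
print(selected, output)
```

Only the selected experts are evaluated, so the per-token compute depends on top_k rather than on the total number of experts, while the memory footprint still grows with every expert that must stay resident.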
Discussion / Conclusion. In this paper, we presented an efficient serving system designed to address the challenges of deploying multiple fine-tuned MoE-based LLMs in single-GPU, resource-constrained settings. By leveraging similarity-based consolidation, our approach reduced the overall memory footprint by sharing similar experts across models, and runtime partial reconfiguration ensured that non-expert layers were dynamically swapped to maintain competitive output quality. Our evaluation, conducted on a server with a single NVIDIA A100 GPU, demonstrated the system’s ability to significantly reduce turnaround time compared to conventional virtualization approaches such as NVIDIA MIG, while achieving throughput comparable to single-model serving with only a minimal increase in TTFT. The results also showed that our approach maintained high output quality, as evidenced by strong performance across various benchmarks.