MatFormer: Nested Transformer for Elastic Inference

Paper · arXiv 2310.07707 · Published October 11, 2023
Test-Time Compute · Novel LLM Architectures · LLM Architecture

Foundation models are applied in a broad spectrum of settings with different inference constraints, from massive multi-accelerator clusters to resource-constrained standalone mobile devices. However, the substantial costs associated with training these models often limit the number of unique model sizes that can be offered. Consequently, practitioners are compelled to select a model that may not be optimally aligned with their specific latency and cost requirements. We present MatFormer, a novel Transformer architecture designed to provide elastic inference across diverse deployment constraints. MatFormer achieves this by incorporating a nested Feed Forward Network (FFN) block structure within a standard Transformer model. During training, we optimize the parameters of multiple nested FFN blocks with varying sizes, enabling the extraction of hundreds of accurate smaller models without incurring additional computational costs. We empirically validate the efficacy of MatFormer across different model classes (decoders and encoders) and modalities (language and vision), demonstrating its potential for real-world deployment.
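To make the nested FFN idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that "nested" means each smaller FFN block uses a prefix of the hidden units of the next larger one, so all blocks share one set of weights. The class name `NestedFFN`, the GELU activation, the dimensions, and the list of widths are all hypothetical choices made for illustration.

```python
import math
import random
from typing import List


def gelu(x: float) -> float:
    # tanh approximation of GELU (the activation choice is an assumption)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


class NestedFFN:
    """Toy FFN whose first m hidden units form a smaller, self-contained FFN."""

    def __init__(self, d_model: int, d_ff: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.d_model, self.d_ff = d_model, d_ff
        # w_in[j] holds the input weights of hidden unit j; w_out[j] its output weights.
        self.w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
        self.w_out = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]

    def forward(self, x: List[float], m: int) -> List[float]:
        """Apply the sub-block made of hidden units 0..m-1 (m == d_ff is the full block)."""
        assert 1 <= m <= self.d_ff
        y = [0.0] * self.d_model
        for j in range(m):
            h = gelu(sum(w * xi for w, xi in zip(self.w_in[j], x)))
            for i in range(self.d_model):
                y[i] += h * self.w_out[j][i]
        return y


ffn = NestedFFN(d_model=4, d_ff=16)
granularities = [2, 4, 8, 16]  # hypothetical nested widths, each a prefix of the next
x = [1.0, -0.5, 0.25, 2.0]
outputs = {m: ffn.forward(x, m) for m in granularities}
```

Under this assumption, extracting a smaller model amounts to slicing the shared weight matrices, which is why no extra training is needed at extraction time. A training loop in this spirit would compute the loss through several of the widths so that every prefix remains accurate on its own. The exact training schedule is not specified in the text above.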

Introduction. Large foundation models [49, 45, 17] are deployed in a variety of settings with different compute and accuracy demands, such as real-time response on mobile phones or web-scale batch serving on multi-cluster GPUs. However, typical model families provide only a few independently trained models of different sizes. For example, the Llama-2 family provides models with 7B, 13B, 34B, and 70B parameters [59]. As a result, practitioners are often forced to choose a smaller (and typically less accurate) model than their latency/cost budget would allow. Alternatively, one can use compression or pruning to fit a bigger model into a given compute budget [19, 36, 53], but that requires additional training. We introduce MatFormer, a natively elastic Transformer [61] architecture that overcomes this challenge. MatFormer allows for training one universal model which can be used to extract hundreds of smaller submodels without any additional training cost (Figure 1).

Discussion / Conclusion. In this work we presented MatFormer, a natively elastic Transformer architecture that allows training a single universal model which can be used to extract hundreds of smaller accurate submodels at zero additional cost at deployment time. We find that the MatFormer Language Model (MatLM) matches the perplexity and 1-shot accuracy of independently trained models. In fact, MatLM demonstrates an interesting loss-vs-compute scaling curve that is nearly independent of trained granularity, indicating robust generalization to extremely large models as well. Finally, MatFormer submodels enable diverse inference-time speedups, such as faster autoregressive generation with speculative decoding and elastic query encoders for adaptive dense retrieval across modalities. We believe that dynamically routing these models to change inference latency [32, 40, 20] and developing the required hardware optimizations are promising areas for future work.
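The speculative decoding use case mentioned above can be illustrated with a generic greedy draft-then-verify loop. This is a sketch of the general technique, not necessarily the authors' procedure. In a MatFormer setting, the draft would presumably be a small extracted submodel and the target the universal model. Here both are hypothetical toy functions (`toy_target`, `toy_draft`) that map a token prefix to a next token, and the helper `speculative_greedy` is likewise an illustrative name.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # maps a token prefix to its greedy next token


def speculative_greedy(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    """Greedy draft-then-verify decoding; yields the same tokens as target alone."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1) The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2) The target checks each proposed position in order
        #    (a real system would score all positions in one batched pass).
        for guess in proposal:
            truth = target(tokens)
            if truth != guess:
                tokens.append(truth)  # first mismatch: keep the target's token, drop the rest
                break
            tokens.append(guess)
    return tokens


def toy_target(prefix: List[int]) -> int:
    # Hypothetical stand-in for the large model's greedy choice.
    return (31 * sum(prefix) + 7 * len(prefix)) % 50


def toy_draft(prefix: List[int]) -> int:
    # Hypothetical stand-in for a small submodel: agrees with the target most of the time.
    t = toy_target(prefix)
    return t if len(prefix) % 3 else (t + 1) % 50


generated = speculative_greedy(toy_target, toy_draft, prompt=[3, 1, 4], n_new=10)
```

Because every kept token is either confirmed or supplied by the target, the output matches plain greedy decoding with the target. The speedup in practice comes from verifying several drafted tokens in a single target pass, which this toy loop does not model.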