Efficient Nearest Neighbor Language Models

Paper · arXiv 2109.04212 · Published September 9, 2021
LLM Memory

Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore, which allows them to learn through explicitly memorizing the training datapoints. While effective, these models often require retrieval from a large datastore at test time, significantly increasing the inference overhead and thus limiting the deployment of non-parametric NLMs in practical applications. In this paper, we take the recently proposed k-nearest neighbors language model (Khandelwal et al., 2020) as an example, exploring methods to improve its efficiency along various dimensions. Experiments on the standard WikiText-103 benchmark and domain-adaptation datasets show that our methods are able to achieve up to a 6x speed-up in inference speed while retaining comparable performance. The empirical analysis we present may provide guidelines for future research seeking to develop or deploy more efficient non-parametric NLMs.1

Introduction. Language models (LMs) are one of the most fundamental technologies in NLP, with applications spanning text generation (Bahdanau et al., 2015; Rush et al., 2015), representation learning (Peters et al., 2018; Devlin et al., 2019; Yang et al., 2019), and few-shot learning (Radford et al., 2019; Brown et al., 2020). Modern neural language models (NLMs) based on recurrent (Mikolov et al., 2010; Sundermeyer et al., 2012) or selfattentional (Vaswani et al., 2017; Al-Rfou et al., 2019) neural networks are mostly parametric, where the predictions are solely dependent on the model parameters given the input data. In contrast, recent non-parametric LMs (Guu et al., 2018; Khandelwal et al., 2020; He et al.,

Discussion / Conclusion. In this paper, we explore several different ways to improve efficiencies of the k-nearest neighbors language model, achieving up to 6x speed-up while attaining comparable performance. As for future work, it is interesting to explore features from the datastore side to better know when to retrieve, and the gap between retrieval-based NLMs and parametric NLMs may be further reduced by combining more optimized indexing methods and the approaches in this paper.